Commit graph

11997 commits

Author SHA1 Message Date
dependabot[bot]
4f2d454211
Bump Sixlabors.ImageSharp from 2.1.1 to 2.1.7 in /csharp/sample/Microsoft.ML.OnnxRuntime.FasterRcnnSample (#19806)
Bumps [Sixlabors.ImageSharp](https://github.com/SixLabors/ImageSharp)
from 2.1.1 to 2.1.7.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/SixLabors/ImageSharp/releases">Sixlabors.ImageSharp's
releases</a>.</em></p>
<blockquote>
<h2>v2.1.7</h2>
<h2>What's Changed</h2>
<ul>
<li>[release/2.1] Disallow allocation attempts of unrepresentable sizes
by <a
href="https://github.com/antonfirsov"><code>@​antonfirsov</code></a> in
<a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2553">SixLabors/ImageSharp#2553</a></li>
<li>[release/2.1] Tiff decoding robustness improvements (<a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2550">#2550</a>)
by <a
href="https://github.com/antonfirsov"><code>@​antonfirsov</code></a> in
<a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2554">SixLabors/ImageSharp#2554</a></li>
<li>[release/2.1] PBM decoder robustness improvements and
BufferedReadStream observability by <a
href="https://github.com/antonfirsov"><code>@​antonfirsov</code></a> in
<a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2555">SixLabors/ImageSharp#2555</a></li>
<li>Backport 2681 by <a
href="https://github.com/JimBobSquarePants"><code>@​JimBobSquarePants</code></a>
in <a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2688">SixLabors/ImageSharp#2688</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/SixLabors/ImageSharp/compare/v2.1.6...v2.1.7">https://github.com/SixLabors/ImageSharp/compare/v2.1.6...v2.1.7</a></p>
<h2>v2.1.6</h2>
<h2>What's Changed</h2>
<ul>
<li>Backport - Handle EOF in Jpeg bit reader when data is bad to prevent
DOS attack. by <a
href="https://github.com/JimBobSquarePants"><code>@​JimBobSquarePants</code></a>
in <a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2524">SixLabors/ImageSharp#2524</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/SixLabors/ImageSharp/compare/v2.1.5...v2.1.6">https://github.com/SixLabors/ImageSharp/compare/v2.1.5...v2.1.6</a></p>
<h2>v2.1.5</h2>
<h2>What's Changed</h2>
<ul>
<li>Backport <a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2501">#2501</a>
by <a
href="https://github.com/JimBobSquarePants"><code>@​JimBobSquarePants</code></a>
in <a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2509">SixLabors/ImageSharp#2509</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/SixLabors/ImageSharp/compare/v2.1.4...v2.1.5">https://github.com/SixLabors/ImageSharp/compare/v2.1.4...v2.1.5</a></p>
<h2>v2.1.4</h2>
<h2>What's Changed</h2>
<ul>
<li>Backport WebP fix to 2.1 by <a
href="https://github.com/antonfirsov"><code>@​antonfirsov</code></a> in
<a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2420">SixLabors/ImageSharp#2420</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/SixLabors/ImageSharp/compare/v2.1.3...v2.1.4">https://github.com/SixLabors/ImageSharp/compare/v2.1.3...v2.1.4</a></p>
<h2>v2.1.3</h2>
<h2>What's Changed</h2>
<ul>
<li>V2 Backport: 2133, 2154 by <a
href="https://github.com/JimBobSquarePants"><code>@​JimBobSquarePants</code></a>
in <a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2157">SixLabors/ImageSharp#2157</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/SixLabors/ImageSharp/compare/v2.1.2...v2.1.3">https://github.com/SixLabors/ImageSharp/compare/v2.1.2...v2.1.3</a></p>
<h2>v2.1.2</h2>
<h2>What's Changed</h2>
<ul>
<li>Backport - Issue 2123 by <a
href="https://github.com/JimBobSquarePants"><code>@​JimBobSquarePants</code></a>
in <a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2126">SixLabors/ImageSharp#2126</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/SixLabors/ImageSharp/compare/v2.1.1...v2.1.2">https://github.com/SixLabors/ImageSharp/compare/v2.1.1...v2.1.2</a></p>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="fa7d712702"><code>fa7d712</code></a>
Merge pull request <a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2688">#2688</a>
from SixLabors/js/backport-2681</li>
<li><a
href="36b3533cc3"><code>36b3533</code></a>
Use correct property to disable upstream warnings.</li>
<li><a
href="94bb7615a1"><code>94bb761</code></a>
Update ImageSharp.csproj</li>
<li><a
href="3ea2574726"><code>3ea2574</code></a>
Update PngDecoderCore.cs</li>
<li><a
href="e74a55fbfd"><code>e74a55f</code></a>
[release/2.1] PBM decoder robustness improvements and BufferedReadStream
obse...</li>
<li><a
href="749b1c04d7"><code>749b1c0</code></a>
[release/2.1] Tiff decoding robustness improvements (<a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2550">#2550</a>)
(<a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2554">#2554</a>)</li>
<li><a
href="3064b78927"><code>3064b78</code></a>
Merge pull request <a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2553">#2553</a>
from SixLabors/backport/2.1.x/2545</li>
<li><a
href="f36ec12695"><code>f36ec12</code></a>
Disallow allocation attempts of unrepresentable sizes </li>
<li><a
href="688e242a84"><code>688e242</code></a>
Merge pull request <a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2524">#2524</a>
from SixLabors/js/backport-fix-jpeg-dos</li>
<li><a
href="0f17a8be9c"><code>0f17a8b</code></a>
Handle EOF in Jpeg bit reader when data is bad to prevent DOS
attack.</li>
<li>Additional commits viewable in <a
href="https://github.com/SixLabors/ImageSharp/compare/v2.1.1...v2.1.7">compare
view</a></li>
</ul>
</details>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-04-05 08:32:18 -07:00
Edward Chen
2b3071119a
Add onnxruntime/test/run_benchmark.py helper script. (#19234)
### Description
Add onnxruntime/test/run_benchmark.py helper script to repeat benchmark
runs until a target coefficient of variation is reached. It works with
[Google Benchmark](https://github.com/google/benchmark) programs like
`onnxruntime_mlas_benchmark`.

### Motivation and Context
Sometimes there is variability in benchmark run results. This automates
the repeated running needed to get results that are stable enough.
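
As a rough illustration of the idea described above (not the actual contents of `run_benchmark.py`), the stopping rule can be sketched as follows. The function names, the default thresholds, and the `run_once` callback are assumptions made for this example.

```python
import statistics
from typing import Callable, List


def coefficient_of_variation(samples: List[float]) -> float:
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(samples) / statistics.mean(samples)


def run_until_stable(
    run_once: Callable[[], float],  # hypothetical: runs the benchmark, returns one timing
    target_cv: float = 0.05,        # assumed default, not taken from the script
    min_runs: int = 3,
    max_runs: int = 20,
) -> List[float]:
    """Call run_once() until the timings are stable enough.

    Stops when the coefficient of variation is at or below target_cv,
    or after max_runs calls, whichever comes first.
    """
    samples: List[float] = []
    while len(samples) < max_runs:
        samples.append(run_once())
        if len(samples) >= min_runs and coefficient_of_variation(samples) <= target_cv:
            break
    return samples
```

In practice `run_once` would launch the benchmark program and parse one timing from its output; here it is left abstract.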
2024-04-05 07:02:01 -07:00
Hans
6abfb6b928
[js/rn] Support load external data (#20090)
Support loading external data by passing a local model path.
2024-04-05 05:55:03 -07:00
Scott McKay
f61cca1b8f
NNAPI: Improve MatMul diagnostic output (#19721)
### Description
<!-- Describe your changes. -->
Re-order so that we don't get two messages for the one node.

Currently the batched matmul 'not supported' message appears even for
valid 2D input, which is confusing.

Change the order so we only check if batched matmul can be used when the
input ranks are > 3, as that is one of the requirements.

c311d1faf5/onnxruntime/core/providers/nnapi/nnapi_builtin/builders/op_builder_helpers.cc (L257-L264)
2024-04-04 21:58:39 -07:00
Thomas Boby
254bdbb19d
OneDNN/dnnl: Fix filepath after dnnl move (#20086)
### Description
This adjusts the path used in the nuget script for dnnl to the new
location of the file.

There isn't a CI pipeline for this as far as I can tell, and I can't
easily confirm this change works on master, so please check.

### Motivation and Context
It is currently not possible to build onednn nuget packages. It's
possible that the correct action would be to move the file rather than
fix this path, but I'm not familiar enough with the repository layout.

---------

Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
2024-04-04 21:24:49 -07:00
Yi Zhang
4ea54b82f9
[Fix] Upload training CUDA daily wheel (#20183)
2024-04-03 13:18:26 +08:00
Andrew Fantino
7303a90f49
Fix build errors from date/date.h C++20 compatibility (#20139)
### Description
For C++ standards >= 20, use `std::chrono::operator<<` in place of
`date::operator<<` to fix ambiguous operator compile error.

### Motivation and Context
The external dependency HowardHinnant/date has a conflict with
std::chrono for >=C++20.
Solves #20137
2024-04-02 22:10:25 -07:00
Yi Zhang
dae77e6014
Support building Windows CUDA with Ninja (#20176)
### How to run it locally
1. conda install ninja
2. "C:\Program Files\Microsoft Visual
Studio\2022\Enterprise\VC\Auxiliary\Build\vcvarsall.bat" x64
3. python.exe {ort_repo}\tools\ci_build\build.py --config RelWithDebInfo
--build_dir {ort_repo}\build_cuda --skip_submodule_sync --build_csharp
--update --parallel --cmake_generator "Ninja" --build_shared_lib
--enable_onnx_tests --enable_pybind --build_java --build_nodejs
--use_cuda "--cuda_home=C:\Program Files\NVIDIA GPU Computing
Toolkit\CUDA\v11.8" --enable_cuda_profiling --cmake_extra_defines
CMAKE_CUDA_ARCHITECTURES=60
4. cd build_cuda\RelWithDebInfo
5. cmake --build . -j16

### Motivation and Context
In the packaging pipelines, building with CUDA on Windows intermittently
takes too long, even though moving the build to a CPU machine has
already reduced the time considerably.
We plan to build with Ninja instead of MSBuild in the packaging
pipelines so that nvcc can run in parallel.
Supporting it locally is the first step.
2024-04-03 11:19:31 +08:00
Yulong Wang
fa1917b81b
[js/webgpu] add validation to workgroup size (#20110)
### Description
add validation to workgroup size in `shaderHelper.mainStart()`.
2024-04-02 19:29:20 -07:00
Shubham Bhokare
be831e1ba3
Export of Openai Whisper with batched prompts (#19854)
Adds an example demonstrating the export of the OpenAI Whisper
implementation with batch_size > 1 and the addition of prompts for each
audio snippet.

Also handles the case where prompts differ in length. For
example, if our prompt ids are [p1_id_1, p1_id_2] and [p2_id_1], the
final decoder_input_ids will look like this after padding:
`[prev_token, p1_id_1, p1_id_2, start_token, lang_token,
transcribe_token]
[prev_token, p2_id_1, PAD_TOKEN, start_token, lang_token,
transcribe_token]`
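
The padding layout above can be sketched in a few lines. This is only an illustration of the described layout, not the code from the PR; the function name and the integer token values used with it are hypothetical.

```python
from typing import List


def build_decoder_input_ids(
    prompts: List[List[int]],
    prev_token: int,
    start_token: int,
    lang_token: int,
    transcribe_token: int,
    pad_token: int,
) -> List[List[int]]:
    """Pad each prompt to the longest one, then append the fixed suffix.

    Row layout: [prev_token, prompt..., PAD..., start, lang, transcribe].
    """
    longest = max(len(prompt) for prompt in prompts)
    rows = []
    for prompt in prompts:
        padding = [pad_token] * (longest - len(prompt))
        rows.append(
            [prev_token, *prompt, *padding, start_token, lang_token, transcribe_token]
        )
    return rows
```

Every row ends up with the same length, which is what allows the prompts to be batched.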

---------

Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
2024-04-02 17:01:48 -07:00
Rachel Guo
19793de1b3
#19921 [Dup] LLC Core count calculations updated (#20171)
### Description
<!-- Describe your changes. -->

See #19921. This PR only addresses one comment:
https://github.com/microsoft/onnxruntime/pull/19921#discussion_r1543398640

Because #19921 comes from an external branch, a separate pull request
is needed for this.


---------

Co-authored-by: Sai Kishan Pampana <sai.kishan.pampana@intel.com>
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: Jian Chen <cjian@microsoft.com>
2024-04-02 16:53:47 -07:00
Dmitri Smirnov
12e2538065
Add new SessionOptions config entry to disable specific transformers and rules (#20135)
### Description
<!-- Describe your changes. -->

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Certain transformers slow down session loading time while providing no
runtime perf benefits.
Allow clients to exclude them.
2024-04-02 16:33:05 -07:00
Chi Lo
e916929371
[TensorRT EP] Address compiler warnings on Windows (#20134)
A previous [PR](https://github.com/microsoft/onnxruntime/pull/19663)
changed the MSVC compiler warning level from
set_msvc_c_cpp_compiler_warning_level(3) to
set_msvc_c_cpp_compiler_warning_level(4) when using the CUDA EP (this
also applies to the TRT EP).
Some warnings still need to be addressed in the TRT EP code.
2024-04-02 10:39:46 -07:00
Xu Xing
a2998e5d42
[js/webgpu] Use global id in attention and instance-norm (#20008)
2024-04-02 01:42:39 -07:00
Adam Pocock
262b6bd3b7
[java][DML EP] Modifying dml_provider_factory.h so it can compile as a C header file (#20157)
### Description
The dml_provider_factory header file can't be used in C programs as it
defines C++ inline operators. This PR rearranges that header file so
that it looks like valid C when used from C, and also makes a couple of
small modifications to the Java code so it correctly binds to the DML EP
at build time.

I'm having some difficulty testing it as I think it's pulling in the old
version of DirectML on my computer and I can't figure out what the
library loading path is in Java to make it look at the recent version I
downloaded. So the test I added fails with:

```
InferenceTest > testDirectML() FAILED
    ai.onnxruntime.OrtException: Error code - ORT_RUNTIME_EXCEPTION - message: Exception during initialization: <path-to-ort>\onnxruntime\core\providers\dml\DmlExecutionProvider\src\AbiCustomRegistry.cpp(518)\onnxruntime.dll!00007FFF74819333: (caller: 00007FFF74793509) Exception(3) tid(4f58) 80070057 The parameter is incorrect.
        at app//ai.onnxruntime.OrtSession.createSession(Native Method)
        at app//ai.onnxruntime.OrtSession.<init>(OrtSession.java:74)
        at app//ai.onnxruntime.OrtEnvironment.createSession(OrtEnvironment.java:236)
        at app//ai.onnxruntime.OrtEnvironment.createSession(OrtEnvironment.java:221)
        at app//ai.onnxruntime.InferenceTest.openSessionSqueezeNet(InferenceTest.java:1961)
        at app//ai.onnxruntime.InferenceTest.runProvider(InferenceTest.java:665)
        at app//ai.onnxruntime.InferenceTest.testDirectML(InferenceTest.java:657)
```

But it does correctly compile, and this error seems very similar to
other issues with the DML provider when it doesn't like a model due to
the loaded library being old. The test is using the squeezenet file
that's been in the repo since 2019. If someone can help me figure out
how to get the right version of DML in the library path I can test it
more on my end. I tried adding the folder with the new version into the
system path, but I'm not very familiar with Windows' library loading
behaviour.

### Motivation and Context
Fixes #19656 to allow use of the DirectML EP from ORT Java.

cc @martinb35
2024-04-01 21:58:50 -07:00
Xiaoyu
3979f53aa4
Update api backward compatibility (#20136)
### Description
Update api backward compatibility

### Motivation and Context
Update api backward compatibility
2024-04-01 21:37:56 -07:00
wangshuai09
3e2b659fce
[CANN] Add dump_om_model flag (#20075)
### Description
Adds a new `dump_om_model` flag for the **CANN EP**; it defaults to "True".

### Motivation and Context
When building an ONNX model with the CANN EP, the intermediate **OM
(offline model for Ascend NPU)** is automatically saved. Some users
don't want to dump the OM when resources are limited.
This PR resolves this with `dump_om_model=False`.
2024-04-01 21:35:29 -07:00
Dhruv Matani
742d413586
Fix bug related to export failure for DynamicQuantizeLSTM [issue 15465] (#20160)
### Description

See issue 15465: https://github.com/microsoft/onnxruntime/issues/15465

This PR just applies the workaround suggested in the thread that I and
numerous others on the thread have validated to work for them and allows
them to successfully export a PyTorch model with LSTM layers that are
dynamically quantized by ONNX.



### Motivation and Context

It is not possible to successfully export a dynamically quantized LSTM
model that I have trained for use in the onnx runtime without this
change.

Currently, this workaround lives as a local change in my python package
directory, and makes it basically impossible for anyone else at the
place I work at to successfully export the quantized model that I am
exporting.


See issue 15465: https://github.com/microsoft/onnxruntime/issues/15465

Co-authored-by: Dhruv Matani <dhruv.matani@grammarly.com>
2024-04-01 21:33:00 -07:00
Yufeng Li
91654988fd
optimize threading of mha (#20088)
### Description
<!-- Describe your changes. -->
The cost computation of ComputeVxAttentionScore is wrong. It should be
sequence_length * v_head_size * total_sequence_length instead of
sequence_length * v_head_size * sequence_length.

The PR also fine-tuned the cost computation.

On my local box with an i9 CPU, the performance is the same as the
unfused version, but it is much faster on an Azure VM with 16 threads.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

https://github.com/microsoft/onnxruntime/issues/19924
2024-04-01 21:32:36 -07:00
Atanas Dimitrov
9d06e1bfa4
Label encoder fusion (#19761)
### Description
Created a new `LabelEncoderFusion` pass. This is useful for models
produced by automatic conversion tools related to data science:
sometimes the produced model contains consecutive `LabelEncoder`-s.
To merge 2 `LabelEncoder`-s the optimizer propagates the outputs of the
first encoder through the second one.


### Motivation and Context
This enhances the capabilities of the `onnxruntime::optimizer` by fusing
consecutive `LabelEncoder` nodes.


### Fusion examples
```
Applying fusion
node1: (a,C) (b,B) (c,A) -> Default: _Unused
node2: (A,1) (B,2) (C,3) -> Default: -1
fused: (a,3) (b,2) (c,1) -> Default: -1
Applying fusion
node1: (a,C) (b,B) (c,A) -> Default: D
node2: (A,a) (B,b) (C,c) (D,d) -> Default: default
fused: (a,c) (b,b) (c,a) -> Default: d
Applying fusion
node1: (a,0) (b,1) (c,2) -> Default: -1
node2: (2,a) (1,b) (0,c) -> Default: default
fused: (a,c) (b,b) (c,a) -> Default: default
Applying fusion
node1: (a,3) (b,2) (c,1) -> Default: -1
node2: (1,a) (2,b) (3,c) -> Default: d
fused: (a,c) (b,b) (c,a) -> Default: d
```
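
The fusion rule shown in these examples can be reproduced with a small sketch. This is an illustration in Python, not the optimizer's C++ implementation; the function name and the dict-based representation of an encoder are assumptions.

```python
from typing import Dict, Hashable, Tuple


def fuse_label_encoders(
    first_map: Dict[Hashable, Hashable],
    first_default: Hashable,
    second_map: Dict[Hashable, Hashable],
    second_default: Hashable,
) -> Tuple[Dict[Hashable, Hashable], Hashable]:
    """Propagate the outputs of the first encoder through the second one."""

    def through_second(value: Hashable) -> Hashable:
        # Values unknown to the second encoder fall back to its default.
        return second_map.get(value, second_default)

    fused_map = {key: through_second(value) for key, value in first_map.items()}
    # The first encoder's default is also an output, so it is propagated too.
    fused_default = through_second(first_default)
    return fused_map, fused_default
```

Applied to the second example above, the first encoder's default `D` is itself a key of the second encoder, so the fused default becomes `d`.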

---------

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2024-04-01 09:41:10 -07:00
Yi Zhang
523ef04240
enable lto in Python-CUDA-Packaging Pipeline (#20164)
### Description
All Windows CUDA packaging jobs except the [Python-CUDA-Packaging
pipeline](https://dev.azure.com/aiinfra/Lotus/_build?definitionId=1299&_a=summary)
are now running well.
A comparison showed that enable_lto is not set in that pipeline, which
might be one root cause of the random hang.


2024-04-01 15:42:28 +08:00
Sumit Agarwal
e1e292f94c
[DML EP] DML Graph Serialization Bug (#19748)
### Description
This pull request addresses several issues:

- The DML Graph's nodes were not sorted in a topologically ordered
sequence, leading to crashes during deserialization when a child node
preceded its parent node. This PR resolves this issue by implementing a
topological sorting algorithm before serialization.

- During the `RemoveUnconnectedNodes` process:
- we update `intermediateEdge.FromNodeIndex`. Additionally, we must
update `intermediateEdge.Name` when it includes
`intermediateEdge.FromNodeIndex`, as serialization/deserialization
heavily relies on edge names.

- we also eliminate unused edges. Consequently, we must erase inputs
(now unused) from corresponding maps
`serializedGraphInputIndexToSubgraphInputIndex` and
`serializedGraphLargeConstantNameToSubgraphInputIndex`.
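
The ordering requirement from the first point (every parent node must precede its children before serialization) can be sketched with Kahn's algorithm. This Python sketch is only an illustration; the PR does not state which sorting algorithm it uses, and the node and edge representation here is hypothetical.

```python
from collections import deque
from typing import Dict, List, Tuple


def topological_order(node_count: int, edges: List[Tuple[int, int]]) -> List[int]:
    """Return node indices so that every parent precedes its children.

    edges holds (parent_index, child_index) pairs.
    """
    children: Dict[int, List[int]] = {n: [] for n in range(node_count)}
    pending_parents = [0] * node_count
    for parent, child in edges:
        children[parent].append(child)
        pending_parents[child] += 1

    # Start from nodes that have no parents at all.
    ready = deque(n for n in range(node_count) if pending_parents[n] == 0)
    order: List[int] = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for child in children[node]:
            pending_parents[child] -= 1
            if pending_parents[child] == 0:
                ready.append(child)

    if len(order) != node_count:
        raise ValueError("graph contains a cycle")
    return order
```

Serializing nodes in the returned order means a deserializer never meets a child before its parent.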


### Motivation and Context
A few public ONNX Zoo models were crashing during
deserialization.

---------

Co-authored-by: Jeff Bloomfield <38966965+jeffbloo@users.noreply.github.com>
2024-03-31 14:41:42 -07:00
kunal-vaishnavi
a0ebd5fee5
Add flash attention v2 and INT4 CUDA for LLaMA E2E benchmarking (#20149)
### Description
This PR adds flash attention v2 and support for INT4 CUDA benchmarking
in PyTorch.

### Motivation and Context
The [flash attention v2](https://github.com/Dao-AILab/flash-attention)
algorithm helps improve model performance in PyTorch. Support for INT4
CUDA in PyTorch is done through the
[`bitsandbytes`](https://github.com/TimDettmers/bitsandbytes) package.
2024-03-29 23:09:37 -07:00
mo-ja
00244ea143
fix quantization errors of ConvTranspose with per_channel=True (#19996)
### Description
<!-- Describe your changes. -->
 - update axis value for per_channel quantization of QDQConv
   - we should use `axis=1` for ConvTranspose operator.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
- this PR fixes https://github.com/microsoft/onnxruntime/issues/19694,
which I have opened
2024-03-29 21:36:15 -07:00
Ye Wang
f3a864217f
Fix MoE tensor parallelism tests (#20147)
### Description
<!-- Describe your changes. -->
Previously the expert weights were stored in row-major order. With the
updated cutlass extension introduced by
https://github.com/microsoft/onnxruntime/pull/20108, weights are stored
in column-major order, which aligns with the PyTorch implementation.
This change fixes the way the tensors are sliced across shards.


2024-03-29 16:10:09 -07:00
Jeff Bloomfield
2f31560430
Enable generic feature level devices in DML EP (#20114)
### Description
Enable NPUs supporting DXCORE_ADAPTER_ATTRIBUTE_D3D12_GENERIC_ML and
D3D_FEATURE_LEVEL_1_0_GENERIC with DML EP. This also begins ingesting DX
headers through the DirectX-Headers repo.

Note that this includes an update to cgamanifest.json for onnx-tensorrt
which is triggered during re-generation due to prior changes to
deps.txt.

2024-03-29 14:37:30 -07:00
cao lei
604b284261
add API function GetAliasMap and ReleaseAliasMap in OrtCustomOp (#20145)
### Description
<!-- Describe your changes. -->
Add API function GetAliasMap and ReleaseAliasMap in OrtCustomOp 


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Add API function GetAliasMap and ReleaseAliasMap in OrtCustomOp
2024-03-29 13:49:56 -07:00
inisis
8396845806
fix shape inference bug (#19848)
### Description
For nodes like Add, the inputs should be merged dynamically.

### Motivation and Context
During shape inference for nodes like Add, `_onnx_infer_single_node` currently generates the inputs from the last node's output, but they should be merged.
2024-03-29 13:06:27 -07:00
Adrian Lizarraga
b1a5eb255e
[Quant] Fix accuracy_level config option for MatMul 4bits quantizer (#20146)
### Description
Fixes code that extracts the accuracy level when creating a MatMulNBits
node in the `DefaultWeightOnlyQuantizer` class.


### Motivation and Context
Error from line 443: `AttributeError: 'DefaultWeightOnlyQuantizer'
object has no attribute 'accuracy_level'`. The solution is to access
`self.config.accuracy_level` instead of `self.accuracy_level`.
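
The shape of the fix can be illustrated with a reduced example. The class and field names below are hypothetical stand-ins, not the real `DefaultWeightOnlyQuantizer`; the sketch only shows why the option has to be read from the config object.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WeightOnlyConfig:  # hypothetical stand-in for the quantizer's config
    block_size: int = 128
    accuracy_level: Optional[int] = None


class WeightOnlyQuantizer:  # hypothetical stand-in for the quantizer class
    def __init__(self, config: WeightOnlyConfig) -> None:
        # Options live on the config object, not on the quantizer itself.
        self.config = config

    def node_attributes(self) -> dict:
        attrs = {"bits": 4, "block_size": self.config.block_size}
        # Reading self.accuracy_level here would raise AttributeError.
        if self.config.accuracy_level is not None:
            attrs["accuracy_level"] = self.config.accuracy_level
        return attrs
```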

Relevant commit: https://github.com/microsoft/onnxruntime/pull/19106
2024-03-29 11:54:55 -07:00
Ye Wang
17919717b5
add QMoE (#20108)
### Description
<!-- Describe your changes. -->
1. Introduce the latest cutlass extension from TRTLLM, which gives us
the opportunity to upgrade cutlass (to 3.4) from the MoE side.
2. Fix a Windows build issue.
3. Add the Int4 MoE op and unit tests.



2024-03-29 10:24:19 -07:00
pengwa
2092bebc78
Fix transformer layer detection for recompute (#20106)
### Fix transformer layer detection for recompute

The original logic failed to detect the layer boundary node in the
Mistral model. This PR simplifies the search by using stronger pattern
matching, so that it is flexible enough to cover different transformer
variants.

Also adds a UT.

Adds a warning when the user enables layerwise recompute but no layer
boundary nodes are found.
2024-03-29 17:44:38 +08:00
cao lei
2a184ac1a1
use OrtCustomOp's new API GetMayInplace in CreateKernelCreateInfo (#20037)
### Description
<!-- Describe your changes. -->
use OrtCustomOp's new API GetMayInplace in CreateKernelCreateInfo to
hook the inplace map of custom ops


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This PR is to use OrtCustomOp's new API GetMayInplace in
CreateKernelCreateInfo to hook the inplace map of custom ops
2024-03-28 20:45:37 -07:00
Adam Pocock
2f82400b13
[java] Java 21 build support (#19876)
### Description
Bump spotless and the Gradle wrapper to 6.25.0 and 8.6 respectively to
allow compiling ORT on Java 21. The build still targets Java 8.

I'm not sure if there will be CI changes necessary to use this PR,
specifically for the Gradle version as I don't know if that is cached
somewhere earlier in the CI build process.

The new Gradle version adds a warning that using `--source` and
`--target` to select the Java language version is obsolete, which is
annoying. We can fix it if we decide to only allow building on newer
versions of Java while still supporting running on Java 8.

### Motivation and Context
Java 21 is the latest LTS release of Java and ORT should be able to
build on it.
2024-03-28 15:51:22 -07:00
Yi Zhang
f7b52d2e3e
[Fix] Only copy java files when build_java is True (#20121)
### Description


### Motivation and Context
Fix error in Nuget-CUDA-Packaging-Pipeline
2024-03-28 14:06:28 -07:00
Pranav Sharma
3ed0c81b30
Expose Reserve() in OrtAllocator to allow custom allocators to work when session.use_device_allocator_for_initializers is specified. (#19904)
### Description
Expose Reserve() in OrtAllocator to allow custom allocators to work when
session.use_device_allocator_for_initializers is specified.
Update: this change has been verified by Bing Ads and brings a
significant benefit in terms of memory utilization: 30GB less memory and
also better CPU utilization.

### Motivation and Context

https://microsoft-my.sharepoint.com/:w:/p/prs/Eeidf5YNtWtKrPVkfuTDsuABak1oL4QRpuBGuhqRbLKoJg?e=Zl3bah
2024-03-28 12:28:37 -07:00
Yi Zhang
2a38168f0b
increase cl mpcount since Compilation is moved on CPU machine (#20116)
### Description
The CPU machine has 16 cores, so we can increase the parallel count.
Comparing two runs:
1. https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=432328&view=results
2. https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=432331&view=results

The compilation took about 25 minutes with a parallel count of 15, while
it took 41 minutes with a parallel count of 3.



Co-authored-by: Yi Zhang <your@email.com>
2024-03-28 13:30:33 +08:00
Yi Zhang
c5d7310f1b
Remove TSA upload in testing stage (#20115)

---------

Co-authored-by: Yi Zhang <your@email.com>
2024-03-28 13:15:03 +08:00
Yi Zhang
8f069f81c4
Split more windows GPU workflow into 2 stages, building and testing, to make them more stable (#20080)
### Description
Refactor win-ci.yml to solve the random hang issue in more GPU
workflows, and move the building of the nuget-zip packages and Python
CUDA 12 packages to a CPU machine.

---------

Co-authored-by: Yi Zhang <your@email.com>
2024-03-28 12:55:44 +08:00
wejoncy
16af7adc70
[llm exporter]auto infer output shape (#20071)
2024-03-28 09:52:10 +08:00
pengwa
55f63a48ca
Keep original name during fusion (#20097)
### Keep original name during fusion

This helps identify where a fused node came from, which I find very
useful when debugging execution order issues between different
transformer layers.

For example:

- A node named
`/_original_module/model/layers.1/self_attn/MatMul/MatmulTransposeFusion//MatMulScaleFusion/`
goes through two fusion paths in the 1st transformer layer - e.g.
`MatmulTransposeFusion` and `MatMulScaleFusion`.

-
`/_original_module/model/layers.2/post_attention_layernorm/Mul_1/SimplifiedLayerNormFusion/`
node is a fused node by `SimplifiedLayerNormFusion`.


2024-03-28 08:40:34 +08:00
Ye Wang
a9d9b083e4
Fix py package pipeline (#20065)



### Motivation and Context
Fixes #20068
2024-03-27 15:59:35 -07:00
Dmitri Smirnov
b95fd4e644
Enable CUDA EP unit testing on Windows (#20039)
### Description
Address build issues and source code discrepancies.
Fix cuda_test_provider gtest argument stack corruption.

### Motivation and Context
`OpTester` class that is widely used for kernel testing is not
suitable for testing internal classes for EPs that are built as shared
objects.
Currently, CUDA EP tests run only on Linux.
We want to enable testing and developments on Windows,
and create a usable pattern for testing of other EPs internals.

Alternatives considered: 
Abstracting EP unit tests into separate test executable such as
`onnxruntime_test_all`.
This alternative was rejected as it would create a lot more changes in
the established patterns,
and potentially interfere with CUDA functionality with more complex
source code maintenance.
2024-03-27 13:32:36 -07:00
Yi Zhang
ab2eaedfaa
Install ONNX by building source code in Windows DML stage (#20079)
### Description
In #20073, I pinned the onnx version to unblock the whole PR CI.
In fact, we can use the onnx installed by building from source, whose
version is controlled by deps.txt.
For historical reasons, the DML stage installed onnx from PyPI. Now
onnx can be installed the same way as in the other stages.

Add an option to skip installing onnx in win-ci-prebuild-step.
2024-03-27 12:29:34 -07:00
Yi Zhang
4df9d16f98
[Fix] TSAUpload task must be in building stage (#20098)
### Description
In #20085, TSAUpload was placed in the testing stage, so the main branch failed.
2024-03-27 12:20:57 -07:00
Xiaoyu
c8676ffbff
Add ModelProto support for quantize api (#20018)
### Description
Add ModelProto support for `quantize` api



### Motivation and Context
Currently, the `quantize` API only accepts a model path as the input
model. However, for large models, saving and loading from disk can be
time-consuming. By adding `ModelProto` as an input option to the
`quantize` API, significant time can be saved.
2024-03-27 10:40:08 -07:00
Yulong Wang
47903e701a
fix condition in web CI YAML (#20095)
### Description
fix condition in web CI YAML
2024-03-27 10:35:43 -07:00
Nanashi
ca465dc087
[js] Make error friendly when isOrtFormat is undefined (#19958)
### Description
Make the error message friendlier when isOrtFormat is undefined
(`onnxruntime.InferenceSession.create` is called with ArrayBuffer or
Uint8Array).

### Motivation and Context
I was trying to run my ONNX model with the WebGL EP, but it gave me the
error "Cannot read properties of null (reading 'irVersion')".
Using a debugger, I found that the actual error is `int64 is not
supported`, but that error was not visible to me.
So I made it show both errors when isOrtFormat is undefined.
<s>I haven't written unit test yet, so I'm making it draft. (I have no
idea about how do I test this though...)</s>
[d62d942](d62d9425ba)
2024-03-27 02:07:00 -07:00
guyang3532
4aa84003ca
support Pow/Div/Sqrt in PaddingElimination (#20083)
2024-03-27 16:10:07 +08:00
Yulong Wang
28907d8c59
[js/web] workaround NPM test fetch failure (#20020)
### Description

Sometimes `npm test` fails with the error "TypeError: Failed to
fetch".

I checked the callback entry of the localhost server started by karma.
When the "Failed to fetch" error happens, no request reaches the
server. The root cause is still not identified. However, as this
issue only happens occasionally, right after the browser is launched by
the karma runner, retrying works around it most of the time.
2024-03-26 21:35:49 -07:00
Chi Lo
3dcda13e62
[TensorRT EP] Fix concurrency issue for TRT custom op list (#20093)
`CreateTensorRTCustomOpDomainList()` is not thread-safe due to its
static variables, `created_custom_op_list` and `custom_op_domain`.
This PR synchronizes access to them using a mutex.
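
The general pattern (guarding lazily initialized shared state with a lock) can be sketched as below. This is a Python analogy, not the C++ code in the TensorRT EP; all names and the placeholder domain string are hypothetical.

```python
import threading
from typing import List

# Module-level state plays the role of the function's static variables.
_lock = threading.Lock()
_created = False
_domains: List[str] = []


def get_custom_op_domains() -> List[str]:
    """Create the shared list once, even when called from many threads."""
    global _created
    with _lock:
        # Without the lock, two threads could both see _created == False
        # and populate the list twice.
        if not _created:
            _domains.append("example.custom.domain")  # placeholder entry
            _created = True
        # Return a copy so callers cannot mutate the shared list.
        return list(_domains)
```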

see issue: https://github.com/microsoft/onnxruntime/issues/20089
2024-03-26 21:20:14 -07:00