### Description
Fix comparison that was not updated when the threshold was converted to
bytes.
### Motivation and Context
Fix CI failure
This reverts commit f396748ed6.
### Description
- Updates Windows QNN Nuget and Python packaging pipelines to download
QNN SDK from blob storage.
- Makes the QNN SDK version configurable when launching the python
packaging pipeline.
### Motivation and Context
Removes the need to rebuild images to update QNN SDK. Only applies to
Windows pipelines. Linux pipelines still get the SDK from disk.
### Description
Support GQA operator on CPU with FP32.
### Motivation and Context
Right now, models generated for CPU and GPU must be different. GQA CPU
allows these models to be the same.
### Description
Support 1D input to XNNPACK Conv and ConvTranspose by faking a
height of 1 to convert it to 2D input.
### Motivation and Context
Enable speech model with 1D input to use XNNPACK. There is no CPU EP
quantized ConvTranspose, so this fills that gap.
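The height-of-1 trick can be illustrated outside of XNNPACK. The following is a minimal pure-Python sketch, not the EP code: `conv1d` and `conv2d` are hypothetical helpers written for this example, and they only cover the unpadded, stride-1 case.

```python
def conv1d(x, w):
    """Valid 1D cross-correlation of a list x with a kernel w."""
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k)) for i in range(len(x) - k + 1)]


def conv2d(x, w):
    """Valid 2D cross-correlation of a nested list x (H x W) with a kernel w (KH x KW)."""
    kh, kw = len(w), len(w[0])
    out_h = len(x) - kh + 1
    out_w = len(x[0]) - kw + 1
    return [
        [
            sum(x[r + i][c + j] * w[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


signal = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [1.0, 0.0, -1.0]

# Fake a height of 1 for both the input and the kernel.
out_2d = conv2d([signal], [kernel])

# Dropping the fake height dimension recovers the 1D result.
assert out_2d[0] == conv1d(signal, kernel)
```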
Fix handling of nodes that get assigned to kMSInternalNHWCDomain when loading an ORT format model. The ORT format model doesn't contain information about kMSInternalNHWCDomain since it is set during layout transformation. Fall back to known domains instead.
### Description
Contains a critical bug fix.
### Motivation and Context
This PR fixes a bug in OpenVINO (OV) caching and blob generation.
It also handles the precision for the AUTO plugin.
---------
Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
### Introduce memory efficient topo sort (for training)
~~and lazily initialize Priority-Based and Memory-Efficient topo sort.
In most cases they are not needed, so we avoid the overhead of
GraphViewer construction for most use cases.~~
### Motivation and Context
### Description
Add ability to store initializer data in an external file.
Update training checkpoint code to use external file if data > ~2GB.
I don't see a way for the flatbuffers 64-bit offsets to be used, as they
don't support storing 'table' types with 64-bit offsets (and our Tensor
is a 'table' type not a simple struct).
0cfb7eb80b/tests/64bit/test_64bit.fbs (L38-L39)
Allowing a Tensor to have its raw_data in an external file should
hopefully work with the least friction. As it's an extra field it's
backwards compatible.
Please feel free to suggest alternative approaches.
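As a rough illustration of the idea, the sketch below keeps small tensors inline and writes large ones to an external file, recording only an offset and length. It is not the checkpoint code: `store_tensor`, the record field names, and the per-tensor threshold check are all assumptions made for this example (the PR only states that an external file is used if data > ~2GB).

```python
import io

# Hypothetical threshold; the PR only says "~2GB".
EXTERNAL_DATA_THRESHOLD = 2 * 1024**3


def store_tensor(name, raw_data, external_file, threshold=EXTERNAL_DATA_THRESHOLD):
    """Return a record describing where the tensor bytes live."""
    if len(raw_data) <= threshold:
        # Small enough: keep the bytes inline in the record.
        return {"name": name, "raw_data": raw_data}
    # Too large: append the bytes to the external file and keep only a reference.
    offset = external_file.tell()
    external_file.write(raw_data)
    return {"name": name, "external_offset": offset, "external_length": len(raw_data)}


# Usage with an in-memory file and a tiny threshold so the example runs quickly.
external = io.BytesIO()
small = store_tensor("bias", b"\x00" * 4, external, threshold=8)
large = store_tensor("weight", b"\x01" * 16, external, threshold=8)
```

Because the external reference is carried in extra fields, a reader that only knows about inline `raw_data` is unaffected for small tensors, which mirrors the backwards-compatibility argument above.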
Side note: the diffs in the generated *.fbs.h files are unexpectedly
large. Maybe they weren't re-generated when the new flatbuffers version
was checked in. I updated by running:
`python .\compile_schema.py -f <build output dir>\_deps\flatbuffers-build\Debug\flatc.exe`
from onnxruntime\core\flatbuffers\schema which I thought was the correct
way but maybe that's out of date.
I think you can ignore all the diffs in the generated files and just
worry about the changes to the .fbs files in
onnxruntime/core/flatbuffers/schema. Basically start at the bottom of
the files changed and work up as all the 'real' diffs are there.
### Motivation and Context
---------
Co-authored-by: carzh <wolfivyaura@gmail.com>
### Description
Fix the test runner with optional input/output.
This change fixes the OP test runner (.jsonc format test) with optional
input(s) and/or output(s).
This fix reveals a problem with handling optional outputs:
> Take SkipSimplifiedLayerNorm as example:
>
> if in the ONNX model, the node's outputs are: [ 'output_0', '' ]
instead of [ 'output_0' ], the current implementation will fail. The
difference is, in the first case, context.outputCount == 2, and then the
typescript implementation will try to create a tensor for output[1]. It
will eventually call to C++ function (OpKernelContext::Output), and the
output.DataRaw() will be nullptr. WebGPU backend will fail because it
cannot deal with a TensorView with data == 0.
>
This problem may need to be fixed or worked around in a separate PR. This
PR does not fix it. The failing test cases are modified to work. Please
note that this PR does not break those test cases, as they never worked.
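One possible shape for such a workaround, shown only as a sketch: in ONNX, an empty string in a node's output list marks an optional output that is omitted, so a caller could skip those slots before asking for output tensors. `present_outputs` is a hypothetical helper for this example and is not part of this PR or of the WebGPU backend.

```python
def present_outputs(output_names):
    """Return (index, name) pairs for outputs that are actually requested.

    An empty name marks an optional output that the model omits.
    """
    return [(i, name) for i, name in enumerate(output_names) if name != ""]


# Both spellings of the node's outputs yield the same set of tensors to create.
with_placeholder = present_outputs(["output_0", ""])
without_placeholder = present_outputs(["output_0"])
assert with_placeholder == without_placeholder
```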
### Description
This PR registers the following opset 20 operators to the DML EP:
- IsNaN-20
- IsInf-20
- ReduceMax-20
### Motivation and Context
### Description
- Adds support for float16 BatchNormalization to the HTP backend.
- Fixes float32 support for BatchNormalization on the HTP backend when
`enable_htp_fp16_precision` is enabled.
### Motivation and Context
Support more models on the QNN HTP backend.
### Description
Add Nuget package changes for adding new 'net6.0-maccatalyst' platform.
The output ORT Nuget package was manually tested and verified in a .NET
MAUI app setup.
### Motivation and Context
---------
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: Yi Zhang <zhanyi@microsoft.com>
Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
### Description
Eliminate an unused vector that allocated a large memory chunk within a
concurrent routine and created heap contention.
### Motivation and Context
This partially addresses
https://github.com/microsoft/onnxruntime/issues/20373.
### Description
These changes include:
- Support for OpenVINO 2024.1
- Import precompiled blobs with EPContext blob
- Separate device and precision as inputs
- Deprecate the CPU_FP32 and GPU_FP32 terminology and introduce CPU and GPU
- AUTO GPU, CPU will only create a GPU blob and not a CPU blob.
### Motivation and Context
- OpenVINO 2024.1 will be out soon
- Import Precompiled Blob can greatly reduce FEIL/FIL Time.
- Separating Device/Precision will make the input cleaner
---------
Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
### Description
Add a version for onnxruntime_providers_vitisai.dll so that
onnxruntime_vitisai_ep.dll can check whether the version is compatible.
To make sure the old onnxruntime_vitisai_ep.dll still works, we
offset the API struct by the version field.
### Motivation and Context
This is a direct request from Microsoft. The following is the problem
we are trying to solve:
How would you describe the dependency between (a)
onnxruntime_vitisai_ep.dll and (b) onnxruntime_providers_vitisai.dll?
E.g. for each version of (a) there is a minimum required version of (b),
or for each version of (b) there is minimum required version of (a).
Please note that in practice we won't be able to use the exact version
of ORT/EP that you tested against (because we might need to update ORT
for other reasons), but we might be able to accommodate some version
constraints that you specify. As we approach shipping, we'll lock the
version of ORT/EP to allow for stabilization and more detailed testing
(and work with you if it needs to be updated).
### Description
### Motivation and Context
I am prefiring this change to pre-run the non-dml checks, and also to
give folks the time to review it before DML gets released. When DML 1.14
officially releases, we'll only need to run the DML pipeline to
automatically pick up the nuget package. This should save us some
valuable time.
Note that DML 1.14 is the release needed for ORT 1.17.4, and DML 1.15
will come soon after.
### Description
Updates the QDQ quantizer to use ONNX Q/DQ ops for 16-bit quantization
if opset >= 21.
### Motivation and Context
The QDQ quantizer previously set the 'com.microsoft' domain on inserted
Q/DQ ops when the model needed 16-bit support. ONNX 1.16.0 added
int16/uint16 support to the QuantizeLinear and DequantizeLinear
operators, so we can change the default behavior.
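The resulting rule can be sketched as a small selection function. This is an illustration only: `qdq_op_domain` and the string type names are hypothetical, and the fallback to 'com.microsoft' for 16-bit types below opset 21 is inferred from the description above rather than taken from the quantizer source.

```python
ONNX_DOMAIN = ""  # default ONNX operator domain
MS_DOMAIN = "com.microsoft"

SIXTEEN_BIT_TYPES = {"int16", "uint16"}


def qdq_op_domain(quant_type, opset):
    """Pick the domain for an inserted QuantizeLinear/DequantizeLinear op."""
    if quant_type in SIXTEEN_BIT_TYPES and opset < 21:
        # Older opsets lack 16-bit Q/DQ, so fall back to the contrib ops.
        return MS_DOMAIN
    return ONNX_DOMAIN


assert qdq_op_domain("uint16", 21) == ONNX_DOMAIN
assert qdq_op_domain("uint16", 19) == MS_DOMAIN
```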
This PR adds support for the INT64 tensor type for TRT 10.
This PR is also **compatible with TRT 8.6 and TRT 10**, meaning users can
build ORT TRT against TRT 8.6 or TRT 10.
The timeline between TRT 10 GA and the ORT 1.18 release is very tight: we
don't have enough time to get our CIs installed with TRT 10 GA libraries
and run the build/tests. In addition, Nvidia's new Triton release, whose
timeline is also very close to that of TRT 10 GA, wants to integrate the
TRT EP with TRT 10.
Therefore, our approach is to get this PR into ORT 1.18 first, so that
everything is fully tested with the TRT 8.6 CIs and users can still
manually build ORT 1.18 against TRT 10, as in the Triton case.
As for testing TRT 10, once TRT 10 GA is released, we will have another
branch that includes the changes in this PR plus whatever other changes
are needed, and we will update our CIs with TRT 10.
### Description
Currently we try to include all prebuilt binaries in the NPM packages.
This worked until we added libonnxruntime_providers_cuda.so
(>400MB) to the NPM package. The NPM registry refuses to accept the new
package because the file is too large.
To make the new NPM package work, we have to remove the large file
from the package and add a new script that runs on package installation.
This script will try to dynamically install the onnxruntime CUDA dynamic
library for Linux/x64.
### Description
Introducing a new class ORTPipelineModule to handle wrapping layers in
DeepSpeed pipeline parallel.
### Motivation and Context
To support pipeline parallelism on ORTModule.
This PR includes initial support for DeepSpeed pipeline
parallelism.
- [x] Support Pipeline parallel where layers are nn Modules in
Sequential.
- [ ] Support LayerSpec and TiedLayerSpec
- [ ] Enable partitioning to accept List
- [ ] Full-GPU Graph Consolidation
- [ ] Subgraph Merging for Inference
### Description
This PR makes the following changes:
- Expose the `ENABLE_NPU_ADAPTER_ENUMERATION` macro via the build command,
so that a user can enable NPU support for the DML EP seamlessly.
- Add the keyword `_dmlEp_` as part of the node name, which is useful
for debugging.
### Motivation and Context
This adds a new "Graph Capture" option to the DML ep, similar to the
cuda graph functionality. Here's how graph capture works:
- A user can enable graph capture in the session options by setting
`ep.dml.enable_graph_capture` to `true`
- When they want to capture a run, they set `gpu_graph_id` in their
`RunOptions` to a number bigger than 0 (0 is reserved for internal use
according to the cuda graph documentation).
- Then, when they start the inference, the graph will be captured and
stored in the DML EP for future use
- When they execute the run for a second time with the same id, the
`ReplayGraph` function in the DML EP will be called instead of executing
the kernels, resulting in very low overhead and avoiding kernel
recompilation.
This feature can give up-to-par or even better performance than
specifying the static dimensions at session creation time, but is also
much more flexible.
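The capture-then-replay flow above can be modeled as a cache keyed by graph id. The following toy model is only a sketch of that control flow and is not the DML EP implementation: `ToyGraphCaptureEp`, its `_compile` step, and the behavior for id 0 (run normally, capture nothing) are assumptions made for illustration.

```python
class ToyGraphCaptureEp:
    """Toy model of capture/replay keyed by graph id."""

    def __init__(self):
        self._captured = {}  # gpu_graph_id -> recorded command list
        self.compile_count = 0

    def _compile(self, kernels):
        # Stands in for the per-run work that replay is meant to avoid.
        self.compile_count += 1
        return list(kernels)

    def run(self, kernels, x, gpu_graph_id=0):
        if gpu_graph_id > 0 and gpu_graph_id in self._captured:
            commands = self._captured[gpu_graph_id]  # replay path
        else:
            commands = self._compile(kernels)
            if gpu_graph_id > 0:
                self._captured[gpu_graph_id] = commands  # capture path
        for command in commands:
            x = command(x)
        return x


ep = ToyGraphCaptureEp()
kernels = [lambda v: v + 1, lambda v: v * 2]
first = ep.run(kernels, 3, gpu_graph_id=1)   # first run with this id: captured
second = ep.run(kernels, 3, gpu_graph_id=1)  # same id again: replayed
assert first == second and ep.compile_count == 1
```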
The previous implementation used a numpy array and numpy data_type to
store the constant value and data type, which do not support BFloat16
natively. This PR switches to a torch tensor, which supports BFloat16.
### Description
Background:
A user can save a large model with its initializer data in an external file, e.g.:
`onnx.save_model(onnx_model, "path/to/save/the/model.onnx", save_as_external_data=True, all_tensors_to_one_file=True, location="filename", size_threshold=1024)`.
In that case, Ort loads the model, gets the external initializer information (external file name, offset, length), uses the model path to find the external file, and locates the tensor data via the offset and length.
This does not work if the user loads the model from memory, because Ort loses track of the model path.
This PR adds an API/session option that lets the user provide a table with the external initializer file name as the key, and the pointer to the loaded external file in memory plus the buffer length as the value, so that:
1. the user can load the model from a memory buffer with the external initializers in a memory buffer too.
2. the initializers can be shared across sessions, for different EPs.
3. the user can load the file in any way they want, e.g. mmap.
Internally:
1. At session creation time, Ort goes through the external initializers in the graph and gets the file name, offset, and data length of each external initializer from its TensorProto.
2. With the file name, Ort gets the in-memory file buffer and the buffer length from the table the user provided.
3. Ort locates the tensor buffer within the in-memory file buffer (user provided) using the offset and data length (from the TensorProto).
4. Ort creates the Tensor and replaces the existing Tensor in the graph.
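The lookup described in the internal steps can be sketched in a few lines of Python. This is an illustration of the table lookup and offset/length slicing only, not the Ort API: `locate_external_tensor` and the `table` layout are hypothetical names chosen for this example.

```python
def locate_external_tensor(buffers, file_name, offset, length):
    """Return a zero-copy view of one tensor inside a user-provided buffer.

    buffers maps an external file name to the bytes-like object holding that file.
    """
    if file_name not in buffers:
        raise KeyError(f"no in-memory buffer provided for {file_name!r}")
    view = memoryview(buffers[file_name])
    # Reject ranges that would read outside the provided buffer.
    if offset < 0 or length < 0 or offset + length > len(view):
        raise ValueError("tensor range exceeds the provided buffer")
    return view[offset:offset + length]


# Usage: the "file" is 16 bytes already loaded in memory (it could also be an mmap).
file_bytes = bytes(range(16))
table = {"filename": file_bytes}
tensor_bytes = locate_external_tensor(table, "filename", offset=4, length=8)
```

Returning a `memoryview` avoids copying, which matches the goal of sharing one loaded buffer across sessions.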
### Motivation and Context
https://github.com/onnx/onnx/blob/main/docs/ExternalData.md
For a model with external data, the TensorProto may have its initializer data in a separate file. The external file location is set via the file path relative to the model path. When the model is loaded from a memory buffer through the API, Ort loses track of the model path,
which causes an error if the model has external data. By adding a session option to set the external data buffer, Ort can find the external data correctly when the model is loaded from a memory buffer.
This change improves the python API perf in 2 ways:
1. Remove unnecessary CPU syncs by sharing a queue between the python
EPs and the allocator.
2. Add an opt-in CPU spinning sync to reduce overhead in applications
that run a lot of inferences per second.
Bumps [Sixlabors.ImageSharp](https://github.com/SixLabors/ImageSharp)
from 2.1.7 to 2.1.8.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/SixLabors/ImageSharp/releases">Sixlabors.ImageSharp's
releases</a>.</em></p>
<blockquote>
<h2>v2.1.8</h2>
<h2>What's Changed</h2>
<ul>
<li>V2 - Limit Read Palette Indices by <a
href="https://github.com/JimBobSquarePants"><code>@JimBobSquarePants</code></a>
in <a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2719">SixLabors/ImageSharp#2719</a></li>
<li>V2 - Clear Pixel Buffers on Decode. by <a
href="https://github.com/JimBobSquarePants"><code>@JimBobSquarePants</code></a>
in <a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2717">SixLabors/ImageSharp#2717</a></li>
<li>V2 - Limit all memory allocations in the MemoryAllocator layer by <a
href="https://github.com/JimBobSquarePants"><code>@JimBobSquarePants</code></a>
in <a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2715">SixLabors/ImageSharp#2715</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/SixLabors/ImageSharp/compare/v2.1.7...v2.1.8">https://github.com/SixLabors/ImageSharp/compare/v2.1.7...v2.1.8</a></p>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="f21d64188e"><code>f21d641</code></a>
Merge pull request <a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2715">#2715</a>
from SixLabors/backport/v2-memlimit</li>
<li><a
href="8f0b4d3e68"><code>8f0b4d3</code></a>
Merge pull request <a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2717">#2717</a>
from SixLabors/backport/v2-clear-buffers</li>
<li><a
href="cf9496d284"><code>cf9496d</code></a>
test allocation limits</li>
<li><a
href="3d298db2cd"><code>3d298db</code></a>
Adapt BmpDecoder_ThrowsException_Issue2696 for V2</li>
<li><a
href="a78ce27a2b"><code>a78ce27</code></a>
Merge pull request <a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2719">#2719</a>
from SixLabors/backport/v2-check-palette-indices</li>
<li><a
href="e6209147b1"><code>e620914</code></a>
Clamp read palette indices.</li>
<li><a
href="c122185ea0"><code>c122185</code></a>
Clear pixel buffers on decode.</li>
<li><a
href="5c6ec5d6fb"><code>5c6ec5d</code></a>
Limit all allocations</li>
<li>See full diff in <a
href="https://github.com/SixLabors/ImageSharp/compare/v2.1.7...v2.1.8">compare
view</a></li>
</ul>
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
### Description
- Fix quantization tool bug that did not correctly set a quantized
bias's scale data type to fp16 if the original bias was fp16.
- Enabled fp16 ConvTranspose quantization unit tests that were disabled.
### Motivation and Context
Python quantization tests for fp16 ConvTranspose were originally
disabled due to a shape inference bug. It turns out that we also have a
bug in our quantizer that does not properly handle fp16 bias inputs.
Fixing the bug allows us to re-enable these tests with the latest
version of ONNX.
### Description
- Adds a patch that fixes a shape inference bug that caused a segfault:
https://github.com/onnx/onnx/pull/6080
- Fix documentation describing why QLinearMatMul tests are currently
being skipped.
### Motivation and Context
The [PR for integrating with ONNX
1.16.0](https://github.com/microsoft/onnxruntime/pull/19745) disabled
various python quantization tests due to a shape inference bug. This PR
applies the ONNX fix as a patch. We still can't enable the tests because
some of our CIs pip install onnx-1.16.0, which doesn't include the fix.
Enable a provider option to let the user provide the profiling file path.
Separate out the profiling level for ETW, in case ETW is enabled when Ort creates the QNN profiling and then gets disabled when Ort logs the profiling events, or vice versa. Enhance the logic that decides the profiling level.
### Description
Add a platform-aware helper to fetch the errno message string.
### Motivation and Context
For usage in #20077
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>