### Description
<!-- Describe your changes. -->
1. add option to export onnx compatiable with ort_vllm. This makes sure
that onnx model only leverages on paged attn from vllm. It's intended to
use internally so not mentioned in readme.
2. add details in ORT
installation(https://github.com/microsoft/onnxruntime/pull/19338#discussion_r1476906190)
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: wejoncy <wejoncy@163.com>
### Description
This pull request includes a small change to the
`Dockerfile.manylinux2_28_cuda` file in the
`tools/ci_build/github/linux/docker` directory. The change corrects the
`PREPEND_PATH` argument from `/usr/local/cuda/binet` to
`/usr/local/cuda/bin`, ensuring the correct path to CUDA binaries is
set.
### Description
<!-- Describe your changes. -->
An overridable initializer should not have a fixed value included in an
NNAPI model as it could be changed at runtime. The current check doesn't
include validating that the initializer is constant.
I was updating GetClipMinMax as part of adding CoreML EP ML Program
support, and in order to make both CoreML and NNAPI do the more correct
thing of using IsConstantInitializer this set of changes was required.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Make NNAPI and CoreML EPs more correct.
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
This change
55a669409a
didn't take into account external data when unpacking initializer, and
therefore crashes when trying to unpack them.
When configured using the following CMake ops Clion is not able to
configure due to checking with `nvcc ... --dryrun tmp.cu`:
```
cmake -G Ninja -Donnxruntime_USE_TENSORRT="ON" -Donnxruntime_USE_CUDA="ON" -Donnxruntime_USE_CUDA_NHWC_OPS="ON" -DCMAKE_CUDA_ARCHITECTURES="native" -Donnxruntime_NVCC_THREADS=1 -Donnxruntime_ENABLE_NVTX_PROFILE="ON" -Donnxruntime_USE_TENSORRT_BUILTIN_PARSER="ON" -DCMAKE_CUDA_COMPILER_LAUNCHER="ccache" -Donnxruntime_BUILD_UNIT_TESTS="ON" -Donnxruntime_USE_TRITON_KERNEL=OFF -Donnxruntime_USE_FLASH_ATTENTION=OFF
```
Without building the unittests everything works fine. I believe my
changes only follow the logic that is actually desired. If
`NVCC_HAS_STRICT_ALIASING` is set to false it should not be possible to
add this as a CUDA flag. Same is true for `HAS_NOERROR` as seen in
`adjust_global_compile_flags.cmake`
[TF32](https://blogs.nvidia.com/blog/tensorfloat-32-precision-format/)
could help boost performance on GPU of SM >= 80. Sometime, user observes accuracy loss, or need disable TF32 for testing
purpose. To disable TF32, it is also possible to set environment
variable `NVIDIA_TF32_OVERRIDE = 0`. However, sometime we do not want to
use environment variable to avoid impacting other applications, or want
to have finer control (like one session using TF32, and another session
not). This provider option could help.
Here we add a provider option `use_tf32`. When `use_tf32 = 0`, we will
disable TF32 for float MatMul/GEMM in cublas. It applies to MatMulNBits,
Attention, LongformerAttention, PackedAttention,
PackedMultiHeadAttention operators when float GEMM is used internally in
the operator. Note that it will not impact other data type, like fp8
gemm could still use TF32 in accumulation.
Previously, cublasGemmStridedBatchedHelper does not use TF32 in
inference. Here we enabled TF32 by default, so we might observe speed up
for FP32 transformers models on SM >= 80.
There is another PR that enables the option for cuDNN Conv later.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
https://github.com/microsoft/onnxruntime/issues/15407https://github.com/microsoft/onnxruntime/issues/19288
This pull request replaces `CUBLAS_TENSOR_OP_MATH` with
`CUBLAS_DEFAULT_MATH`. The changes affect several files, including test
cases and a Python script for AMD hipify process.
### Motivation and Context
CUBLAS_TENSOR_OP_MATH mode is deprecated:
https://docs.nvidia.com/cuda/cublas/index.html#cublasmath-t
On CUDA versions prior to 11, users are required to set the math mode to
CUBLAS_TENSOR_OP_MATH manually to be able to use tensor cores for FP16.
On CUDA 11 and CUDA 12, this is no longer required. Since latest ORT
only supports CUDA >= 11 so it is safe to remove CUBLAS_TENSOR_OP_MATH
from our code base.
Bumps [github/issue-labeler](https://github.com/github/issue-labeler)
from 3.3 to 3.4.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/github/issue-labeler/releases">github/issue-labeler's
releases</a>.</em></p>
<blockquote>
<h2>v3.4</h2>
<h2>What's Changed</h2>
<ul>
<li>Fix warning by update node version by <a
href="https://github.com/renanfranca"><code>@renanfranca</code></a> in
<a
href="https://redirect.github.com/github/issue-labeler/pull/82">github/issue-labeler#82</a></li>
</ul>
<h2>New Contributors</h2>
<ul>
<li><a
href="https://github.com/renanfranca"><code>@renanfranca</code></a>
made their first contribution in <a
href="https://redirect.github.com/github/issue-labeler/pull/82">github/issue-labeler#82</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/github/issue-labeler/compare/v3.3...v3.4">https://github.com/github/issue-labeler/compare/v3.3...v3.4</a></p>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="c1b0f9f52a"><code>c1b0f9f</code></a>
v3.4 release</li>
<li><a
href="50f02baa95"><code>50f02ba</code></a>
Fix warning by update node version (<a
href="https://redirect.github.com/github/issue-labeler/issues/82">#82</a>)</li>
<li><a
href="a8d4f1b8e8"><code>a8d4f1b</code></a>
Update README.md</li>
<li>See full diff in <a
href="https://github.com/github/issue-labeler/compare/v3.3...v3.4">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
### Description
Changed the type annotation of sess_options in InferenceSession's
`__init__` method
### Motivation and Context
sess_options is one `SessionOptions`, not a sequence of it.
It is passed directly into `C.InferenceSession`, and from the definition
of
[`C.InferenceSession`](efc17e79de/onnxruntime/python/onnxruntime_pybind_state.cc (L1790)),
we can see that it is not a sequence:
```cpp
py::class_<PyInferenceSession>(m, "InferenceSession", R"pbdoc(This is the main class used to run a model.)pbdoc")
// In Python3, a Python bytes object will be passed to C++ functions that accept std::string or char*
// without any conversion. So this init method can be used for model file path (string) and model content (bytes)
.def(py::init([](const PySessionOptions& so, const std::string arg, bool is_arg_file_name,
bool load_config_from_model = false) {
```
When I test a new provider option, the training pipeline failed. I found
that training uses hash code of provider info to try get provider
instance. If a provider option is not used in hashing, the provider
instance fetched from cache might have different configuration for that
option.
Here I fix the hashing to use all provider options (except the default
Arena config that cannot be set from python API since training is used
with PyTorch in most cases).
Fixed a few obvious typo in the touched files.
Add regression test cases.
This pull request includes modifications to the `c-api-cpu.yml` Azure
Pipelines configuration file. The changes mainly revolve around the
Node.js packaging stage and the handling of Node.js artifacts. The most
significant changes include renaming the Node.js packaging stage, adding
a new dependency to the stage, changing artifact names, adding a new
script to list Node.js artifacts, and updating the source folder for
copying NuGet binaries.
Changes in Node.js packaging:
*
[`tools/ci_build/github/azure-pipelines/templates/c-api-cpu.yml`](diffhunk://#diff-00815920cc190d10fdebceac0c3a4b8a59e408684ae38177dfe7f96cae276c59L503-R508):
Renamed the Node.js packaging stage from `Nodejs_Packaging_CPU` to
`Nodejs_Packaging` and added `Windows_CI_GPU_DML_Dev` as a new
dependency to the stage.
Changes in handling of Node.js artifacts:
*
[`tools/ci_build/github/azure-pipelines/templates/c-api-cpu.yml`](diffhunk://#diff-00815920cc190d10fdebceac0c3a4b8a59e408684ae38177dfe7f96cae276c59L568-R569):
Changed the artifact name from `drop-onnxruntime-nodejs-win-x64` to
`drop-onnxruntime-nodejs-win-x64-dml` in the task to download pipeline
artifacts for Windows x64.
*
[`tools/ci_build/github/azure-pipelines/templates/c-api-cpu.yml`](diffhunk://#diff-00815920cc190d10fdebceac0c3a4b8a59e408684ae38177dfe7f96cae276c59R595-R598):
Added a new script to list Node.js artifacts from the directory
`$(Build.BinariesDirectory)/nodejs-artifacts/win32/x64/`.
*
[`tools/ci_build/github/azure-pipelines/templates/c-api-cpu.yml`](diffhunk://#diff-00815920cc190d10fdebceac0c3a4b8a59e408684ae38177dfe7f96cae276c59L635-R640):
Updated the source folder from
`$(Build.BinariesDirectory)\RelWithDebInfo\RelWithDebInfo\nuget-artifacts\onnxruntime-win-x64\lib`
to `$(Build.BinariesDirectory)\nodejs-artifacts\win32\x64` in the task
to copy NuGet binaries to the directory
`$(Build.SourcesDirectory)\js\node\bin\napi-v3\win32\x64`.
---------
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
### Description
<!-- Describe your changes. -->
Add ATen fallback support for bicubic interpolation algorithm.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Required for facebook/dinov2 model architecture as part of ONNX Runtime
integration with AML Vision models.
### Description
<!-- Describe your changes. -->
This PR adds
onnx conversion script for dynamo exported phi2,
optimization script,
and inference example script
A readme file is added as documentation.
https://github.com/microsoft/onnxruntime/tree/wangye/phi2_doc/onnxruntime/python/tools/transformers/models/phi2#readme
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Bumps
[gradle/gradle-build-action](https://github.com/gradle/gradle-build-action)
from 2 to 3.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/gradle/gradle-build-action/releases">gradle/gradle-build-action's
releases</a>.</em></p>
<blockquote>
<h2>v3.0.0-rc.1</h2>
<p>First release candidate of
<code>gradle/gradle-build-action@v3.0.0</code>.
This release candidate will the first release available under the
<code>v3</code> version tag.</p>
<blockquote>
<p>[!IMPORTANT]
As of <code>v3</code> this action has been superceded by
<code>gradle/actions/setup-gradle</code>.
Any workflow that uses <code>gradle/gradle-build-action@v3</code> will
transparently delegate to
<code>gradle/actions/setup-gradle@v3</code>.</p>
<p>Users are encouraged to update their workflows, replacing:</p>
<pre><code>uses: gradle/gradle-build-action@v3
</code></pre>
<p>with</p>
<pre><code>uses: gradle/actions/setup-gradle@v3
</code></pre>
<p>See the <a
href="https://github.com/gradle/actions/tree/main/setup-gradle">setup-gradle
documentation</a> for up-to-date documentation for
<code>gradle/actons/setup-gradle</code>.</p>
</blockquote>
<h2>Changes from <code>gradle-build-action@v2</code></h2>
<p>This release brings some useful and much requested features,
including:</p>
<ul>
<li>save and restore the Gradle configuration-cache data</li>
<li>add the Job summary content as a PR comment</li>
<li>easily publish Build Scans® to the free <a
href="https://scans.gradle.com">Gradle Build Scan service</a></li>
<li>compatibility with Node 20</li>
</ul>
<p>The only major breaking change from
<code>gradle-build-action@v2.12.0</code> is the update to require a Node
20 runtime environment.
Aside from that change, this release should generally serve as a drop-in
replacement for <code>gradle-build-action@v2</code>.</p>
<h3>Changelog</h3>
<ul>
<li>[NEW] - Run with NodeJs 20.x (<a
href="https://redirect.github.com/gradle/gradle-build-action/issues/946">gradle/gradle-build-action#946</a>)</li>
<li>[NEW] - Support for save & restore of configuration-cache data
(<a
href="https://redirect.github.com/gradle/gradle-build-action/issues/966">gradle/gradle-build-action#966</a>)</li>
<li>[NEW] - Support for automatic adding PR comment with Job Summary
content (<a
href="https://redirect.github.com/gradle/gradle-build-action/issues/1020">gradle/gradle-build-action#1020</a>)</li>
<li>[NEW] - Make it easy to publish a Build Scan® to <a
href="https://scans.gradle.com">https://scans.gradle.com</a> (<a
href="https://redirect.github.com/gradle/gradle-build-action/issues/1044">gradle/gradle-build-action#1044</a>)</li>
<li>[NEW] - Added <code>dependency-graph-continue-on-failure</code>
input, which can be set to <code>false</code> to force the Job to fail
when dependency graph submission fails (<a
href="https://redirect.github.com/gradle/gradle-build-action/issues/1036">gradle/gradle-build-action#1036</a>).
Failure modes include:
<ul>
<li>Fail build step if version of Gradle being executed is not supported
for dependency-graph generation (<a
href="https://redirect.github.com/gradle/gradle-build-action/issues/1034">gradle/gradle-build-action#1034</a>)</li>
<li>Fail job if permissions are insufficient to submit dependency graph
via Dependency Submission API (<a
href="https://redirect.github.com/gradle/gradle-build-action/issues/997">gradle/gradle-build-action#997</a>)</li>
</ul>
</li>
<li>[NEW] - Add <code>dependency-graph: clear</code> option to clear any
dependency-graph previously submitted by the job</li>
<li>[FIX] Allow cache entries to be reused by jobs with the same ID in
different workflows (<a
href="https://redirect.github.com/gradle/gradle-build-action/issues/1017">gradle/gradle-build-action#1017</a>)
<ul>
<li>Workflow name remains part of the cache key, but cache entries
generated by the same job id in a different workflow may be
restored</li>
</ul>
</li>
<li>[FIX] Register pre-installed JDKs in Maven toolchains.xml file (<a
href="https://redirect.github.com/gradle/gradle-build-action/issues/1024">gradle/gradle-build-action#1024</a>)
<ul>
<li>This allows pre-installed JDKs to be auto-detected by Gradle
Toolchain support on Windows</li>
</ul>
</li>
<li>[FIX] - Update the Gradle Enterprise injection configuration for
product rename to Develocity (<a
href="https://redirect.github.com/gradle/gradle-build-action/issues/995">gradle/gradle-build-action#995</a>)</li>
<li>[FIX] - Avoid submitting an empty dependency graph when state is
loaded from configuration-cache</li>
<li>[DEPRECATION] - Deprecation of the arguments parameter (<a
href="https://redirect.github.com/gradle/gradle-build-action/issues/996">gradle/gradle-build-action#996</a>)</li>
<li>[BREAKING CHANGE] - Remove the <code>gradle-executable</code> input
parameter. Use a separate workflow Step to execute a Gradle from a
custom location.</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="4a8703fa34"><code>4a8703f</code></a>
Delegate to 'setup-gradle@v3.0.0-rc.1'</li>
<li><a
href="4a39eedb8c"><code>4a39eed</code></a>
Mention setup-gradle in README</li>
<li><a
href="272883a7ba"><code>272883a</code></a>
Remove all action sources: these have been migrated to
'gradle/actions'</li>
<li><a
href="2a8bfcf231"><code>2a8bfcf</code></a>
Delegate action implementation to gradle/actions/setup-gradle</li>
<li><a
href="e1ada08a9a"><code>e1ada08</code></a>
Bump the github-actions group with 1 update (<a
href="https://redirect.github.com/gradle/gradle-build-action/issues/1047">#1047</a>)</li>
<li><a
href="a8e3e5e2b4"><code>a8e3e5e</code></a>
Apply dependency version updates</li>
<li><a
href="2be01ca1c6"><code>2be01ca</code></a>
Build outputs</li>
<li><a
href="a00827eebb"><code>a00827e</code></a>
Bump the npm-dependencies group with 7 updates</li>
<li><a
href="ad80850e98"><code>ad80850</code></a>
Bump the github-actions group with 2 updates</li>
<li><a
href="bd6d0a74d4"><code>bd6d0a7</code></a>
Configure explicit java version for config-cache test</li>
<li>Additional commits viewable in <a
href="https://github.com/gradle/gradle-build-action/compare/v2...v3">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
### Description
1. make parity_check use local model to avoid using hf token
2. del the model didn't work because it tried to del the object define
out of the function scope.
So it caused out of memory in A10.
3. In fact, 16G GPU memory (one T4) is enough. But the conversion
process always be killed in T4 and it works on A10/24G.
Standard_NC4as_T4_v3 has 28G CPU memory
Standard_NV36ads_A10_v5 has 440G memory.
It looks that the model conversion needs very huge memory.
### Motivation and Context
Last time, I came across some issues in convert_to_onnx.py so I use the
onnx model in https://github.com/microsoft/Llama-2-Onnx for testing.
Now, these issues could be fixed. So I use onnx model generated by this
repo and the CI can cover the model conversion.
Fix pytest version to 7.4.4, higher version will cause error
`from onnxruntime.capi import onnxruntime_validation
ModuleNotFoundError: No module named 'onnxruntime.capi'`
### Description
<!-- Describe your changes. -->
Setup usage of coremltools via dependencies instead of copying files.
Pull in some changes from
https://github.com/microsoft/onnxruntime/pull/19347 in preparation for
supporting ML Program and enabling building the ML Model on all
platforms to make development and testing of CoreML EP code easier.
- Update to coremltools 7.1
- Add patch for changes required for cross platform build of ML Program
related code
- Generate coreml proto files on all platforms
- mainly to test these changes work everywhere, as the proto files will
be used on all platforms when #19347 is checked in
- rename onnxruntime_coreml_proto target to coreml_proto as it contains
purely coreml protobuf code with no ORT related chagnes
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Improve setup.
### Description
This PR 1) adds LeakyRelu activation for fusedConv; 2) makes `vec4<f16>`
value work with `float32` uniforms attributes.
For example:
`clamp(value, vec4<f16>(uniforms.clip_min),
vec4<f16>(uniforms.clip_max)` will throw compilation errors since
`uniforms.clip_min` and `uniforms.clip_min` are `f32` not `f16`. So we
need to change it to `clamp(value, vec4<f16>(f16(uniforms.clip_min)),
vec4<f16>(f16(uniforms.clip_max))`
And above problem was introduced when we make activation attributes as
uniforms instead of constant.
BTW, after adding LeakyRelu, `realesrgan-t256` model can pass.
### Description
support external data in npm test.
This allows test runner to detect whether an external data is available
in the test folder, and if it is, load it as external data
automatically.
this feature does not parse every model to figure out whether the model
has external data. the following comments in code explained how to
determine whether should parse the model file.
```js
// for performance consideration, we do not parse every model. when we think it's likely to have external
// data, we will parse it. We think it's "likely" when one of the following conditions is met:
// 1. any file in the same folder has the similar file name as the model file
// (e.g., model file is "model_abc.onnx", and there is a file "model_abc.pb" or "model_abc.onnx.data")
// 2. the file size is larger than 1GB
```
### Description
- When converting ONNX split sizes to QNN split indices, do not include
the split at index 0. QNN 2.19 assumes index 0 is implicit and throws a
validation error if provided.
- Fix bug when using an ONNX Split operator with a `num_outputs`
attribute that does not evenly divide into `shape[axis]`. The ONNX spec
states that the last chunk should be smaller, but QNN EP made the last
chunk larger.
- Fix bug when using an ONNX Split operator with a `split` input. QNN EP
was incorrectly passing the split sizes as split indices without
conversion.
### Motivation and Context
QNN SDK 2.19 updated validation criteria for Split operators. QNN EP was
previously passing a split index that should have been implicit.
Also, discovered a bugs when using `num_outputs` attribute and `split`
input.
### Description
Currently, ORT will fail a build when the flag DEBUG_GENERATION is set
to 1 (used to debug BeamSearch and GreedySearch) in
[console_dumper.h](3b63d85c25/onnxruntime/contrib_ops/cpu/utils/console_dumper.h (L12))
with the following error:
`onnxruntime/onnxruntime/contrib_ops/cpu/transformers/logits_processor.h:270:15:
error: ‘DumpScores’ was not declared in this scope`
This is because it is defined in `logits_processor.cc`, and a debugging
artifact was passed in an earlier PR where this function is called from
`logits_processor.h` before it is defined
[[link](3a2ab1963a/onnxruntime/contrib_ops/cpu/transformers/logits_processor.h (L270))].
Builds with the flag have been broken since that PR was merged.
This PR moves DumpScores() definition from `logits_processor.cc` to
`logits_processor.h` so that all debug statements can be used correctly
in `logits_processor.cc` and `logits_processor.h` and build succeeds
with this debug flag.
---------
Co-authored-by: Peter McAughan <petermca@microsoft.com>
### Description
<!-- Describe your changes. -->
Resolving compilation errors when using USE_VITISAI
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
There will be compilation errors when USE_VITISAI is enabled
This is in addition to the #19058
Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Multi-partition support for context binary cache feature
1. In QNNEP create the list of EPContext nodes if ep_context_enable is enabled, so that it can dump the model with multiple partitions
2. Extend context loading part to support multiple EPContext nodes
### Motivation and Context
It only support single partition before this changes. There's graph partition limitation for context cache feature after this change.
### Description
<!-- Describe your changes. -->
Register resize-18 and -19, which will be lit up automatically when dml
feature level bumps up to 6300.
It's worth noting that DML has a different implementation for antialias
than does ORT CPU. DML does iterative downsampling whenever the scale
factor is less than 0.5. This is equivalent to performing resize with a
variable-sized input window (also equivalent to mip mapping). ORT takes
a different approach, using the same convolution approach as PIL. The
two implementations approach each other in certain cases (with
iota-generated data) but they usually aren't perfectly equivalent.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Linnea May <linneamay@microsoft.com>
### Description
<!-- Describe your changes. -->
Refactor the VAIEP to use MSFT's standalone API
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Vitis ONNX RT VAI should switch to using the standalone API for ONNX EPs
in order to decouple the EP from onnxruntime.dll and the providers.dll.
This will help to simplify customer deployment of applications and use
cases that need to share their onnxruntime.dll with other applications.
---------
Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>
Co-authored-by: zz002 <zhenze.wang@amd.com>
### Description
<!-- Describe your changes. -->
Increment num_resolves_ inside the graph resolve finalization function
so the subgraphs have the same value.
This prevents incorrect output regarding removing unused initializers.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
#19141
### Description
Adds type/shape inferencing support for MSFT domain QuantizeLinear and
DequantizeLinear operators to symbolic_shape_infer.py
### Motivation and Context
Need a way to infer the types and shapes of Q/DQ ops in models that use
the MSFT domain versions (e.g., int16 quantization).
### Description
Building on g++ 13.2.0 results in -Wstringop-overread errors on Linux.
This commit addresses the flatbuffer build issue with the following
changes:
1. Remove the Werror flag in the flarbuffer patch.
2. Add a compilation option to suppress the 'stringop-overflow' error in
the Flatbuffers within the xnnpack provider.
### Motivation and Context
https://github.com/google/flatbuffers/issues/8119https://github.com/microsoft/onnxruntime/pull/19239
Signed-off-by: Phoebe Chen <phoebe.chen@sifive.com>
### Description
There is a current bug in the BeamSearch implementation of T5, GPT, and
Whisper due to an interaction between two PRs merged in the past 7
months.
First PR/code change is the addition of BeamSearchScorer GPU
implementation. This PR accelerates some operations by executing them in
the GPU and not the CPU. The approach for this code change didn't
utilize a cudaStream when copying one particular variable from GPU to
CPU (see nullptr value here:
[[link](b65d3d0a53/onnxruntime/contrib_ops/cpu/transformers/beam_search_impl_t5.h (L213))]).
The second PR/code change was the alteration to utilize a cudaStream to
initialize various memory buffers in BeamSearch (see `stream` included
as the last argument in these allocations
[[link](d1431e1b78/onnxruntime/contrib_ops/cpu/transformers/beam_search_impl_base.h (L25))]).
During the in-between period of these two PRs, I believe neither
allocation utilized a stream and were thus synchronized. Once the latter
PR was merged, the copy became desynchronized with the initialization
due to different streams.
The fix for this is to reintroduce the same stream into the copy
operation added in the first PR.
### Motivation and Context
This does not happen reliably on every hardware with every script due to
the race condition nature, but the bug completely breaks ORT execution
with a BeamSearch model.
---------
Co-authored-by: Peter McAughan <petermca@microsoft.com>
### Description
This PR expands the graph capture capability to JS EP, which is similar
to #16081. But for JS EP, we don't use the CUDA Graph, instead, we
records all gpu commands and replay them, which removes most of the cpu
overhead to avoid the the situation that gpu waiting for cpu.
mobilenetv2-12 becomes 3.7ms from 6ms on NV 3090 and becomes 3.38ms from
4.58ms on Intel A770.
All limitations are similar with CUDA EP:
1. Models with control-flow ops (i.e. If, Loop and Scan ops) are not
supported.
2. Usage of graph capture is limited to models where-in all ops in the
model can be partitioned to the JS EP or CPU EP and no memory copy
between them.
3. Shapes of inputs/outputs cannot change across inference calls.
4. IObinding is required.
The usage is like below:
Method 1: specify outputs buffers explicitly.
```
const sessionOptions = {
executionProviders: [
{
name: "webgpu",
},
],
enableGraphCapture: true,
};
const session = await ort.InferenceSession.create('./models/mobilenetv2-12.onnx', sessionOptions);
// prepare the inputBuffer/outputBuffer
... ...
const feeds = {
'input': ort.Tensor.fromGpuBuffer(inputBuffer, { dataType: 'float32', dims })
};
const fetches = {
'output': ort.Tensor.fromGpuBuffer(outputBuffer, { dataType: 'float32', dims: [1, 1000] })
};
let results = await session.run(feeds, fetches); // The first run will begin to capture the graph.
// update inputBuffer content
... ...
results = = await session.run(feeds, fetches); // The 2ed run and after will directly call replay to execute the graph.
... ...
session.release();
```
Method 2: Don't specify outputs buffers explicitly. Internally, when
graph capture is enabled, it will set all outputs location to
'gpu-buffer'.
```
const sessionOptions = {
executionProviders: [
{
name: "webgpu",
},
],
enableGraphCapture: true,
};
const session = await ort.InferenceSession.create('./models/mobilenetv2-12.onnx', sessionOptions);
// prepare the inputBuffer
... ...
const feeds = {
'input': ort.Tensor.fromGpuBuffer(inputBuffer, { dataType: 'float32', dims })
};
let results = await session.run(feeds); // The first run will begin to capture the graph.
// update inputBuffer content
... ...
results = = await session.run(feeds); // The 2ed run and after will directly call replay to execute the graph.
... ...
session.release();
### Description
<!-- Describe your changes. -->
- Split out the code that implements the OrtKernelContext API (used by
compiled nodes and custom ops) and the code that implements the custom
ops API.
- Exclude based on minimal build settings using helpers
- the main change is to simply wrap the implementation into a lambda so
it can be easily enabled/disabled
- actual implementation of all functions are unchanged
- Re-organize so the related implementations are together
- most diffs are from this, but without the reorg it would be much
harder to know which helper to use
- General cleanup of lines that were too long.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Saves ~10KB in a minimal build.
Build command used for comparison
```
./build --android --android_api=29 --android_sdk="d:\Android" --android_abi=arm64-v8a --parallel --android_ndk_path="D:\Android\ndk\26.0.10792818\" --build_shared_lib --cmake_generator Ninja --skip_tests --minimal_build --disable_rtti --disable_ml_ops --disable_exceptions --cmake_extra_defines=onnxruntime_BUILD_UNIT_TESTS=OFF --include_ops_by_config .\no_ops.config --config MinSizeRel
```
Main: 1,218,480 bytes
With changes: 1,208,320 bytes