Adds QNN EP HTP shared memory allocator.
The HTP shared memory allocator (`HtpSharedMemoryAllocator`) calls the
rpcmem shared library (libcdsprpc.so/dll) to allocate and free memory
that can be shared between HTP and CPU.
The allocator can be enabled by setting QNN EP option
`enable_htp_shared_memory_allocator` to `1`.
`QNNExecutionProvider::CreatePreferredAllocators()` will then return an
instance of `HtpSharedMemoryAllocator`.
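For reference, a minimal sketch of opting into the allocator from application code, assuming the QNN EP is registered through the C++ API's `AppendExecutionProvider` with provider name "QNN" (the model path and `backend_path` value are placeholders):
```cpp
#include <string>
#include <unordered_map>

#include "onnxruntime_cxx_api.h"

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "qnn_htp_shared_mem");
  Ort::SessionOptions session_options;

  // QNN EP options are passed as string key/value pairs.
  std::unordered_map<std::string, std::string> qnn_options{
      {"backend_path", "libQnnHtp.so"},             // HTP backend library (placeholder path)
      {"enable_htp_shared_memory_allocator", "1"},  // opt into HtpSharedMemoryAllocator
  };
  session_options.AppendExecutionProvider("QNN", qnn_options);

  Ort::Session session(env, ORT_TSTR("model.onnx"), session_options);
  return 0;
}
```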
For each QNN context, we also need to register and unregister memory
handles in order to use the HTP shared memory. This memory handle
management is added to `QnnBackendManager`, which also manages the QNN
context handles.
For more information about using HTP shared memory with QNN, see:
https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/htp_shared_buffer_tutorial.html#shared-buffer-tutorial
Limitations:
- HTP shared memory usage is only supported for graph inputs and
outputs. Intermediate values are not supported.
- An allocation is assigned to a single shared memory buffer. The
allocator is not smart enough to have multiple allocations share a
single shared memory buffer.
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
### Description
This PR contains a part of the changes in #23318.
The reason for creating this PR is that the work to support building the WebGPU
EP in WASM depends on #23318, which cannot be merged since it is blocked
by upstream (https://github.com/llvm/llvm-project/issues/122166). This
PR contains the changes that can be safely merged separately and unblocks
the development of building the WebGPU EP in WASM.
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
### Description
This change simplifies the o2i_output implementation by reducing
unnecessary intermediate variables, with no change in functionality.
### Motivation and Context
As above.
Signed-off-by: Jianhui Dai <jianhui.j.dai@intel.com>
### Description
Use LOGS_DEFAULT for device-lost logging.
Since the GPU device lifecycle is now managed by WebGpuContext, ORT
logging can be used.
### Description
Increase the absolute error tolerance for test case `MatMulNBits.Float16Large` to 0.1
for WebGPU with the subgroup implementation.
Fixes the WebGPU CI pipeline.
Test: onnxruntime_test_all.exe --gtest_filter=SplitOperatorTest.*
### Description
Bumps [black](https://github.com/psf/black) from 24.2.0 to 24.10.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/psf/black/releases">black's
releases</a>.</em></p>
<blockquote>
<h2>24.10.0</h2>
<h3>Highlights</h3>
<ul>
<li>Black is now officially tested with Python 3.13 and provides Python
3.13
mypyc-compiled wheels. (<a
href="https://redirect.github.com/psf/black/issues/4436">#4436</a>) (<a
href="https://redirect.github.com/psf/black/issues/4449">#4449</a>)</li>
<li>Black will issue an error when used with Python 3.12.5, due to an
upstream memory
safety issue in Python 3.12.5 that can cause Black's AST safety checks
to fail. Please
use Python 3.12.6 or Python 3.12.4 instead. (<a
href="https://redirect.github.com/psf/black/issues/4447">#4447</a>)</li>
<li>Black no longer supports running with Python 3.8 (<a
href="https://redirect.github.com/psf/black/issues/4452">#4452</a>)</li>
</ul>
<h3>Stable style</h3>
<ul>
<li>Fix crashes involving comments in parenthesised return types or
<code>X | Y</code> style unions.
(<a
href="https://redirect.github.com/psf/black/issues/4453">#4453</a>)</li>
<li>Fix skipping Jupyter cells with unknown <code>%%</code> magic (<a
href="https://redirect.github.com/psf/black/issues/4462">#4462</a>)</li>
</ul>
<h3>Preview style</h3>
<ul>
<li>Fix type annotation spacing between * and more complex type variable
tuple (i.e. <code>def fn(*args: *tuple[*Ts, T]) -> None: pass</code>)
(<a
href="https://redirect.github.com/psf/black/issues/4440">#4440</a>)</li>
</ul>
<h3>Caching</h3>
<ul>
<li>Fix bug where the cache was shared between runs with and without
<code>--unstable</code> (<a
href="https://redirect.github.com/psf/black/issues/4466">#4466</a>)</li>
</ul>
<h3>Packaging</h3>
<ul>
<li>Upgrade version of mypyc used to 1.12 beta (<a
href="https://redirect.github.com/psf/black/issues/4450">#4450</a>) (<a
href="https://redirect.github.com/psf/black/issues/4449">#4449</a>)</li>
<li><code>blackd</code> now requires a newer version of aiohttp. (<a
href="https://redirect.github.com/psf/black/issues/4451">#4451</a>)</li>
</ul>
<h3>Output</h3>
<ul>
<li>Added Python target version information on parse error (<a
href="https://redirect.github.com/psf/black/issues/4378">#4378</a>)</li>
<li>Add information about Black version to internal error messages (<a
href="https://redirect.github.com/psf/black/issues/4457">#4457</a>)</li>
</ul>
<h2>24.8.0</h2>
<h3>Stable style</h3>
<ul>
<li>Fix crash when <code># fmt: off</code> is used before a closing
parenthesis or bracket. (<a
href="https://redirect.github.com/psf/black/issues/4363">#4363</a>)</li>
</ul>
<h3>Packaging</h3>
<ul>
<li>Packaging metadata updated: docs are explictly linked, the issue
tracker is now also
linked. This improves the PyPI listing for Black. (<a
href="https://redirect.github.com/psf/black/issues/4345">#4345</a>)</li>
</ul>
<h3>Parser</h3>
<ul>
<li>Fix regression where Black failed to parse a multiline f-string
containing another
multiline string (<a
href="https://redirect.github.com/psf/black/issues/4339">#4339</a>)</li>
</ul>
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="1b2427a2b7"><code>1b2427a</code></a>
Prepare release 24.10.0 (<a
href="https://redirect.github.com/psf/black/issues/4471">#4471</a>)</li>
<li><a
href="a22b1ebbfd"><code>a22b1eb</code></a>
Add mypyc 3.13 wheel build (<a
href="https://redirect.github.com/psf/black/issues/4449">#4449</a>)</li>
<li><a
href="b7d0e7212b"><code>b7d0e72</code></a>
Bump AndreMiras/coveralls-python-action from
65c1672f0b8a201702d86c81b79187df...</li>
<li><a
href="f1a2f92bba"><code>f1a2f92</code></a>
Include --unstable in cache key (<a
href="https://redirect.github.com/psf/black/issues/4466">#4466</a>)</li>
<li><a
href="8d9d18c033"><code>8d9d18c</code></a>
Fix skipping Jupyter cells with unknown %% magic (<a
href="https://redirect.github.com/psf/black/issues/4462">#4462</a>)</li>
<li><a
href="bbfdba3a5e"><code>bbfdba3</code></a>
Fix docs CI: use venv for uv to fix 'failed to create directory' (<a
href="https://redirect.github.com/psf/black/issues/4460">#4460</a>)</li>
<li><a
href="8fb2add1f7"><code>8fb2add</code></a>
Use builtin generics (<a
href="https://redirect.github.com/psf/black/issues/4458">#4458</a>)</li>
<li><a
href="2a45cecf29"><code>2a45cec</code></a>
Fix crashes with comments in parentheses (<a
href="https://redirect.github.com/psf/black/issues/4453">#4453</a>)</li>
<li><a
href="b4d6d8632d"><code>b4d6d86</code></a>
Drop Python 3.8 support (<a
href="https://redirect.github.com/psf/black/issues/4452">#4452</a>)</li>
<li><a
href="ac018c16ca"><code>ac018c1</code></a>
Require newer aiohttp for blackd (<a
href="https://redirect.github.com/psf/black/issues/4451">#4451</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/psf/black/compare/24.2.0...24.10.0">compare
view</a></li>
</ul>
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
### Description
This PR uses subgroups to implement MatMulNBits when tile_m > 1 for
Intel devices.
With this PR, prefill for a 500-token prompt for phi3 goes from 8.5 s to
3.5 s on Intel Meteor Lake.
### Description
The spec of LayerNormalization supports broadcasting (tensors Scale and B
should be unidirectionally broadcastable to tensor X).
https://onnx.ai/onnx/operators/onnx__LayerNormalization.html
However, the current implementation only allows the scale and bias shapes to be
X.shape()[axis:].
Example of input tensors normalized with axis=2:
| X shape | Scale shape | B shape | Before | After |
| - | - | - | - | - |
| (B, S, D) | (D) | (D) | Supported | Supported |
| (B, S, D) | (1, 1, D) | (1, 1, D) | Supported | Supported |
| (B, S, D) | (B, 1, D) | (B, 1, D) | Not Supported | Supported |
| (B, S, D) | (1, S, D) | (1, S, D) | Not Supported | Supported |
| (B, S, D) | (B, S, D) | (B, S, D) | Not Supported | Supported |
Here we add limited support: axis=2, scale and bias have the same shape,
and scale/bias/X have the same number of dimensions. This covers common use
cases in LLM and vision models.
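A rough sketch of the added shape check, for illustration only (the function and variable names are hypothetical, not the actual kernel code), covering the newly added path where scale/bias have the same rank as X:
```cpp
#include <cstdint>
#include <vector>

// Illustrative only: the limited broadcast support described above.
bool IsSupportedScaleBiasBroadcast(const std::vector<int64_t>& x_shape,
                                   const std::vector<int64_t>& scale_shape,
                                   const std::vector<int64_t>& bias_shape,
                                   int64_t axis) {
  if (axis != 2) return false;                              // only axis=2 is handled
  if (scale_shape != bias_shape) return false;              // scale and bias must match
  if (scale_shape.size() != x_shape.size()) return false;   // same rank as X
  // Each scale/bias dim must be 1 or equal to the corresponding X dim
  // (unidirectional broadcasting).
  for (size_t i = 0; i < x_shape.size(); ++i) {
    if (scale_shape[i] != 1 && scale_shape[i] != x_shape[i]) return false;
  }
  return true;
}
```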
### Motivation and Context
Support Stable Diffusion 3.x and Flux model.
### Description
Fixes the build when specifying the flag `--target
onnxruntime_providers_webgpu`.
Otherwise the following error occurs:
```
range.cc
D:\code\onnxruntime\build\Windows\Debug\_deps\onnx-src\onnx\onnx_pb.h(65,10): error C1083: Cannot open include file: 'onnx/onnx-ml.pb.h': No such file or directory [D:\code\onnxruntime\build\Windows\Debug\onnxruntime_providers_webgpu.vcxproj]
(compiling source file '../../../onnxruntime/core/providers/webgpu/math/binary_elementwise_ops.cc')
```
Fix some inconsistencies.
All our iOS builds should target iOS 15.1.
All our macOS desktop builds should target macOS 13.3 to align with the
changes made in #17361.
### Description
Fix an error causing incorrect output when past key/value shares a buffer
with present key/value.
### Description
BUG #23273
With this change, the ConvTranspose time in that bug goes from ~90 s to ~7 s
on my Meteor Lake.
This PR does the following:
1. Use the stride to update the increment in the loop.
In the bug, the stride is 1024, which greatly reduces the number of loop iterations.
2. Support components for A to reduce the number of memory accesses.
3. When the number of output channels is 1, the B components can be the same as A's
to further reduce the number of memory accesses.
### Description
This PR tries to fix #22615 (see the detailed description in the issue).
A perfect solution would be too difficult to make, because there is a
huge number of combinations of usage scenarios, including combinations
of development framework, bundler, dev/prod mode, and so on.
This PR uses the following approach:
- Introduce a new type of end-to-end test: the export test. These tests are
complete web apps that use popular web development frameworks, and they use
puppeteer to run the apps and check that the apps run without errors.
- Add one Next.js-based web app and one Vite-based web app.
- In the test, perform the following steps:
- `npm install` for packages built locally
- `npm run dev` to start the dev server and use puppeteer to launch the
browser to test
- `npm run build && npm run start` to test the prod build and use puppeteer
to launch the browser to test
- Make changes to ort-web, including:
- special handling of Webpack's behavior of rewriting `import.meta.url`
to a `file://` string
- revising the build definitions
- fixing the wasm URL for the proxy, if used in a bundled build
The new images contain the following updates:
1. Added Git, Ninja and VCPKG to all docker images.
2. Updated the CPU containers' GCC version from 12 to 14.
3. Pinned the CUDA 12 images' cuDNN version to 9.5 (the latest is 9.6).
4. Addressed container supply chain warnings by building the CUDA 12 images
from scratch (avoiding NVIDIA's prebuilt images).
5. Updated the manylinux commit id to
75aeda9d18eafb323b00620537c8b4097d4bef48.
Also, this PR updated some source code to make the CPU EP compatible
with GCC 14.
### Description
Added a fatal error message for the unsupported GroupQueryAttention
do_rotary attribute.
### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/22987
Helps the user understand that this attribute is not supported.
Currently we have Clip/Relu-with-Q fusion at level 2, but for EPs that
use NodeUnit, these optimizers are not applied. If we want to
remove such redundant Clip/Relu nodes, we would need to add code to handle them
for each EP separately.
This PR detects when a Clip/Relu is made redundant by a Q node and adds this
information to the corresponding QDQ NodeUnit, so that EPs can ignore
the Clip/Relu and handle only the target node in the QDQ NodeUnit.
### Description
Always make sure resources and callbacks are cleaned up
### Motivation and Context
We've seen problems where the log callback isn't deregistered, which can lead to crashes.
---------
Co-authored-by: Adrian Lizarraga <adrianlm2@gmail.com>
### Description
Add a temporary patch to RN 0.69.3 to update the boost URL.
### Motivation and Context
Fix the React Native CI until we update RN to 0.70.15 or 0.73.3+.
ONNX's MatMul is the same as numpy.matmul, which supports input tensors with
rank >= 1, but QNN's MatMul only supports input tensors with rank >= 2.
This PR adds a MatMulOpBuilder for QNN EP to build a QNN graph that supports
all possible cases of ONNX's MatMul, by adding Reshape nodes where
necessary, e.g., reshaping a 1D input to 2D if one exists, and reshaping the
output to the expected shape at the end.
This PR also tries to use the FullyConnected op for MatMul if the 2nd input is a
2D initializer or a 1D tensor, because FullyConnected is faster than MatMul
on QNN EP. If the 2nd input is a 2D tensor, we require it to be an initializer
because FullyConnected requires the 2nd input in [n, k] shape; we can
transpose it during graph building if it's an initializer (we don't want
to add an extra Transpose node).
Using the swin_base model as an example, which contains several MatMul nodes
whose 2nd input is a 2D initializer (not followed by Add), running on a Gen3
mobile device takes 34.8876 ms before the change and 27.0639 ms after.
Some quantized models have QDQ around Conv/Gemm but the weight and/or
bias are not quantized. This PR adds a WeightBiasQuantization optimizer to
quantize the float weight and/or bias to INT8 and INT32 tensors,
respectively. We only do this for weight and/or bias initializers so that
ConstantFolding will fold the sub-graph into real quantized initializers
during the next round of graph optimization.
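For illustration, the usual QDQ convention this relies on is to quantize the bias to INT32 with scale equal to input_scale * weight_scale and a zero point of 0 (a sketch under that assumption, not the optimizer's actual code):
```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Illustrative sketch of per-tensor bias quantization under the standard QDQ
// convention: bias_scale = input_scale * weight_scale, zero point = 0.
std::vector<int32_t> QuantizeBiasToInt32(const std::vector<float>& bias,
                                         float input_scale,
                                         float weight_scale) {
  const float bias_scale = input_scale * weight_scale;
  std::vector<int32_t> quantized(bias.size());
  for (size_t i = 0; i < bias.size(); ++i) {
    quantized[i] = static_cast<int32_t>(std::lround(bias[i] / bias_scale));
  }
  return quantized;
}
```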
### Description
Fix comparison of narrow type with wide type in loop condition.
### Motivation and Context
Comparison between types of different widths in a loop condition can
cause the loop to fail to terminate.
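As a generic illustration of the bug class (not the exact code touched by this PR):
```cpp
#include <cstddef>
#include <cstdint>

// Illustrative only. If `count` can exceed INT32_MAX, comparing a 32-bit
// induction variable against it is broken: `i` overflows before the condition
// ever becomes false (signed overflow is undefined behavior), so the loop may
// never terminate.
//
//   for (int32_t i = 0; i < count; ++i) { ... }   // narrow vs. wide: bad
//
// Fix: make the induction variable at least as wide as the bound.
void Touch(const float* data, size_t count) {
  for (size_t i = 0; i < count; ++i) {
    (void)data[i];
  }
}
```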
### Description
Fusing Pad & AveragePool requires the AveragePool to use
`count_include_pad=1`. If the AveragePool already sets some padding and
`count_include_pad=0`, the fusion can't happen.
This PR adds a condition to perform the fusion depending on those
attributes. If the fusion occurs, `count_include_pad` is always set to `1`.
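A sketch of that eligibility check, with hypothetical names (not the actual optimizer code):
```cpp
#include <cstdint>
#include <vector>

// Illustrative only: fusion of a preceding Pad into AveragePool is skipped when
// the pool already has its own non-zero pads and excludes them from the average
// (count_include_pad == 0); otherwise the fusion proceeds and the fused node
// gets count_include_pad = 1.
bool CanFusePadIntoAveragePool(const std::vector<int64_t>& pool_pads,
                               int64_t count_include_pad) {
  bool pool_has_own_padding = false;
  for (int64_t p : pool_pads) {
    if (p != 0) {
      pool_has_own_padding = true;
      break;
    }
  }
  return !(pool_has_own_padding && count_include_pad == 0);
}
```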
### Motivation and Context
Fixes #22177 (mislabelled as a performance issue, but there's an actual bug
in the implementation).
Bug introduced in #21556
### Description
This PR 1) uses the override shape instead of the tensor's original shape in the
shader key to reduce shader variants; 2) adds the indices shape rank to the
shader key to avoid potential errors.
### Description
Separates the result processor out from profiler.py without changing the
behavior of the current profile.py.
### Motivation and Context
Fewer dependencies and smaller code for processing profiles from other
scenarios.
---------
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
### Description
1. Currently the Python-Cuda-Publishing-Pipeline only publishes Linux
wheels, not Windows wheels. This is because we recently refactored the
upstream pipeline ("Python-CUDA-Packaging-Pipeline") to use 1ES PT. This
PR fixes the issue.
2. tools/ci_build/github/azure-pipelines/stages/py-win-gpu-stage.yml no
longer includes component-governance-component-detection-steps.yml,
because 1ES PT already inserts it.
3. Delete tools/ci_build/github/windows/eager/requirements.txt because
it is no longer used.
### Motivation and Context
The "Python-CUDA-Packaging-Pipeline" is for CUDA 12.
"Python CUDA ALT Packaging Pipeline" is for CUDA 11.
The two pipelines are very similar, except the CUDA versions are
different.
Each of them has three parts: build, test, publish.
"Python-CUDA-Packaging-Pipeline" is the first part: build.
"Python CUDA12 Package Test Pipeline" is the second part.
"Python-Cuda-Publishing-Pipeline" is the third part that publishes the
packages to an internal ADO feed.
Move the Linux GPU CI pipeline to A10 machines, which are more advanced, and
retire the onnxruntime-Linux-GPU-T4 machine pool.
Disable the run_lean_attention test because the new machines do not have
enough shared memory:
```
skip loading trt attention kernel fmha_mhca_fp16_128_256_sm86_kernel because no enough shared memory
[E:onnxruntime:, sequential_executor.cc:505 ExecuteKernel] Non-zero status code returned while running MultiHeadAttention node. Name:'MultiHeadAttention_0' Status Message: CUDA error cudaErrorInvalidValue:invalid argument
```
### Description
This PR makes it convenient to do post-processing on the generated JSON file
when profiling is enabled. The kernel type can be used to aggregate the
overall time of kernels of the same type.
### Description
Use `https.get` instead of `fetch` in the ORT Node.js binding package install
script.
### Motivation and Context
According to discussions in #23232, the package `global-agent` cannot
work with the `fetch` API. To make the install script work with the proxy agent,
this PR replaces the `fetch` API with `https.get`.