### Description
(1) Update BiasGelu fusion to support onnx Gelu-20
Since onnx Gelu-20 supports float/double/bf16/fp16, here we update
related ops to support these data types in CUDA and ROCm execution
providers:
(2) Add double support for Gelu/FastGelu op in CUDA/ROCm execution
provider
(3) Add BFloat16 support for Gelu ops in CUDA execution provider
(4) Add unit tests
(5) Update operator documents
### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/23491
* [CPU EP] Implement Add/Sub/Mul/Div element wise operations for
(u)int8, (u)int16, uint32 and uint64.
* [CPU EP] Implement Neg unary operation for int16
* [CUDA EP] Implement Add/Sub/Mul/Div element wise operations for
(u)int8 and (u)int16
### Motivation and Context
This solves https://github.com/microsoft/onnxruntime/issues/23051
### Description
Follow-up to #21897.
To be compatible with ONNX 1.17.0, opset 22 must be registered for the
[updated operators
(bfloat16)](https://github.com/onnx/onnx/releases/tag/v1.17.0).
### Motivation and Context
Fix #23162, Fix #23161, Fix #23164 (Xnnpack)
### Remaining issue
#23163 (QNN) See [the
file](https://github.com/microsoft/onnxruntime/pull/23344/files#diff-04f5d6db0a6873f7299ed06ff1ec45a49e69f0865cb32f4397cd56db0cd0a784)
### Result of `find_optimizer_opset_version_updates_required.py` (CPU only)
```
[WARNING] - Newer opset found for kOnnxDomain.Conv. Latest:22 Optimizer support ends at 11. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/conv_add_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.IsInf. Latest:20 Optimizer support ends at 10. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/isinf_reducesum_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.Cast. Latest:21 Optimizer support ends at 19. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/isinf_reducesum_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.Cast. Latest:21 Optimizer support ends at 19. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/isinf_reducesum_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.HardSigmoid. Latest:22 Optimizer support ends at 6. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/conv_add_act_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.Cast. Latest:21 Optimizer support ends at 19. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/layer_norm_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.Cast. Latest:21 Optimizer support ends at 19. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/layer_norm_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.Cast. Latest:21 Optimizer support ends at 19. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/layer_norm_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.Cast. Latest:21 Optimizer support ends at 19. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/layer_norm_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.Cast. Latest:21 Optimizer support ends at 19. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/layer_norm_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.Cast. Latest:21 Optimizer support ends at 19. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/layer_norm_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.Transpose. Latest:21 Optimizer support ends at 13. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/nchwc_transformer.cc
[WARNING] - Newer opset found for kOnnxDomain.Conv. Latest:22 Optimizer support ends at 11. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/nchwc_transformer.cc
[WARNING] - Newer opset found for kOnnxDomain.MaxPool. Latest:22 Optimizer support ends at 12. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/nchwc_transformer.cc
[WARNING] - Newer opset found for kOnnxDomain.AveragePool. Latest:22 Optimizer support ends at 11. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/nchwc_transformer.cc
[WARNING] - Newer opset found for kOnnxDomain.BatchNormalization. Latest:15 Optimizer support ends at 14. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/nchwc_transformer.cc
[WARNING] - Newer opset found for kOnnxDomain.Transpose. Latest:21 Optimizer support ends at 13. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/nchwc_transformer.cc
[WARNING] - Newer opset found for kOnnxDomain.Upsample. Latest:10 Optimizer support ends at 13. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/nchwc_transformer.cc
[WARNING] - Newer opset found for kOnnxDomain.Resize. Latest:19 Optimizer support ends at 13. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/nchwc_transformer.cc
[WARNING] - Newer opset found for kOnnxDomain.GlobalMaxPool. Latest:22 Optimizer support ends at 1. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/nchwc_transformer.cc
[WARNING] - Newer opset found for kOnnxDomain.GlobalAveragePool. Latest:22 Optimizer support ends at 1. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/nchwc_transformer.cc
[WARNING] - Newer opset found for kOnnxDomain.Shape. Latest:21 Optimizer support ends at 19. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/pre_shape_node_elimination.cc
[WARNING] - Newer opset found for kOnnxDomain.Conv. Latest:22 Optimizer support ends at 11. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/conv_bn_fusion.cc
[ERROR] - Call/Declaration is split over multiple lines. Please check manually.File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/label_encoder_fusion.cc Line:49
[ERROR] - Failed to find version information for "ai.onnx.ml".LabelEncoder. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/label_encoder_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.HardSigmoid. Latest:22 Optimizer support ends at 6. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/conv_activation_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.Dropout. Latest:22 Optimizer support ends at 13. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/dropout_elimination.cc
[WARNING] - Newer opset found for kOnnxDomain.Transpose. Latest:21 Optimizer support ends at 13. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/gemm_transpose_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.Transpose. Latest:21 Optimizer support ends at 13. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/gemm_transpose_fusion.cc
[ERROR] - Symbolic name of 'ignorable_nodes[index].first' found for op. Please check manually. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/matmul_bn_fusion.cc
[ERROR] - Symbolic name of 'dest.first' found for op. Please check manually. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/matmul_bn_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.Conv. Latest:22 Optimizer support ends at 11. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/pad_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.AveragePool. Latest:22 Optimizer support ends at 19. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/pad_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.MaxPool. Latest:22 Optimizer support ends at 12. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/pad_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.Pad. Latest:21 Optimizer support ends at 19. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/pad_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.Cast. Latest:21 Optimizer support ends at 13. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/pad_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.Dropout. Latest:22 Optimizer support ends at 13. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/bias_dropout_fusion.cc
[ERROR] - Failed to find version information for kMSDomain.BitmaskDropout. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/bias_dropout_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.Clip. Latest:13 Optimizer support ends at 6. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/relu_clip_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.Cast. Latest:21 Optimizer support ends at 19. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/fast_gelu_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.Cast. Latest:21 Optimizer support ends at 19. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/fast_gelu_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.Reshape. Latest:21 Optimizer support ends at 14. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/reshape_fusion.cc
[ERROR] - Failed to find version information for kMSDomain.ConcatTraining. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/reshape_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.Where. Latest:16 Optimizer support ends at 9. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/not_where_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.Where. Latest:16 Optimizer support ends at 9. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/not_where_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.Conv. Latest:22 Optimizer support ends at 11. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/conv_mul_fusion.cc
[ERROR] - Symbolic name of 'QOpName' found for op. Please check manually. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/qdq_transformer/qdq_util.cc
[ERROR] - Symbolic name of 'QOpName' found for op. Please check manually. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/qdq_transformer/qdq_util.cc
[ERROR] - Symbolic name of 'DQOpName' found for op. Please check manually. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/qdq_transformer/qdq_util.cc
[ERROR] - Symbolic name of 'DQOpName' found for op. Please check manually. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/qdq_transformer/qdq_util.cc
[ERROR] - Call/Declaration is split over multiple lines. Please check manually.File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/qdq_transformer/avx2_weight_s8_to_u8.cc Line:170
[WARNING] - Newer opset found for kOnnxDomain.MaxPool. Latest:22 Optimizer support ends at 12. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/qdq_transformer/qdq_propagation.cc
[ERROR] - Symbolic name of 'current_node.OpType(' found for op. Please check manually. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/compute_optimizer/upstream_transformer_base.cc
[WARNING] - Newer opset found for kOnnxDomain.Reshape. Latest:21 Optimizer support ends at 14. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/compute_optimizer/upstream_reshape.cc
[WARNING] - Newer opset found for kOnnxDomain.Transpose. Latest:21 Optimizer support ends at 13. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/attention_fusion_helper.h
```
Use ruff as the code formatter in place of black and isort, since it is
much faster and projects like PyTorch and ONNX have adopted ruff
format as well.
This PR includes only auto-fixed formatting changes.
### Description
- Implemented the DepthToSpace uint8_t kernel.
- Enabled DropQDQNodesRules for DepthToSpace.
- Added unit tests for the DepthToSpace uint8_t kernel.
### Motivation and Context
This commit improves the performance of the Image Super-Resolution INT8
model (RFDN). Specifically, it improves inferences per second (IPS) by
25%.
Bumps [ruff](https://github.com/astral-sh/ruff) from 0.5.4 to 0.9.1.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/astral-sh/ruff/releases">ruff's
releases</a>.</em></p>
<blockquote>
<h2>0.9.1</h2>
<h2>Release Notes</h2>
<h3>Preview features</h3>
<ul>
<li>[<code>pycodestyle</code>] Run
<code>too-many-newlines-at-end-of-file</code> on each cell in notebooks
(<code>W391</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/pull/15308">#15308</a>)</li>
<li>[<code>ruff</code>] Omit diagnostic for shadowed private function
parameters in <code>used-dummy-variable</code> (<code>RUF052</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/pull/15376">#15376</a>)</li>
</ul>
<h3>Rule changes</h3>
<ul>
<li>[<code>flake8-bugbear</code>] Improve
<code>assert-raises-exception</code> message (<code>B017</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/pull/15389">#15389</a>)</li>
</ul>
<h3>Formatter</h3>
<ul>
<li>Preserve trailing end-of line comments for the last string literal
in implicitly concatenated strings (<a
href="https://redirect.github.com/astral-sh/ruff/pull/15378">#15378</a>)</li>
</ul>
<h3>Server</h3>
<ul>
<li>Fix a bug where the server and client notebooks were out of sync
after reordering cells (<a
href="https://redirect.github.com/astral-sh/ruff/pull/15398">#15398</a>)</li>
</ul>
<h3>Bug fixes</h3>
<ul>
<li>[<code>flake8-pie</code>] Correctly remove wrapping parentheses
(<code>PIE800</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/pull/15394">#15394</a>)</li>
<li>[<code>pyupgrade</code>] Handle comments and multiline expressions
correctly (<code>UP037</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/pull/15337">#15337</a>)</li>
</ul>
<h2>Contributors</h2>
<ul>
<li><a
href="https://github.com/AntoineD"><code>@AntoineD</code></a></li>
<li><a
href="https://github.com/InSyncWithFoo"><code>@InSyncWithFoo</code></a></li>
<li><a
href="https://github.com/MichaReiser"><code>@MichaReiser</code></a></li>
<li><a href="https://github.com/calumy"><code>@calumy</code></a></li>
<li><a
href="https://github.com/dcreager"><code>@dcreager</code></a></li>
<li><a
href="https://github.com/dhruvmanila"><code>@dhruvmanila</code></a></li>
<li><a href="https://github.com/dylwil3"><code>@dylwil3</code></a></li>
<li><a href="https://github.com/sharkdp"><code>@sharkdp</code></a></li>
<li><a href="https://github.com/tjkuson"><code>@tjkuson</code></a></li>
</ul>
<h2>Install ruff 0.9.1</h2>
<h3>Install prebuilt binaries via shell script</h3>
<pre lang="sh"><code>curl --proto '=https' --tlsv1.2 -LsSf
https://github.com/astral-sh/ruff/releases/download/0.9.1/ruff-installer.sh
| sh
</code></pre>
<h3>Install prebuilt binaries via powershell script</h3>
<pre lang="sh"><code>powershell -ExecutionPolicy ByPass -c "irm
https://github.com/astral-sh/ruff/releases/download/0.9.1/ruff-installer.ps1
| iex"
</code></pre>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/astral-sh/ruff/blob/main/CHANGELOG.md">ruff's
changelog</a>.</em></p>
<blockquote>
<h2>0.9.1</h2>
<h3>Preview features</h3>
<ul>
<li>[<code>pycodestyle</code>] Run
<code>too-many-newlines-at-end-of-file</code> on each cell in notebooks
(<code>W391</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/pull/15308">#15308</a>)</li>
<li>[<code>ruff</code>] Omit diagnostic for shadowed private function
parameters in <code>used-dummy-variable</code> (<code>RUF052</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/pull/15376">#15376</a>)</li>
</ul>
<h3>Rule changes</h3>
<ul>
<li>[<code>flake8-bugbear</code>] Improve
<code>assert-raises-exception</code> message (<code>B017</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/pull/15389">#15389</a>)</li>
</ul>
<h3>Formatter</h3>
<ul>
<li>Preserve trailing end-of line comments for the last string literal
in implicitly concatenated strings (<a
href="https://redirect.github.com/astral-sh/ruff/pull/15378">#15378</a>)</li>
</ul>
<h3>Server</h3>
<ul>
<li>Fix a bug where the server and client notebooks were out of sync
after reordering cells (<a
href="https://redirect.github.com/astral-sh/ruff/pull/15398">#15398</a>)</li>
</ul>
<h3>Bug fixes</h3>
<ul>
<li>[<code>flake8-pie</code>] Correctly remove wrapping parentheses
(<code>PIE800</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/pull/15394">#15394</a>)</li>
<li>[<code>pyupgrade</code>] Handle comments and multiline expressions
correctly (<code>UP037</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/pull/15337">#15337</a>)</li>
</ul>
<h2>0.9.0</h2>
<p>Check out the <a href="https://astral.sh/blog/ruff-v0.9.0">blog
post</a> for a migration guide and overview of the changes!</p>
<h3>Breaking changes</h3>
<p>Ruff now formats your code according to the 2025 style guide. As a
result, your code might now get formatted differently. See the formatter
section for a detailed list of changes.</p>
<p>This release doesn’t remove or remap any existing stable rules.</p>
<h3>Stabilization</h3>
<p>The following rules have been stabilized and are no longer in
preview:</p>
<ul>
<li><a
href="https://docs.astral.sh/ruff/rules/stdlib-module-shadowing/"><code>stdlib-module-shadowing</code></a>
(<code>A005</code>).
This rule has also been renamed: previously, it was called
<code>builtin-module-shadowing</code>.</li>
<li><a
href="https://docs.astral.sh/ruff/rules/builtin-lambda-argument-shadowing/"><code>builtin-lambda-argument-shadowing</code></a>
(<code>A006</code>)</li>
<li><a
href="https://docs.astral.sh/ruff/rules/slice-to-remove-prefix-or-suffix/"><code>slice-to-remove-prefix-or-suffix</code></a>
(<code>FURB188</code>)</li>
<li><a
href="https://docs.astral.sh/ruff/rules/boolean-chained-comparison/"><code>boolean-chained-comparison</code></a>
(<code>PLR1716</code>)</li>
<li><a
href="https://docs.astral.sh/ruff/rules/decimal-from-float-literal/"><code>decimal-from-float-literal</code></a>
(<code>RUF032</code>)</li>
<li><a
href="https://docs.astral.sh/ruff/rules/post-init-default/"><code>post-init-default</code></a>
(<code>RUF033</code>)</li>
<li><a
href="https://docs.astral.sh/ruff/rules/useless-if-else/"><code>useless-if-else</code></a>
(<code>RUF034</code>)</li>
</ul>
<p>The following behaviors have been stabilized:</p>
<ul>
<li><a
href="https://docs.astral.sh/ruff/rules/pytest-parametrize-names-wrong-type/"><code>pytest-parametrize-names-wrong-type</code></a>
(<code>PT006</code>): Detect <a
href="https://docs.pytest.org/en/7.1.x/how-to/parametrize.html#parametrize"><code>pytest.parametrize</code></a>
calls outside decorators and calls with keyword arguments.</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="12f86f39a4"><code>12f86f3</code></a>
Ruff 0.9.1 (<a
href="https://redirect.github.com/astral-sh/ruff/issues/15407">#15407</a>)</li>
<li><a
href="2b28d566a4"><code>2b28d56</code></a>
Associate a trailing end-of-line comment in a parenthesized implicit
concaten...</li>
<li><a
href="adca7bd95c"><code>adca7bd</code></a>
Remove pygments pin (<a
href="https://redirect.github.com/astral-sh/ruff/issues/15404">#15404</a>)</li>
<li><a
href="6b98a26452"><code>6b98a26</code></a>
[red-knot] Support <code>assert_type</code> (<a
href="https://redirect.github.com/astral-sh/ruff/issues/15194">#15194</a>)</li>
<li><a
href="c87463842a"><code>c874638</code></a>
[red-knot] Move tuple-containing-Never tests to Markdown (<a
href="https://redirect.github.com/astral-sh/ruff/issues/15402">#15402</a>)</li>
<li><a
href="c364b586f9"><code>c364b58</code></a>
[<code>flake8-pie</code>] Correctly remove wrapping parentheses
(<code>PIE800</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/issues/15394">#15394</a>)</li>
<li><a
href="73d424ee5e"><code>73d424e</code></a>
Fix outdated doc for handling the default file types with the pre-commit
hook...</li>
<li><a
href="6e9ff445fd"><code>6e9ff44</code></a>
Insert the cells from the <code>start</code> position (<a
href="https://redirect.github.com/astral-sh/ruff/issues/15398">#15398</a>)</li>
<li><a
href="f2c3ddc5ea"><code>f2c3ddc</code></a>
[red-knot] Move intersection type tests to Markdown (<a
href="https://redirect.github.com/astral-sh/ruff/issues/15396">#15396</a>)</li>
<li><a
href="b861551b6a"><code>b861551</code></a>
Remove unnecessary backticks (<a
href="https://redirect.github.com/astral-sh/ruff/issues/15393">#15393</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/astral-sh/ruff/compare/0.5.4...0.9.1">compare
view</a></li>
</ul>
</details>
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
### Description
Enhancements to EPContext Operations:
1. Introduced support for the bfloat16 data type in EPContext operations.
2. Bug fix: the custom op schema was not registered when generating an EPContext ONNX model.
---------
Co-authored-by: mingyue <mingyue@xilinx.com>
Co-authored-by: Hector Li <hecli@microsoft.com>
### Description
* Update python version metadata to be in sync with latest python
packages (onnxruntime, onnxruntime-gpu and onnxruntime-qnn).
* Update black format target-version to 3.10, and use lintrunner to
format all files.
* Update the lintrunner installation command line to be consistent.
* Include `requirements-lintrunner.txt` in `requirements-dev.txt` to
avoid duplicated settings.
### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/22993
Python support by numpy:
https://numpy.org/neps/nep-0029-deprecation_policy.html#drop-schedule
```
On Apr 05, 2024 drop support for Python 3.9
On Apr 04, 2025 drop support for Python 3.10
```
### Description
Enable the QNN HTP spill fill buffer setting to reduce RAM usage.
This feature is available after QNN 2.28. The QNN context binary must
be regenerated.
https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/htp_backend.html#qnn-htp-backend-api
Requirements:
1. Regenerate the ONNX model with the QNN context binary by setting the
EP option enable_htp_spill_fill_buffer = 1.
2. The feature works for a model with multiple context binaries. The two
ONNX models with context binaries must be merged manually into one ONNX model.
3. Generating the context binary offline requires a Linux platform, since
the QnnSystem lib is not available for the Windows x86_64 platform.
No extra steps are needed when running model inference.
The generated EPContext node will have a max_size attribute with the
maximum spill fill buffer size for the context binary.
<img width="353" alt="image"
src="https://github.com/user-attachments/assets/a3bf48be-a8da-4381-8a1d-3f2558eea37d">
---------
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
### Description
#22380 removes the file
`tools/ci_build/github/linux/docker/inference/x86_64/python/cpu/scripts/requirements.txt`
but it is still used in `dockerfiles/Dockerfile.cuda`.
This change updates the file path of the requirements.txt.
Fixes #22945.
### Description
This PR registers GroupNormalization for opset 21
### Motivation and Context
### Description
This PR registers the following opset 21 operators:
- Identity-21
- QLinearMatMul-21
### Motivation and Context
### Description
* Build cuda nhwc ops by default.
* Deprecate `--enable_cuda_nhwc_ops` in build.py and add
`--disable_cuda_nhwc_ops` option
Note that it requires cuDNN 9.x. If you build with cuDNN 8, NHWC ops
will be disabled automatically.
### Motivation and Context
In general, NHWC is faster than NCHW for convolution in Nvidia GPUs with
Tensor Cores, and this could improve performance for vision models.
This is the first step toward preferring NHWC for CUDA in the 1.21
release. The next step is to run tests on popular vision models. If it
helps on most models and devices, set `prefer_nhwc=1` as the default CUDA
provider option.
### Description
Based on https://github.com/microsoft/onnxruntime/pull/9700, and extend
it to ArgMin as well.
This pull request introduces several enhancements and fixes related to
the `ArgMax` and `ArgMin` operators in the CUDA execution provider. The
changes ensure proper handling of these operators across different
versions and improve kernel registration and fallback mechanisms.
Key changes include:
#### Enhancements to `ArgMax` and `ArgMin` Operators:
* Added new kernel class registrations for `ArgMax` and `ArgMin` for
different data types and versions in
`onnxruntime/core/providers/cuda/cuda_execution_provider.cc`.
[[1]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R966-R972)
[[2]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R1209-R1215)
[[3]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R1657-R1659)
[[4]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285L1825-L1827)
[[5]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R1933-R1939)
[[6]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R2174-R2180)
* Introduced `ArgMaxOrArgMinNeedFallbackToCPU` function to handle
fallback to CPU when the `select_last_index` attribute is set to 1, as
CUDA does not support this attribute.
[[1]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R2597-R2622)
[[2]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R2672-R2674)
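To illustrate why `select_last_index` matters for the fallback decision, here is a minimal Python sketch of the tie-breaking semantics on a 1-D input. The function names `arg_extreme` and `needs_cpu_fallback` are hypothetical and are not the ORT implementation; the second one only mirrors the rule stated above (fall back to CPU when the attribute is 1).

```python
def arg_extreme(values, find_max=True, select_last_index=False):
    """Return the index of the max (or min) of a 1-D list.

    When the extreme value occurs more than once, select_last_index
    chooses between the first and the last occurrence.
    """
    best = max(values) if find_max else min(values)
    indices = [i for i, v in enumerate(values) if v == best]
    return indices[-1] if select_last_index else indices[0]


def needs_cpu_fallback(select_last_index):
    # Hypothetical stand-in for the check described above:
    # the CUDA kernels only cover select_last_index == 0.
    return select_last_index == 1
```

With `values = [1, 3, 3, 2]`, the default picks index 1 and `select_last_index=True` picks index 2, so the two settings differ only when the extreme value is repeated.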
#### Macro and Kernel Registration Improvements:
* Replaced `REGISTER_KERNEL_UNTIL_VERSIONED_TYPED` with
`REGISTER_KERNEL_VERSIONED_RANGE_TYPED` and
`REGISTER_KERNEL_VERSIONED_SINCE_TYPED` macros for better version
handling.
[[1]](diffhunk://#diff-ee5316fc3898058f70e942d9a84de36be4c7da09f144633a2504236430d5d033L19-R29)
[[2]](diffhunk://#diff-ee5316fc3898058f70e942d9a84de36be4c7da09f144633a2504236430d5d033L40-R46)
* Updated kernel registration for `ArgMax` and `ArgMin` to use the new
macros, ensuring proper version handling and support for different data
types.
#### Safety Checks:
* Added safety checks in the `ArgMax` and `ArgMin` classes to ensure
`select_last_index` is not set to 1, as it is not supported on CUDA.
[[1]](diffhunk://#diff-8ab09fef1f4a12cbf3b3432e509f8f1ef561e83c72778a0e047780060aeef6efL91-R99)
[[2]](diffhunk://#diff-8ab09fef1f4a12cbf3b3432e509f8f1ef561e83c72778a0e047780060aeef6efL101-R117)
#### Testing Enhancements:
* Added new tests for `ArgMax` and `ArgMin` operators to verify behavior
when `select_last_index` is set to 0, ensuring compatibility with both
CPU and CUDA execution providers.
[[1]](diffhunk://#diff-77affe1b70d1a9d38c2485f7c6b16ef2b6b541ed94dd727bc9b286f068f1481aR3340-R3360)
[[2]](diffhunk://#diff-77affe1b70d1a9d38c2485f7c6b16ef2b6b541ed94dd727bc9b286f068f1481aR3679-R3699)
### Motivation and Context
Improve CUDA kernel coverage for stable diffusion model and hence
improve its performance on CUDA
### Description
Add an I/O binding example using ONNX data types to the Python API
summary. The API has been available since the 1.20 release.
### Motivation and Context
Follow up of https://github.com/microsoft/onnxruntime/pull/22306 to add
some documentation.
### Description
This PR fixes an equation in the MatMulNBits op spec. The old formula is
stated as
```
[CeilDiv((N * n_blocks_per_col + 1) * bits, 8)]
```
but it should be stated as
```
[N * CeilDiv(n_blocks_per_col * bits, 8)]
```
or as
```
[N * FloorDiv((n_blocks_per_col + 1) * bits, 8)]
```
### Motivation and Context
For models such as ChatGLM where the column size is odd, the division
math can be off. For example:

With the old equation, the projections are calculated as follows.
```
# Down projection
B = 4,096 x 107 x 64
zero_points = 221,184
N = 4,096
n_blocks_per_col = 107
4,096 * CeilDiv((107 + 1) * 4, 8) = 4,096 * CeilDiv(108 * 4, 8) = 4,096 * 54 = 221,184
# Up projection
B = 13,696 x 32 x 64
zero_points = 219,136
N = 13,696
n_blocks_per_col = 32
13,696 * CeilDiv((32 + 1) * 4, 8) = 13,696 * CeilDiv(33 * 4, 8) = 13,696 * 17 = 232,832
```
With the new equation, the projections are calculated as follows.
```
# Down projection
B = 4,096 x 107 x 64
zero_points = 221,184
N = 4,096
n_blocks_per_col = 107
4,096 * CeilDiv(107 * 4, 8) = 4,096 * 54 = 221,184
# Up projection
B = 13,696 x 32 x 64
zero_points = 219,136
N = 13,696
n_blocks_per_col = 32
13,696 * CeilDiv(32 * 4, 8) = 13,696 * 16 = 219,136
```
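The arithmetic above can be checked with a short Python sketch. The helper names are hypothetical; `zero_points_size_old` follows the per-column form used in the worked example (`N * CeilDiv((n_blocks_per_col + 1) * bits, 8)`), and the other two helpers follow the two corrected forms given in the description.

```python
def ceil_div(a, b):
    """Integer division rounding up."""
    return -(-a // b)


def zero_points_size_old(n, n_blocks_per_col, bits):
    # Old form, as evaluated in the worked example above.
    return n * ceil_div((n_blocks_per_col + 1) * bits, 8)


def zero_points_size_new(n, n_blocks_per_col, bits):
    # Corrected form: N * CeilDiv(n_blocks_per_col * bits, 8).
    return n * ceil_div(n_blocks_per_col * bits, 8)


def zero_points_size_new_floor(n, n_blocks_per_col, bits):
    # Alternative corrected form: N * FloorDiv((n_blocks_per_col + 1) * bits, 8).
    return n * (((n_blocks_per_col + 1) * bits) // 8)


# Down projection: old and new forms agree.
down_old = zero_points_size_old(4096, 107, 4)
down_new = zero_points_size_new(4096, 107, 4)

# Up projection: the old form overshoots the actual zero_points size.
up_old = zero_points_size_old(13696, 32, 4)
up_new = zero_points_size_new(13696, 32, 4)
```

For 4-bit weights, both corrected forms give the same sizes on the two projections shown, while the old form differs on the up projection.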
`If` nodes can have sequence outputs. Those nodes are mapped to the DML
EP to be able to keep the outputs on the GPU, but they actually execute
on the CPU by selecting either the `then` subgraph or the `else`
subgraph.
Add `MLFloat16` support for:
- `LayerNormalization`
- `SimplifiedLayerNormalization`
- `SkipLayerNormalization`
- `SkipSimplifiedLayerNormalization`
There are existing `LayerNormTest` unit tests that cover the `MLFloat16`
functionality for `LayerNormalization` once `MLFloat16` is registered
(for example
[`LayerNormTest.LayerNorm_Scale_Float16Input`](91c916f9c6/onnxruntime/test/contrib_ops/layer_norm_op_test.cc (L112))).
Similarly, there are unit tests such as
[`SkipLayerNormTest.SkipLayerNormBatch1_Float16`](91c916f9c6/onnxruntime/test/contrib_ops/skiplayernorm_op_test.cc (L255))
that cover MLFloat16 inputs for `SkipLayerNormalization`.
### Description
### Motivation and Context
---------
Co-authored-by: Your Name <you@example.com>
### Description
* Add lintrunner to requirements-lintrunner.txt
* Lock lintrunner and lintrunner-adapter version
* Update documentation
### Motivation and Context
The document is not up to date.
### Description
This PR will add support for Continuous Decoding for batch_size = 1
input. From now on, GQA can take arbitrary length input using seqlens_k
as total_sequence_length - 1 and the sequence length of qkv as
new_sequence_length.
**This change will not affect the default behavior of GQA**
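As a rough illustration of the input bookkeeping described above, the sketch below computes the lengths for one continuous-decoding step with batch_size = 1. The helper name and the dictionary keys are hypothetical, and it assumes that `total_sequence_length` is the past length plus the new length; it does not call the GQA operator.

```python
def gqa_continuous_decoding_lengths(past_sequence_length, new_sequence_length):
    """Lengths for one continuous-decoding step with batch_size = 1.

    Assumption: total_sequence_length = past + new tokens.
    """
    total_sequence_length = past_sequence_length + new_sequence_length
    return {
        # One entry per batch item; batch_size is 1 here.
        "seqlens_k": [total_sequence_length - 1],
        "total_sequence_length": total_sequence_length,
        # Sequence length of the q/k/v inputs for this step.
        "new_sequence_length": new_sequence_length,
    }


# Example: 10 tokens already in the cache, 4 new tokens appended.
step = gqa_continuous_decoding_lengths(10, 4)
```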
### Motivation and Context
Prior to this change it was impossible to support sequence_length > 1
inputs when past context was given. This use case is essential to making
continuous decoding work, which is one of our current efforts in
ORT-GenAI.
### Description
Implement softcap for GQA.
### Motivation and Context
Fixes certain models, such as Gemma-2, that need softcap to avoid
producing NaN outputs.
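For reference, a minimal sketch of softcapping as it is commonly defined for Gemma-2 style attention: scores are squashed into the open interval (-softcap, softcap) with a scaled tanh. This assumes the GQA kernel uses the same definition and that a non-positive `softcap` disables it; `apply_softcap` is a hypothetical helper, not the kernel code.

```python
import math


def apply_softcap(score: float, softcap: float) -> float:
    """Softcap one attention score: softcap * tanh(score / softcap).

    Assumption: softcap <= 0 means the feature is turned off.
    """
    if softcap <= 0.0:
        return score
    return softcap * math.tanh(score / softcap)


# Small scores pass through almost unchanged; large scores saturate near softcap.
for s in (1.0, 100.0, 1e9):
    print(s, apply_softcap(s, 50.0))
```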
### Description
1. Added CUDA EP support for blocked quantization in the QuantizeLinear
and DequantizeLinear ops.
2. CUDA EP blocked quantization currently supports only int4/uint4
quantized types and float32/float16 unquantized types.
3. Added CUDA EP support to the QDQ selector/action transformer. CUDA EP
is added only to the DQ + MatMul -> MatMulNBits rule; EP support for the
other rules is unchanged.
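As a reminder of what blocked quantization means, here is a minimal one-dimensional sketch of blocked dequantization: each run of `block_size` consecutive elements shares one scale and one zero point, and each element is recovered as `(x - zero_point) * scale`. The function `dequantize_blocked_1d` is hypothetical and ignores int4 packing and multi-dimensional axes; it is not the CUDA kernel.

```python
from typing import List


def dequantize_blocked_1d(
    x: List[int], scales: List[float], zero_points: List[int], block_size: int
) -> List[float]:
    """Dequantize a 1-D tensor whose scale/zero point change every block_size elements."""
    expected_blocks = (len(x) + block_size - 1) // block_size
    if len(scales) != expected_blocks or len(zero_points) != expected_blocks:
        raise ValueError("need one scale and one zero point per block")
    return [
        (q - zero_points[i // block_size]) * scales[i // block_size]
        for i, q in enumerate(x)
    ]


# Two blocks of two elements each
print(dequantize_blocked_1d([1, 2, 3, 4], [0.5, 2.0], [0, 1], block_size=2))
```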
### Motivation and Context
ONNX opset 21 introduced blocked quantization for Q/DQ ops. ORT
previously supported blocked quantization only on the CPU EP.
### Description
Softmax (formula 1) is like the following:
```math
y_{i} = \frac{exp(x_{i})}{\sum_{i} exp(x_{i})}
```
After applying softmax, each element will be in the range of $(0, 1)$,
and the elements will add up to 1, so that they can be interpreted as
probabilities.
However, in language models, softmax has two issues:
* When all elements are -inf (for example, a whole row is masked when a
query token is padding), the result is undefined: exp(-inf) = 0, so the
formula above divides by zero.
* Why must we normalize so that every query word is treated as equally
important (each row sums to 1)?
**Smooth Softmax** (formula 2) is a modified version that introduces a
smooth factor like the following:
```math
s_{i} = \frac{exp(x_{i})}{1+ \sum_{i} exp(x_{i})}
```
This formula could tackle the above two issues:
* It could handle the special case that all elements are -inf: the
result $s_{i}$ is 0 for every element in such case.
* Sum of all elements $\sum_{i}{s_{i}} = \frac{\sum_{i}{exp(x_{i})}}{1+
\sum_{i} exp(x_{i})}$ is in the range of (0, 1), so that we can train
the model to assign different importance to different query words.
Since exponential is prone to overflow or underflow, to get stable
result, formula 3 can be used:
```math
s_{i} = \frac{exp(x_{i} + c)}{exp(c)+ \sum_{i} exp(x_{i} +c)}
```
c can be any value in theory. In practice, c should be chosen so that
$exp(c)$ and $exp(x_{i} +c)$ do not both overflow (or both underflow).
A reasonable choice is formula 4:
```math
c=-\max_{i} \{ x_i \}
```
or apply the constraint c <= 0, as in formula 5:
```math
c=-\max(0, \max_{i} \{ x_i \})
```
The latter (formula 5) ensures that $s_{i}$ falls back to formula 2
when all elements are negative, because c is then 0.
For the CPU provider, smooth softmax is implemented in MLAS. The CPU
implementation uses formula 5.
@wangyems implemented smooth softmax in flash attention for CUDA,
which requires an Ampere or newer GPU. The flash attention
implementation of smooth softmax uses formula 4.
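The numerically stable form (formula 3 with the constant from formula 5) can be sketched in plain Python as follows. This is an illustrative reference under the definitions above, not the MLAS or flash attention code, and `smooth_softmax` is a hypothetical name. It assumes a non-empty input row that contains no +inf values.

```python
import math
from typing import List


def smooth_softmax(x: List[float]) -> List[float]:
    """Smooth softmax of one row using formula 3 with c = -max(0, max_i x_i)."""
    c = -max(0.0, max(x))                  # formula 5: c <= 0
    exps = [math.exp(v + c) for v in x]    # each exponent is <= 0, so no overflow
    denom = math.exp(c) + sum(exps)        # exp(c) is the shifted "1 +" term of formula 2
    return [e / denom for e in exps]


# A fully masked row yields all zeros instead of a division by zero.
print(smooth_softmax([float("-inf")] * 3))
# The outputs sum to less than 1.
print(sum(smooth_softmax([1.0, 2.0, 3.0])))
```

With formula 4 (c equal to the negated row maximum), a fully masked row would give c = +inf and an undefined `-inf + inf` exponent in this sketch, which is one reason the c <= 0 constraint is convenient for a reference implementation.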
---------
Co-authored-by: Ye Wang
### Description
### Motivation and Context
1. The Python API doc needs to be merged from a fork, but the 1ES
self-hosted pool works with only one GitHub repo.
2. ubuntu-latest installs a NumPy version above 2.0 by default, which
the current Python API doc generation doesn't support, so NumPy is
pinned to < 2.0.0.
### Description
Previously, MultiHeadAttention supported relative position bias of shape
[1, N, S, T] or [B, N, S, T], and DecoderMaskedMultiHeadAttention
supported [1, N, S, T]. This PR extends the support to [1, N, S, T],
[B, N, S, T], [B, 1, S, T] and [1, 1, S, T] for the CUDA and CPU EPs.
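The four accepted shapes amount to allowing the first two dimensions to be either full-size or 1 (broadcast). A small shape check illustrates the rule. It assumes B is the batch size, N the number of heads, S the sequence length and T the total sequence length; `is_supported_attention_bias_shape` is a hypothetical helper, not an ORT function.

```python
from typing import Sequence


def is_supported_attention_bias_shape(
    bias_shape: Sequence[int], b: int, n: int, s: int, t: int
) -> bool:
    """True if bias_shape is [1 or B, 1 or N, S, T]."""
    if len(bias_shape) != 4:
        return False
    d0, d1, d2, d3 = bias_shape
    # The first two dimensions may broadcast; the last two must match exactly.
    return d0 in (1, b) and d1 in (1, n) and d2 == s and d3 == t


B, N, S, T = 2, 8, 16, 32
for shape in ([1, N, S, T], [B, N, S, T], [B, 1, S, T], [1, 1, S, T], [B, N, 1, T]):
    print(shape, is_supported_attention_bias_shape(shape, B, N, S, T))
```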
- [x] Rename the input of "relative position bias" to "attention bias"
because it can also be used for other types of bias, like ALiBi
(Attention with Linear Biases) or attention mask.
- [x] Update unfused kernel to support broadcasting 2nd dimension of
attention bias.
- [x] Update efficient attention to support broadcasting 2nd dimension
of attention bias.
- [x] Update operators (MultiHeadAttention,
DecoderMaskedMultiHeadAttention, Attention, PackedAttention,
PackedMultiHeadAttention) to support broadcast attention bias on CUDA
and CPU EPs.
- [x] Update ROCm, DML and WebGPU naming to be consistent. (Note that
those EPs do not support broadcasting attention_bias for now).
- [x] Add attention bias tests for MultiHeadAttention.
- [x] Update operator documents
- [x] Update benchmark script
Other changes:
* Fix some checks in multihead-attention.ts
* Add helper functions to dump tensors given dimensions.
### Description
Add a gather that supports block-quantized input data.
### Motivation and Context
To support Web inference scenario with quantized vocabulary embeddings.
### Description
This PR registers the ReduceMin-20 operator to the DML EP.
### Motivation and Context
### Description
Add OVEP features for 1.19.
This PR:
- Adds support for EpCtx with ORT session options for optimized
performance.
- Adds bug fixes.
- Adds support for OV 2024.3.
---------
Co-authored-by: ubuntu <ubuntu@ubuntu-mtlp-118727.iind.intel.com>
Co-authored-by: vthaniel <vishnudas.thaniel.s@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel.com>
Co-authored-by: saurabhkale17 <saurabh1.kale@intel.com>
Co-authored-by: Maheshkar <ankit.maheshkar@intel.com>
### Description
Introduces an ATen fallback for
`torch.nn.functional.scaled_dot_product_attention`. This operator was
introduced in torch 2.0 and, since then, has had many updates including
the implementation of memory efficient attention for V100 machines. The
current TorchScript exporter exports a subgraph for attention which does
not provide the same memory savings that PyTorch's memory efficient
attention kernel provides. Allowing fallback to PyTorch ATen op for
attention helps mitigate memory spike issues for models leveraging
memory efficient attention.
### Motivation and Context
Memory issues arose when integrating ONNX Runtime Training with AML
Stable Diffusion.
---------
Co-authored-by: root <prathikrao@microsoft.com>
Add SparseAttention CPU implementation.
- [x] Refactor GQAAttentionBase
- [x] Add SparseAttention implementation
- [x] Add test cases
This is the unfused version. A flash attention version will be added later.