### Description
<!-- Describe your changes. -->
When the ONNX model reuses an initializer in more than one op, if one op
wants to add that initializer to the skipped list while another op still
needs it, the process crashes. Therefore, like other EPs, we track
`initializer_usage_`, the number of occurrences of each initializer
across all ops, and modify `AddInitializersToSkip` to decrement the
corresponding initializer's count when adding the specific operators.
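The counting scheme can be illustrated with a small Python sketch (the actual EP code is C++; the function names here are illustrative, not the real API):

```python
from collections import Counter

def build_initializer_usage(ops):
    """Count how many ops consume each initializer.
    ops: list of (op_name, [initializer_names]) pairs."""
    usage = Counter()
    for _, inits in ops:
        usage.update(inits)
    return usage

def add_initializers_to_skip(usage, skipped, initializers):
    """Decrement the count; only skip once no remaining op references it."""
    for name in initializers:
        usage[name] -= 1
        if usage[name] <= 0:
            skipped.add(name)

# "weight" is shared by two ops, so the first skip request must not drop it.
ops = [("Gather", ["weight"]), ("MatMul", ["weight"]), ("Add", ["bias"])]
usage = build_initializer_usage(ops)
skipped = set()
add_initializers_to_skip(usage, skipped, ["weight"])  # first op skips it
assert "weight" not in skipped                        # second op still needs it
add_initializers_to_skip(usage, skipped, ["weight"])  # last consumer skips it
assert "weight" in skipped                            # now safe to drop
```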
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
1. Update onnxruntime binary size checks ci pipeline's docker image. Use
a different docker image that is not manylinux based. The new one is
smaller.
2. Add flatbuffers to tools/ci_build/requirements/pybind/requirements.txt
3. Delete
tools/ci_build/github/azure-pipelines/py-package-build-pipeline.yml. The
pipeline was for generating packages for Olive, but it went unused. And
the content is highly duplicated with our official python packaging
pipeline.
4. A lot of YAML files reference pypa/manylinux git repo but do not use
it. This PR removes the references.
### Description
<!-- Describe your changes. -->
This reverts commit 5d215ff810.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
The reverted change causes a packaging pipeline to fail due to a crash
in one of the E2E Android tests.
Reverting this first to fix the pipeline. We should come up with an
alternative way to properly do the necessary clean up.
### Description
the `std::unordered_map` uses a `std::string_view` as its key, but the
string view may refer to invalid memory: `IdentityBuilder` returns a
temporary `std::string` that is destroyed at the end of the full
expression, leaving the key dangling.
```c++
unordered_map<string_view, std::vector<NodeIndex>> identical_children_map;
for (auto i = node->OutputEdgesBegin(); i != node->OutputEdgesEnd(); ++i) {
if (i->GetNode().OpType() == op) {
identical_children_map[IdentityBuilder(graph, i->GetNode())].push_back(i->GetNode().Index());
}
}
```
This code triggers a warning, treated as an error, in EMSDK v4.0.1:
```
C:/code/o2/onnxruntime/core/optimizer/identical_children_consolidation.cc:51:30: error: object whose reference is captured by 'identical_children_map' will be destroyed at the end of the full-expression [-Werror,-Wdangling-capture]
51 | identical_children_map[IdentityBuilder(graph, i->GetNode())].push_back(i->GetNode().Index());
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
```
### Description
- Fixes segfault when the function that cleans up HTP memory handles
uses an invalid Logger.
- Fixes a unit test that compares output from QNN EP against exact float
values. QNN HTP runs float32 models with float16 precision, so the
comparison needs a tolerance.
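Why exact comparison fails can be seen by simulating the fp16 round-trip in numpy (illustrative values, not the actual test data):

```python
import numpy as np

# fp32 reference values vs. the same values after a float16 round-trip,
# simulating HTP computing in float16 precision.
expected = np.array([1.0001, 2.0002, 3.0003], dtype=np.float32)
actual = expected.astype(np.float16).astype(np.float32)

assert not np.array_equal(expected, actual)                  # exact compare fails
assert np.allclose(expected, actual, rtol=1e-2, atol=1e-3)   # tolerant compare passes
```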
### Motivation and Context
Fixes issues with using QNN HTP memory sharing on Windows ARM64. This is
also needed to test HTP shared memory with
https://github.com/microsoft/onnxruntime/pull/23120.
### Description
<!-- Describe your changes. -->
The old `GetCapability` function of the WebNN EP does a very simple
search for groups of nodes that can be handled. This doesn't work well
in the following example graph, where A and D could be handled by the
EP but B sits between them in the topological order, so you get two
single-node capabilities. It may also be advantageous for C and E to be
handled by the EP, since they would then be combined with D even though
they are not connected.
```
A B C
| / |
D E
| |
```
Therefore, we improve partitioning results by reusing
`utils::CreateSupportedPartitions`, which walks the edges of each node
the EP can handle as the nodes are iterated in topological order. This
guarantees that all connected nodes that can be handled are grouped
together. Correspondingly, we modify the `webnn::GetSupportedNodes`
function to return the supported nodes instead of the groups of
supported partitions.
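The grouping idea can be sketched in Python (a simplified illustration only: the real `utils::CreateSupportedPartitions` is C++ and also performs cycle-avoidance checks omitted here):

```python
def partition_supported(nodes, edges, supported):
    """Group supported nodes connected by an edge into one partition,
    instead of cutting a new partition whenever an unsupported node
    interleaves in topological order.
    nodes: topological order; edges: set of (src, dst); supported: set."""
    parent = {n: n for n in nodes if n in supported}

    def find(x):  # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for src, dst in edges:
        if src in supported and dst in supported:
            parent[find(src)] = find(dst)

    groups = {}
    for n in nodes:
        if n in supported:
            groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())

# The example graph from above: B is unsupported but sits between A and D.
nodes = ["A", "B", "C", "D", "E"]
edges = {("A", "D"), ("B", "D"), ("C", "E")}
print(partition_supported(nodes, edges, {"A", "C", "D", "E"}))
# → [['A', 'D'], ['C', 'E']]
# A naive topological scan would split {A, D}; edge-walking keeps them together.
```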
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Co-authored-by: Dwayne Robinson <fdwr@hotmail.com>
Add a tool to generate the node_block_list used in the [float16 conversion tool](04030f64be/onnxruntime/python/tools/transformers/float16.py (L175)).
We already have a feature to dump statistics (like min and max) of each
node input/output. However, it is time-consuming to generate a list of
nodes that need to be kept in float32 when the model is large.
This tool speeds up the process by outputting a list of nodes that could
overflow in float-to-half conversion.
To use it, build onnxruntime from source with ` --cmake_extra_defines
onnxruntime_DEBUG_NODE_INPUTS_OUTPUTS=1`, then set some environment
variables before running the float32-optimized ONNX model, like:
```
export ORT_DEBUG_NODE_IO_DUMP_HALF_CONVERSION_OVERFLOW=1
export ORT_DEBUG_NODE_IO_HALF_OVERFLOW_THRESHOLD=50000
python benchmark.py -e optimum --height 1024 --width 1024 --steps 3 -b 1 -v Flux.1D -p flux1_dev_onnx/fp32_opt --skip_warmup
```
The threshold `ORT_DEBUG_NODE_IO_HALF_OVERFLOW_THRESHOLD` shall be <=
65504; the default is 50000 when the environment variable is not set.
It is better to leave some margin if the number of samples in the test
is not large.
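The 65504 ceiling comes from float16's largest finite value; a quick numpy check shows why a lower threshold with margin is sensible:

```python
import numpy as np

# float16's max finite value is 65504, so any float32 value above it
# overflows to inf after conversion.
assert np.isinf(np.float32(70000.0).astype(np.float16))
assert np.float16(65504.0) == 65504.0  # still representable
# A threshold like 50000 leaves headroom for larger values that the
# (possibly small) sample of test inputs did not happen to exercise.
```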
As a demo, we add a --skip_warmup option to benchmark.py for Flux, so
that we can reduce the time spent dumping warm-up runs.
Example snippet of stdout (each inference session prints such a summary
when the session ends):
```
Total counter in node dumping: 141
Found 2 nodes cannot be converted to half precision due to potential input/output overflow.
Operator frequencies for these nodes:
Softmax : 1
MatMul : 1
# -------
# Example python script for float16 conversion
# For details, search `node_block_list` in https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/float16.py
# -------
from onnxruntime.transformers.onnx_model import OnnxModel
m = OnnxModel(onnx.load('flux1_dev_onnx/fp32_opt/vae_decoder/model.onnx'))
node_block_list = [
'/decoder/mid_block/attentions.0/Softmax',
'/decoder/mid_block/attentions.0/MatMul',
]
m.convert_float_to_float16(keep_io_types=False, node_block_list=node_block_list)
m.save_model_to_file('fp16/optimized.onnx', use_external_data_format=False)
```
Then you can use the python script to convert the corresponding model to
float16.
### Motivation and Context
It is a tool to generate the node_block_list used in float16 conversion
of Stable Diffusion 3.x and Flux models in
https://github.com/microsoft/onnxruntime/pull/22986.
In a Stable Diffusion or Flux pipeline, there are multiple models, and
there could be multiple session runs for each model. Without a proper
tool, it is time-consuming to get the node_block_list for each model.
### Description
Follow-up to #21897.
To be compatible with ONNX 1.17.0, registering opset 22 is required for
the [updated (bfloat16)
operators](https://github.com/onnx/onnx/releases/tag/v1.17.0).
### Motivation and Context
Fixes #23162, #23161, #23164 (Xnnpack)
### Remaining issue
#23163 (QNN) See [the
file](https://github.com/microsoft/onnxruntime/pull/23344/files#diff-04f5d6db0a6873f7299ed06ff1ec45a49e69f0865cb32f4397cd56db0cd0a784)
### Result of `find_optimizer_opset_version_updates_required.py` (cpu only)
```
[WARNING] - Newer opset found for kOnnxDomain.Conv. Latest:22 Optimizer support ends at 11. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/conv_add_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.IsInf. Latest:20 Optimizer support ends at 10. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/isinf_reducesum_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.Cast. Latest:21 Optimizer support ends at 19. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/isinf_reducesum_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.Cast. Latest:21 Optimizer support ends at 19. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/isinf_reducesum_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.HardSigmoid. Latest:22 Optimizer support ends at 6. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/conv_add_act_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.Cast. Latest:21 Optimizer support ends at 19. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/layer_norm_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.Cast. Latest:21 Optimizer support ends at 19. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/layer_norm_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.Cast. Latest:21 Optimizer support ends at 19. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/layer_norm_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.Cast. Latest:21 Optimizer support ends at 19. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/layer_norm_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.Cast. Latest:21 Optimizer support ends at 19. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/layer_norm_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.Cast. Latest:21 Optimizer support ends at 19. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/layer_norm_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.Transpose. Latest:21 Optimizer support ends at 13. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/nchwc_transformer.cc
[WARNING] - Newer opset found for kOnnxDomain.Conv. Latest:22 Optimizer support ends at 11. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/nchwc_transformer.cc
[WARNING] - Newer opset found for kOnnxDomain.MaxPool. Latest:22 Optimizer support ends at 12. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/nchwc_transformer.cc
[WARNING] - Newer opset found for kOnnxDomain.AveragePool. Latest:22 Optimizer support ends at 11. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/nchwc_transformer.cc
[WARNING] - Newer opset found for kOnnxDomain.BatchNormalization. Latest:15 Optimizer support ends at 14. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/nchwc_transformer.cc
[WARNING] - Newer opset found for kOnnxDomain.Transpose. Latest:21 Optimizer support ends at 13. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/nchwc_transformer.cc
[WARNING] - Newer opset found for kOnnxDomain.Upsample. Latest:10 Optimizer support ends at 13. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/nchwc_transformer.cc
[WARNING] - Newer opset found for kOnnxDomain.Resize. Latest:19 Optimizer support ends at 13. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/nchwc_transformer.cc
[WARNING] - Newer opset found for kOnnxDomain.GlobalMaxPool. Latest:22 Optimizer support ends at 1. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/nchwc_transformer.cc
[WARNING] - Newer opset found for kOnnxDomain.GlobalAveragePool. Latest:22 Optimizer support ends at 1. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/nchwc_transformer.cc
[WARNING] - Newer opset found for kOnnxDomain.Shape. Latest:21 Optimizer support ends at 19. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/pre_shape_node_elimination.cc
[WARNING] - Newer opset found for kOnnxDomain.Conv. Latest:22 Optimizer support ends at 11. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/conv_bn_fusion.cc
[ERROR] - Call/Declaration is split over multiple lines. Please check manually.File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/label_encoder_fusion.cc Line:49
[ERROR] - Failed to find version information for "ai.onnx.ml".LabelEncoder. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/label_encoder_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.HardSigmoid. Latest:22 Optimizer support ends at 6. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/conv_activation_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.Dropout. Latest:22 Optimizer support ends at 13. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/dropout_elimination.cc
[WARNING] - Newer opset found for kOnnxDomain.Transpose. Latest:21 Optimizer support ends at 13. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/gemm_transpose_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.Transpose. Latest:21 Optimizer support ends at 13. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/gemm_transpose_fusion.cc
[ERROR] - Symbolic name of 'ignorable_nodes[index].first' found for op. Please check manually. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/matmul_bn_fusion.cc
[ERROR] - Symbolic name of 'dest.first' found for op. Please check manually. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/matmul_bn_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.Conv. Latest:22 Optimizer support ends at 11. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/pad_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.AveragePool. Latest:22 Optimizer support ends at 19. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/pad_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.MaxPool. Latest:22 Optimizer support ends at 12. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/pad_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.Pad. Latest:21 Optimizer support ends at 19. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/pad_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.Cast. Latest:21 Optimizer support ends at 13. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/pad_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.Dropout. Latest:22 Optimizer support ends at 13. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/bias_dropout_fusion.cc
[ERROR] - Failed to find version information for kMSDomain.BitmaskDropout. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/bias_dropout_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.Clip. Latest:13 Optimizer support ends at 6. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/relu_clip_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.Cast. Latest:21 Optimizer support ends at 19. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/fast_gelu_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.Cast. Latest:21 Optimizer support ends at 19. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/fast_gelu_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.Reshape. Latest:21 Optimizer support ends at 14. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/reshape_fusion.cc
[ERROR] - Failed to find version information for kMSDomain.ConcatTraining. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/reshape_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.Where. Latest:16 Optimizer support ends at 9. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/not_where_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.Where. Latest:16 Optimizer support ends at 9. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/not_where_fusion.cc
[WARNING] - Newer opset found for kOnnxDomain.Conv. Latest:22 Optimizer support ends at 11. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/conv_mul_fusion.cc
[ERROR] - Symbolic name of 'QOpName' found for op. Please check manually. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/qdq_transformer/qdq_util.cc
[ERROR] - Symbolic name of 'QOpName' found for op. Please check manually. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/qdq_transformer/qdq_util.cc
[ERROR] - Symbolic name of 'DQOpName' found for op. Please check manually. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/qdq_transformer/qdq_util.cc
[ERROR] - Symbolic name of 'DQOpName' found for op. Please check manually. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/qdq_transformer/qdq_util.cc
[ERROR] - Call/Declaration is split over multiple lines. Please check manually.File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/qdq_transformer/avx2_weight_s8_to_u8.cc Line:170
[WARNING] - Newer opset found for kOnnxDomain.MaxPool. Latest:22 Optimizer support ends at 12. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/qdq_transformer/qdq_propagation.cc
[ERROR] - Symbolic name of 'current_node.OpType(' found for op. Please check manually. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/compute_optimizer/upstream_transformer_base.cc
[WARNING] - Newer opset found for kOnnxDomain.Reshape. Latest:21 Optimizer support ends at 14. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/compute_optimizer/upstream_reshape.cc
[WARNING] - Newer opset found for kOnnxDomain.Transpose. Latest:21 Optimizer support ends at 13. File:/home/titaiwang/onnxruntime/onnxruntime/core/optimizer/attention_fusion_helper.h
```
Use ruff as the code formatter in place of black and isort, since it is
much faster and projects like PyTorch and ONNX have adopted ruff format
as well.
This PR includes only auto-fixed formatting changes.
### Description
This PR allows the WebGPU EP to be built with Emscripten for
WebAssembly, including:
- CMake build file updates to support correct setup for Emscripten.
- code changes to fix build breaks for wasm.
- a change in the Web CI pipeline to add a build-only target for wasm
with `--use_webgpu`.
### Description
Docker's buildx has four different drivers:
1. default
2. docker-container
3. kubernetes
4. remote
Currently we use "docker-container". This PR changes it to the default
driver, because the container driver needs to fetch an image from Docker
Hub, which is no longer free and has a rate limit.
### Description
Set the power config ID and the default power mode from the provider
options (if present) for the main thread; otherwise the power mode gets
messed up when a user creates a session without running it.
The issue fixed by this PR:
Process 1 creates a session without running it.
Then process 2 creates a session and runs it with power-saver mode, but
the result uses burst power mode.
### Description
<!-- Describe your changes. -->
- Implemented the DepthToSpace uint8_t kernel.
- Enabled DropQDQNodesRules for DepthToSpace.
- Added unit tests for the DepthToSpace uint8_t kernel.
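For reference, the DepthToSpace rearrangement (DCR mode, as defined by the ONNX operator) can be expressed as a reshape plus transpose; a numpy sketch on uint8 data:

```python
import numpy as np

def depth_to_space(x: np.ndarray, blocksize: int) -> np.ndarray:
    """ONNX DepthToSpace, DCR mode: (N, C*b*b, H, W) -> (N, C, H*b, W*b)."""
    n, c, h, w = x.shape
    b = blocksize
    tmp = x.reshape(n, b, b, c // (b * b), h, w)
    tmp = tmp.transpose(0, 3, 4, 1, 5, 2)
    return tmp.reshape(n, c // (b * b), h * b, w * b)

x = np.arange(16, dtype=np.uint8).reshape(1, 4, 2, 2)
y = depth_to_space(x, 2)
assert y.shape == (1, 1, 4, 4)
assert y.dtype == np.uint8
assert y[0, 0, 0].tolist() == [0, 4, 1, 5]  # blocks interleaved across channels
```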
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This commit aims to enhance the performance of the Image
Super-Resolution INT8 Model (RFDN). Specifically, it improves the
Inference Per Second (IPS) by 25%, providing a significant boost in
efficiency and speed.
Add support to mainline ONNX Runtime for changes from the ROCm team.
### Motivation and Context
Various bugfixes and changes added between ROCm 6.2 and 6.3 that haven't
yet been upstreamed to mainline.
---------
Co-authored-by: Yueqing Zhang <yuz75@Pitt.edu>
Co-authored-by: Yueqing Zhang <yueqingz@amd.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Artur Wojcik <artur.wojcik@outlook.com>
Co-authored-by: Ted Themistokleous <tedthemistokleous@amd.com>
Co-authored-by: Xinya Zhang <Xinya.Zhang@amd.com>
Co-authored-by: ikalinic <ilija.kalinic@amd.com>
Co-authored-by: sstamenk <sstamenk@amd.com>
Bumps [ruff](https://github.com/astral-sh/ruff) from 0.5.4 to 0.9.1.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/astral-sh/ruff/releases">ruff's
releases</a>.</em></p>
<blockquote>
<h2>0.9.1</h2>
<h2>Release Notes</h2>
<h3>Preview features</h3>
<ul>
<li>[<code>pycodestyle</code>] Run
<code>too-many-newlines-at-end-of-file</code> on each cell in notebooks
(<code>W391</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/pull/15308">#15308</a>)</li>
<li>[<code>ruff</code>] Omit diagnostic for shadowed private function
parameters in <code>used-dummy-variable</code> (<code>RUF052</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/pull/15376">#15376</a>)</li>
</ul>
<h3>Rule changes</h3>
<ul>
<li>[<code>flake8-bugbear</code>] Improve
<code>assert-raises-exception</code> message (<code>B017</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/pull/15389">#15389</a>)</li>
</ul>
<h3>Formatter</h3>
<ul>
<li>Preserve trailing end-of line comments for the last string literal
in implicitly concatenated strings (<a
href="https://redirect.github.com/astral-sh/ruff/pull/15378">#15378</a>)</li>
</ul>
<h3>Server</h3>
<ul>
<li>Fix a bug where the server and client notebooks were out of sync
after reordering cells (<a
href="https://redirect.github.com/astral-sh/ruff/pull/15398">#15398</a>)</li>
</ul>
<h3>Bug fixes</h3>
<ul>
<li>[<code>flake8-pie</code>] Correctly remove wrapping parentheses
(<code>PIE800</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/pull/15394">#15394</a>)</li>
<li>[<code>pyupgrade</code>] Handle comments and multiline expressions
correctly (<code>UP037</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/pull/15337">#15337</a>)</li>
</ul>
<h2>Contributors</h2>
<ul>
<li><a
href="https://github.com/AntoineD"><code>@AntoineD</code></a></li>
<li><a
href="https://github.com/InSyncWithFoo"><code>@InSyncWithFoo</code></a></li>
<li><a
href="https://github.com/MichaReiser"><code>@MichaReiser</code></a></li>
<li><a href="https://github.com/calumy"><code>@calumy</code></a></li>
<li><a
href="https://github.com/dcreager"><code>@dcreager</code></a></li>
<li><a
href="https://github.com/dhruvmanila"><code>@dhruvmanila</code></a></li>
<li><a href="https://github.com/dylwil3"><code>@dylwil3</code></a></li>
<li><a href="https://github.com/sharkdp"><code>@sharkdp</code></a></li>
<li><a href="https://github.com/tjkuson"><code>@tjkuson</code></a></li>
</ul>
<h2>Install ruff 0.9.1</h2>
<h3>Install prebuilt binaries via shell script</h3>
<pre lang="sh"><code>curl --proto '=https' --tlsv1.2 -LsSf
https://github.com/astral-sh/ruff/releases/download/0.9.1/ruff-installer.sh
| sh
</code></pre>
<h3>Install prebuilt binaries via powershell script</h3>
<pre lang="sh"><code>powershell -ExecutionPolicy ByPass -c "irm
https://github.com/astral-sh/ruff/releases/download/0.9.1/ruff-installer.ps1
| iex"
</code></pre>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/astral-sh/ruff/blob/main/CHANGELOG.md">ruff's
changelog</a>.</em></p>
<blockquote>
<h2>0.9.1</h2>
<h3>Preview features</h3>
<ul>
<li>[<code>pycodestyle</code>] Run
<code>too-many-newlines-at-end-of-file</code> on each cell in notebooks
(<code>W391</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/pull/15308">#15308</a>)</li>
<li>[<code>ruff</code>] Omit diagnostic for shadowed private function
parameters in <code>used-dummy-variable</code> (<code>RUF052</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/pull/15376">#15376</a>)</li>
</ul>
<h3>Rule changes</h3>
<ul>
<li>[<code>flake8-bugbear</code>] Improve
<code>assert-raises-exception</code> message (<code>B017</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/pull/15389">#15389</a>)</li>
</ul>
<h3>Formatter</h3>
<ul>
<li>Preserve trailing end-of line comments for the last string literal
in implicitly concatenated strings (<a
href="https://redirect.github.com/astral-sh/ruff/pull/15378">#15378</a>)</li>
</ul>
<h3>Server</h3>
<ul>
<li>Fix a bug where the server and client notebooks were out of sync
after reordering cells (<a
href="https://redirect.github.com/astral-sh/ruff/pull/15398">#15398</a>)</li>
</ul>
<h3>Bug fixes</h3>
<ul>
<li>[<code>flake8-pie</code>] Correctly remove wrapping parentheses
(<code>PIE800</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/pull/15394">#15394</a>)</li>
<li>[<code>pyupgrade</code>] Handle comments and multiline expressions
correctly (<code>UP037</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/pull/15337">#15337</a>)</li>
</ul>
<h2>0.9.0</h2>
<p>Check out the <a href="https://astral.sh/blog/ruff-v0.9.0">blog
post</a> for a migration guide and overview of the changes!</p>
<h3>Breaking changes</h3>
<p>Ruff now formats your code according to the 2025 style guide. As a
result, your code might now get formatted differently. See the formatter
section for a detailed list of changes.</p>
<p>This release doesn’t remove or remap any existing stable rules.</p>
<h3>Stabilization</h3>
<p>The following rules have been stabilized and are no longer in
preview:</p>
<ul>
<li><a
href="https://docs.astral.sh/ruff/rules/stdlib-module-shadowing/"><code>stdlib-module-shadowing</code></a>
(<code>A005</code>).
This rule has also been renamed: previously, it was called
<code>builtin-module-shadowing</code>.</li>
<li><a
href="https://docs.astral.sh/ruff/rules/builtin-lambda-argument-shadowing/"><code>builtin-lambda-argument-shadowing</code></a>
(<code>A006</code>)</li>
<li><a
href="https://docs.astral.sh/ruff/rules/slice-to-remove-prefix-or-suffix/"><code>slice-to-remove-prefix-or-suffix</code></a>
(<code>FURB188</code>)</li>
<li><a
href="https://docs.astral.sh/ruff/rules/boolean-chained-comparison/"><code>boolean-chained-comparison</code></a>
(<code>PLR1716</code>)</li>
<li><a
href="https://docs.astral.sh/ruff/rules/decimal-from-float-literal/"><code>decimal-from-float-literal</code></a>
(<code>RUF032</code>)</li>
<li><a
href="https://docs.astral.sh/ruff/rules/post-init-default/"><code>post-init-default</code></a>
(<code>RUF033</code>)</li>
<li><a
href="https://docs.astral.sh/ruff/rules/useless-if-else/"><code>useless-if-else</code></a>
(<code>RUF034</code>)</li>
</ul>
<p>The following behaviors have been stabilized:</p>
<ul>
<li><a
href="https://docs.astral.sh/ruff/rules/pytest-parametrize-names-wrong-type/"><code>pytest-parametrize-names-wrong-type</code></a>
(<code>PT006</code>): Detect <a
href="https://docs.pytest.org/en/7.1.x/how-to/parametrize.html#parametrize"><code>pytest.parametrize</code></a>
calls outside decorators and calls with keyword arguments.</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="12f86f39a4"><code>12f86f3</code></a>
Ruff 0.9.1 (<a
href="https://redirect.github.com/astral-sh/ruff/issues/15407">#15407</a>)</li>
<li><a
href="2b28d566a4"><code>2b28d56</code></a>
Associate a trailing end-of-line comment in a parenthesized implicit
concaten...</li>
<li><a
href="adca7bd95c"><code>adca7bd</code></a>
Remove pygments pin (<a
href="https://redirect.github.com/astral-sh/ruff/issues/15404">#15404</a>)</li>
<li><a
href="6b98a26452"><code>6b98a26</code></a>
[red-knot] Support <code>assert_type</code> (<a
href="https://redirect.github.com/astral-sh/ruff/issues/15194">#15194</a>)</li>
<li><a
href="c87463842a"><code>c874638</code></a>
[red-knot] Move tuple-containing-Never tests to Markdown (<a
href="https://redirect.github.com/astral-sh/ruff/issues/15402">#15402</a>)</li>
<li><a
href="c364b586f9"><code>c364b58</code></a>
[<code>flake8-pie</code>] Correctly remove wrapping parentheses
(<code>PIE800</code>) (<a
href="https://redirect.github.com/astral-sh/ruff/issues/15394">#15394</a>)</li>
<li><a
href="73d424ee5e"><code>73d424e</code></a>
Fix outdated doc for handling the default file types with the pre-commit
hook...</li>
<li><a
href="6e9ff445fd"><code>6e9ff44</code></a>
Insert the cells from the <code>start</code> position (<a
href="https://redirect.github.com/astral-sh/ruff/issues/15398">#15398</a>)</li>
<li><a
href="f2c3ddc5ea"><code>f2c3ddc</code></a>
[red-knot] Move intersection type tests to Markdown (<a
href="https://redirect.github.com/astral-sh/ruff/issues/15396">#15396</a>)</li>
<li><a
href="b861551b6a"><code>b861551</code></a>
Remove unnecessary backticks (<a
href="https://redirect.github.com/astral-sh/ruff/issues/15393">#15393</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/astral-sh/ruff/compare/0.5.4...0.9.1">compare
view</a></li>
</ul>
</details>
<br />
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
### Description
Update xnnpack to remove the dependency on the psimd and fp16 libraries.
However, coremltools still depends on them, which will be addressed
later.
Also, update CPUINFO because the latest xnnpack requires CPUINFO's avx10
support.
### Motivation and Context
The fewer dependencies the better.
### Description
Fix a bug in a previous change where a failure during `SetupBackend` causes `ReleaseResources` to be called to clean up but do nothing because `backend_setup_completed_` is false. `backend_setup_completed_` _seems_ to now be redundant, so removing it fixes the problem.
### Motivation and Context
We are seeing crashes due to the log callback failing to be de-registered
### Description
This PR adds unit tests for [fusing the vision
components](https://github.com/microsoft/onnxruntime/pull/20721) of
Phi-3 vision and Phi-3.5 vision.
### Motivation and Context
Many multi-modal models use a CLIP encoder or a variant of CLIP as part
of their encoders. These fusion unit tests will ensure that the vision
components of Phi-3 vision and Phi-3.5 vision can still be fused when
existing fusions are modified to support more models.
WebNN doesn't provide a dedicated op for RotaryEmbedding. Instead, we
implement it using a combination of WebNN ops. The decomposed graph is
based on the DML EP's implementation at:
onnxruntime/core/providers/dml/DmlExecutionProvider/src/Operators/DmlOperatorRotaryEmbedding.cpp
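The decomposition builds rotary embedding from basic ops (split, multiply, concatenate); a simplified numpy sketch of the non-interleaved variant (illustrative shapes, not the attention-head layout the EPs actually use):

```python
import numpy as np

def rotary_embedding(x, cos, sin):
    """Rotate pairs of features: out = x*cos + rotate_half(x)*sin."""
    half = x.shape[-1] // 2
    x1, x2 = x[..., :half], x[..., half:]
    rotated = np.concatenate([-x2, x1], axis=-1)  # rotate_half
    cos2 = np.concatenate([cos, cos], axis=-1)
    sin2 = np.concatenate([sin, sin], axis=-1)
    return x * cos2 + rotated * sin2

x = np.array([[1.0, 2.0, 3.0, 4.0]])
# Angle 0 (cos=1, sin=0) leaves the input unchanged.
assert np.allclose(rotary_embedding(x, np.ones((1, 2)), np.zeros((1, 2))), x)
# Angle pi/2 (cos=0, sin=1) yields the rotated halves.
assert np.allclose(rotary_embedding(x, np.zeros((1, 2)), np.ones((1, 2))),
                   [[-3.0, -4.0, 1.0, 2.0]])
```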
### Description
<!-- Describe your changes. -->
* Remove deprecated GPU archs to control nuget/python package size
(latest TRT supports sm75 Turing and newer archs)
* Add 90 to support the Blackwell series in the next release (86;89 not
considered, as adding them would rapidly increase package size)
| arch_range | Python-cuda12 | Nuget-cuda12 |
| --- | --- | --- |
| 60;61;70;75;80 | Linux: 279MB Win: 267MB | Linux: 247MB Win: 235MB |
| 75;80 | Linux: 174MB Win: 162MB | Linux: 168MB Win: 156MB |
| **75;80;90** | **Linux: 299MB Win: 277MB** | **Linux: 294MB Win: 271MB** |
| 75;80;86;89 | [Linux: MB Win: 390MB](https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=647457&view=results) | Linux: 416MB Win: 383MB |
| 75;80;86;89;90 | [Linux: MB Win: 505MB](https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=646536&view=results) | Linux: 541MB Win: 498MB |
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Callout: while adding sm90 support, the CUDA 11.8 + cuDNN 8 build will
be dropped in the coming ORT release, as that build has issues with
Blackwell (mentioned in the comments) and demand for CUDA 11 is minor,
according to the internal ort-cuda11 repo.
### Description
Updating react-native to 0.70.15.
### Motivation and Context
To address the checksum failure after Boost switched its download URL
away from JFrog.
Adds QNN EP HTP shared memory allocator.
The HTP shared memory allocator (`HtpSharedMemoryAllocator`) calls the
rpcmem shared library (libcdsprpc.so/dll) to allocate and free memory
that can be shared between HTP and CPU.
The allocator can be enabled by setting QNN EP option
`enable_htp_shared_memory_allocator` to `1`.
`QNNExecutionProvider::CreatePreferredAllocators()` will then return an
instance of `HtpSharedMemoryAllocator`.
For each QNN context, we also need to register and unregister memory
handles in order to use the HTP shared memory. This memory handle
management is added to `QnnBackendManager`, which also manages the QNN
context handles.
For more information about using HTP shared memory with QNN, see:
https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/htp_shared_buffer_tutorial.html#shared-buffer-tutorial
Limitations:
- HTP shared memory usage is only supported for graph inputs and
outputs. Intermediate values are not supported.
- An allocation is assigned to a single shared memory buffer. The
allocator is not smart enough to have multiple allocations share a
single shared memory buffer.
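For example, from the Python API the option can be passed as a QNN EP provider option when creating a session. This is a sketch: the `backend_path` value is an assumption, and a QNN-enabled ONNX Runtime build plus a Qualcomm device are required to actually run it.

```python
# Provider options enabling the HTP shared memory allocator.
qnn_provider = (
    "QNNExecutionProvider",
    {
        "backend_path": "libQnnHtp.so",  # assumed HTP backend library name
        "enable_htp_shared_memory_allocator": "1",
    },
)

# On a device with a QNN-enabled build:
# import onnxruntime as ort
# session = ort.InferenceSession("model.onnx", providers=[qnn_provider])
```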
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
### Description
This PR contains a part of the changes in #23318.
The reason for creating this PR: the work to support building the WebGPU
EP in WASM depends on #23318, which cannot be merged since it's blocked
by an upstream issue (https://github.com/llvm/llvm-project/issues/122166).
This PR contains the changes that can be safely merged separately,
unblocking the development of building the WebGPU EP in WASM.
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
### Description
This change simplifies the o2i_output implementation by reducing
unnecessary intermediate variables, with no change in functionality.
### Motivation and Context
As above.
Signed-off-by: Jianhui Dai <jianhui.j.dai@intel.com>
### Description
Use `LOGS_DEFAULT` for device-lost logging.
Since the GPU device lifecycle is now managed by `WebGpuContext`, it is
able to use ORT logging.
### Description
Increase the absolute error tolerance for test case
`MatMulNBits.Float16Large` to 0.1 for WebGPU with the subgroup
implementation.
Fixes the WebGPU CI pipeline.
Test: onnxruntime_test_all.exe --gtest_filter=SplitOperatorTest.*
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Bumps [black](https://github.com/psf/black) from 24.2.0 to 24.10.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/psf/black/releases">black's
releases</a>.</em></p>
<blockquote>
<h2>24.10.0</h2>
<h3>Highlights</h3>
<ul>
<li>Black is now officially tested with Python 3.13 and provides Python
3.13
mypyc-compiled wheels. (<a
href="https://redirect.github.com/psf/black/issues/4436">#4436</a>) (<a
href="https://redirect.github.com/psf/black/issues/4449">#4449</a>)</li>
<li>Black will issue an error when used with Python 3.12.5, due to an
upstream memory
safety issue in Python 3.12.5 that can cause Black's AST safety checks
to fail. Please
use Python 3.12.6 or Python 3.12.4 instead. (<a
href="https://redirect.github.com/psf/black/issues/4447">#4447</a>)</li>
<li>Black no longer supports running with Python 3.8 (<a
href="https://redirect.github.com/psf/black/issues/4452">#4452</a>)</li>
</ul>
<h3>Stable style</h3>
<ul>
<li>Fix crashes involving comments in parenthesised return types or
<code>X | Y</code> style unions.
(<a
href="https://redirect.github.com/psf/black/issues/4453">#4453</a>)</li>
<li>Fix skipping Jupyter cells with unknown <code>%%</code> magic (<a
href="https://redirect.github.com/psf/black/issues/4462">#4462</a>)</li>
</ul>
<h3>Preview style</h3>
<ul>
<li>Fix type annotation spacing between * and more complex type variable
tuple (i.e. <code>def fn(*args: *tuple[*Ts, T]) -> None: pass</code>)
(<a
href="https://redirect.github.com/psf/black/issues/4440">#4440</a>)</li>
</ul>
<h3>Caching</h3>
<ul>
<li>Fix bug where the cache was shared between runs with and without
<code>--unstable</code> (<a
href="https://redirect.github.com/psf/black/issues/4466">#4466</a>)</li>
</ul>
<h3>Packaging</h3>
<ul>
<li>Upgrade version of mypyc used to 1.12 beta (<a
href="https://redirect.github.com/psf/black/issues/4450">#4450</a>) (<a
href="https://redirect.github.com/psf/black/issues/4449">#4449</a>)</li>
<li><code>blackd</code> now requires a newer version of aiohttp. (<a
href="https://redirect.github.com/psf/black/issues/4451">#4451</a>)</li>
</ul>
<h3>Output</h3>
<ul>
<li>Added Python target version information on parse error (<a
href="https://redirect.github.com/psf/black/issues/4378">#4378</a>)</li>
<li>Add information about Black version to internal error messages (<a
href="https://redirect.github.com/psf/black/issues/4457">#4457</a>)</li>
</ul>
<h2>24.8.0</h2>
<h3>Stable style</h3>
<ul>
<li>Fix crash when <code># fmt: off</code> is used before a closing
parenthesis or bracket. (<a
href="https://redirect.github.com/psf/black/issues/4363">#4363</a>)</li>
</ul>
<h3>Packaging</h3>
<ul>
<li>Packaging metadata updated: docs are explictly linked, the issue
tracker is now also
linked. This improves the PyPI listing for Black. (<a
href="https://redirect.github.com/psf/black/issues/4345">#4345</a>)</li>
</ul>
<h3>Parser</h3>
<ul>
<li>Fix regression where Black failed to parse a multiline f-string
containing another
multiline string (<a
href="https://redirect.github.com/psf/black/issues/4339">#4339</a>)</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/psf/black/blob/main/CHANGES.md">black's
changelog</a>.</em></p>
<blockquote>
<h2>24.10.0</h2>
<h3>Highlights</h3>
<ul>
<li>Black is now officially tested with Python 3.13 and provides Python
3.13
mypyc-compiled wheels. (<a
href="https://redirect.github.com/psf/black/issues/4436">#4436</a>) (<a
href="https://redirect.github.com/psf/black/issues/4449">#4449</a>)</li>
<li>Black will issue an error when used with Python 3.12.5, due to an
upstream memory
safety issue in Python 3.12.5 that can cause Black's AST safety checks
to fail. Please
use Python 3.12.6 or Python 3.12.4 instead. (<a
href="https://redirect.github.com/psf/black/issues/4447">#4447</a>)</li>
<li>Black no longer supports running with Python 3.8 (<a
href="https://redirect.github.com/psf/black/issues/4452">#4452</a>)</li>
</ul>
<h3>Stable style</h3>
<ul>
<li>Fix crashes involving comments in parenthesised return types or
<code>X | Y</code> style unions.
(<a
href="https://redirect.github.com/psf/black/issues/4453">#4453</a>)</li>
<li>Fix skipping Jupyter cells with unknown <code>%%</code> magic (<a
href="https://redirect.github.com/psf/black/issues/4462">#4462</a>)</li>
</ul>
<h3>Preview style</h3>
<ul>
<li>Fix type annotation spacing between * and more complex type variable
tuple (i.e. <code>def fn(*args: *tuple[*Ts, T]) -> None: pass</code>)
(<a
href="https://redirect.github.com/psf/black/issues/4440">#4440</a>)</li>
</ul>
<h3>Caching</h3>
<ul>
<li>Fix bug where the cache was shared between runs with and without
<code>--unstable</code> (<a
href="https://redirect.github.com/psf/black/issues/4466">#4466</a>)</li>
</ul>
<h3>Packaging</h3>
<ul>
<li>Upgrade version of mypyc used to 1.12 beta (<a
href="https://redirect.github.com/psf/black/issues/4450">#4450</a>) (<a
href="https://redirect.github.com/psf/black/issues/4449">#4449</a>)</li>
<li><code>blackd</code> now requires a newer version of aiohttp. (<a
href="https://redirect.github.com/psf/black/issues/4451">#4451</a>)</li>
</ul>
<h3>Output</h3>
<ul>
<li>Added Python target version information on parse error (<a
href="https://redirect.github.com/psf/black/issues/4378">#4378</a>)</li>
<li>Add information about Black version to internal error messages (<a
href="https://redirect.github.com/psf/black/issues/4457">#4457</a>)</li>
</ul>
<h2>24.8.0</h2>
<h3>Stable style</h3>
<ul>
<li>Fix crash when <code># fmt: off</code> is used before a closing
parenthesis or bracket. (<a
href="https://redirect.github.com/psf/black/issues/4363">#4363</a>)</li>
</ul>
<h3>Packaging</h3>
<ul>
<li>Packaging metadata updated: docs are explictly linked, the issue
tracker is now also
linked. This improves the PyPI listing for Black. (<a
href="https://redirect.github.com/psf/black/issues/4345">#4345</a>)</li>
</ul>
<h3>Parser</h3>
<ul>
<li>Fix regression where Black failed to parse a multiline f-string
containing another</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="1b2427a2b7"><code>1b2427a</code></a>
Prepare release 24.10.0 (<a
href="https://redirect.github.com/psf/black/issues/4471">#4471</a>)</li>
<li><a
href="a22b1ebbfd"><code>a22b1eb</code></a>
Add mypyc 3.13 wheel build (<a
href="https://redirect.github.com/psf/black/issues/4449">#4449</a>)</li>
<li><a
href="b7d0e7212b"><code>b7d0e72</code></a>
Bump AndreMiras/coveralls-python-action from
65c1672f0b8a201702d86c81b79187df...</li>
<li><a
href="f1a2f92bba"><code>f1a2f92</code></a>
Include --unstable in cache key (<a
href="https://redirect.github.com/psf/black/issues/4466">#4466</a>)</li>
<li><a
href="8d9d18c033"><code>8d9d18c</code></a>
Fix skipping Jupyter cells with unknown %% magic (<a
href="https://redirect.github.com/psf/black/issues/4462">#4462</a>)</li>
<li><a
href="bbfdba3a5e"><code>bbfdba3</code></a>
Fix docs CI: use venv for uv to fix 'failed to create directory' (<a
href="https://redirect.github.com/psf/black/issues/4460">#4460</a>)</li>
<li><a
href="8fb2add1f7"><code>8fb2add</code></a>
Use builtin generics (<a
href="https://redirect.github.com/psf/black/issues/4458">#4458</a>)</li>
<li><a
href="2a45cecf29"><code>2a45cec</code></a>
Fix crashes with comments in parentheses (<a
href="https://redirect.github.com/psf/black/issues/4453">#4453</a>)</li>
<li><a
href="b4d6d8632d"><code>b4d6d86</code></a>
Drop Python 3.8 support (<a
href="https://redirect.github.com/psf/black/issues/4452">#4452</a>)</li>
<li><a
href="ac018c16ca"><code>ac018c1</code></a>
Require newer aiohttp for blackd (<a
href="https://redirect.github.com/psf/black/issues/4451">#4451</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/psf/black/compare/24.2.0...24.10.0">compare
view</a></li>
</ul>
</details>
<br />
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
### Description
This PR uses subgroup operations to implement MatMulNBits when
tile_m > 1 on Intel devices.
With this PR, prefill for a 500-token prompt for Phi-3 improves from
8.5s to 3.5s on Intel Meteor Lake.
### Description
Spec of LayerNormalization supports broadcasting (tensors Scale and B
should be unidirectional broadcastable to tensor X).
https://onnx.ai/onnx/operators/onnx__LayerNormalization.html
However, current implementation only allow scale and bias size to be
X.shape()[axis:].
Example of input tensors that normalized with axis=2:
| X shape | Scale shape | B shape | Before | After |
| - | - | - | - | - |
| (B, S, D) | (D) | (D) | Supported | Supported |
| (B, S, D) | (1, 1, D) | (1, 1, D) | Supported | Supported |
| (B, S, D) | (B, 1, D) | (B, 1, D) | Not Supported | Supported |
| (B, S, D) | (1, S, D) | (1, S, D) | Not Supported | Supported |
| (B, S, D) | (B, S, D) | (B, S, D) | Not Supported | Supported |
Here we add limited support: axis=2; scale and bias have the same shape;
scale/bias/X have the same number of dimensions. This covers common use
cases in LLM and vision models.
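The newly supported cases correspond to a NumPy computation like the following (a sketch of the reference semantics with axis=2, not the EP kernel itself):

```python
import numpy as np

def layer_norm(x, scale, bias, axis=2, eps=1e-5):
    # Normalize over the axes from `axis` onward, then apply scale and
    # bias, which may broadcast against x (e.g. shape (B, 1, D)).
    axes = tuple(range(axis, x.ndim))
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * scale + bias

B, S, D = 2, 3, 4
x = np.random.randn(B, S, D).astype(np.float32)
scale = np.random.randn(B, 1, D).astype(np.float32)  # per-batch scale
bias = np.zeros((B, 1, D), dtype=np.float32)
y = layer_norm(x, scale, bias)  # (B, 1, D) scale/bias: newly supported
```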
### Motivation and Context
Support Stable Diffusion 3.x and Flux model.
### Description
Fixes the build when specifying the flag `--target
onnxruntime_providers_webgpu`.
Otherwise the following error occurs:
```
range.cc
D:\code\onnxruntime\build\Windows\Debug\_deps\onnx-src\onnx\onnx_pb.h(65,10): error C1083: Cannot open include file: 'onnx/onnx-ml.pb.h': No such file or directory [D:\code\onnxruntime\build\Windows\Debug\onnxruntime_providers_webgpu.vcxproj]
(compiling source file '../../../onnxruntime/core/providers/webgpu/math/binary_elementwise_ops.cc')
```
Fix some inconsistencies.
All our iOS builds should target iOS 15.1.
All our macOS desktop builds should target macOS 13.3, to align with the
changes made in #17361.
### Description
Fix an error causing incorrect output when past key/value shares a
buffer with present key/value.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->