### Description
Pre-built QNN Android package
### Future Work
1. Setting up CI with BrowserStack: onnxruntime_tests and Android test
2. ESRP Release to Maven
### Description
In TensorRT 10.5, the APIs `platformHasFastFp16` and
`platformHasFastInt8` have been deprecated.
Ignore these deprecation warnings.
Signed-off-by: Kevin Chen <kevinch@nvidia.com>
### Description
Update segment anything 2 benchmark script:
(1) Fix CUDA graph in the benchmark. Make sure `--use_cuda_graph` takes effect
and `random_inputs()` generates inputs according to the dtype of the model.
(2) Add a parameter to enable profiling.
(3) Use latest cuda 12.6.2 and cudnn 9.5.
(4) Update README.md.
### Motivation and Context
Previously, `--use_cuda_graph` did not take effect. This fixes the
benchmark.
### Description
Add `SetEpDynamicOptions` and remove `workload_type` from run/session
options.
### Motivation and Context
Added `SetEpDynamicOptions` as a dynamic way of changing EP settings, even
in the middle of a Run.
Using workload_type run/session options to set Efficient/Default mode
for workloads does not cover all the scenarios and can lead to priority
inversions. Working on a new API to support setting Efficient/Default
mode for workloads.
---------
Co-authored-by: Luis E. Pena <luispena@microsoft.com>
### Description
Resolve #21976.
ABSL generally does not have forward/backward compatibility. Our code is
only compatible with one fixed LTS version. So it's important to fix the
version number there when using find_package to detect an installed
version.
### Description
This pipeline runs after "Python-CUDA-Packaging-Pipeline", which runs on a CPU
machine and skips all tests.
This testing pipeline is for running those tests.
This fixes a bug found by libfuzzer:
LayerNormalization third input (beta) is optional. The following code
has potential out of bound access if the input is not available:
```
NodeArg* beta = layer_norm_node.MutableInputDefs()[2];
```
This adds a check to ensure the third input exists before fusion.
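The guard can be illustrated with a small Python sketch. It is not the ONNX Runtime fusion code: `input_defs` and `get_optional_input` are hypothetical stand-ins for the node's input list and for the existence check that this change performs before touching index 2.

```python
from typing import Optional, Sequence


def get_optional_input(input_defs: Sequence[str], index: int) -> Optional[str]:
    """Return the input name at `index`, or None when the input is absent.

    Hypothetical helper: an optional input may be missing entirely
    (the list is too short) or present as an empty name.
    """
    if index >= len(input_defs) or not input_defs[index]:
        return None
    return input_defs[index]


# LayerNormalization with only X and scale: beta (index 2) is absent.
without_beta = ["X", "scale"]
# LayerNormalization with all three inputs.
with_beta = ["X", "scale", "beta"]

# Indexing without_beta[2] directly would raise IndexError in Python;
# the guarded lookup returns None instead.
print(get_optional_input(without_beta, 2))
print(get_optional_input(with_beta, 2))
```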
[AB#49036](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/49036)
### Description
* Add a few arguments CUDA_VERSION, CUDNN_VERSION, OS, GIT_COMMIT,
GIT_BRANCH and ONNXRUNTIME_VERSION to the Dockerfile.cuda to allow for
more flexibility in the build process.
* Update README.md to include the new arguments and their usage.
* Add labels to the image so that it is easy to inspect.
Available CUDA versions for ubuntu 24.04 can be found
[here](https://hub.docker.com/r/nvidia/cuda/tags), and available CUDNN
versions can be found
[here](https://pypi.org/project/nvidia-cudnn-cu12/#history). Example
command line to build docker image:
```
docker build -t onnxruntime-cuda --build-arg CUDA_VERSION=12.6.1 \
--build-arg CUDNN_VERSION=9.5.0.50 \
--build-arg GIT_BRANCH=$(git rev-parse --abbrev-ref HEAD) \
--build-arg GIT_COMMIT=$(git rev-parse HEAD) \
--build-arg ONNXRUNTIME_VERSION=$(cat ../VERSION_NUMBER) \
-f Dockerfile.cuda ..
```
Example labels from `docker inspect onnxruntime-cuda`:
```
"Labels": {
"CUDA_VERSION": "12.6.1",
"CUDNN_VERSION": "9.5.0.50",
"maintainer": "Changming Sun <chasun@microsoft.com>",
"onnxruntime_git_branch": "main",
"onnxruntime_git_commit": "bc84958dcef5c6017ae58085f55b669efd74f4a5",
"onnxruntime_version": "1.20.0",
"org.opencontainers.image.ref.name": "ubuntu",
"org.opencontainers.image.version": "24.04"
}
```
### Motivation and Context
https://github.com/microsoft/onnxruntime/pull/22339 hard-coded the
CUDA and cuDNN versions. Users might want to choose specific CUDA and
cuDNN versions when building the docker image.
Fix the QNN nuget package issue
### Description
Inside the package, the folder name `\runtimes\win-arm64\` was changed to `\runtimes\win-ARM64\`, which breaks the lib copy settings in `Microsoft.ML.OnnxRuntime.QNN.props`.
### Motivation and Context
Fix issue: https://github.com/microsoft/onnxruntime/issues/21692
### Description
Update the commit from 59600894a2c1c18290944b83e989bfe618975230 to
1887322ed36d522409a6b805d4e7942cf76a8e40
### Motivation and Context
The new one has python 3.13.
AB#50959
This reverts commit 4e15b229a0.
Reason: We are seeing an increase in the number of deadlocks after this
PR. We have a release coming up next week and do not have enough time to
investigate the root cause, hence reverting this PR temporarily.
Moreover, this is causing an increase in the binary size.
### Description
We are seeing an [increase in the number of
deadlocks](https://github.com/microsoft/onnxruntime/pull/22315#issuecomment-2394821893)
after this PR. We have a release coming up next week and do not have
enough time to investigate the root cause, hence reverting this PR
temporarily.
### Motivation and Context
See above.
### Description
This change introduces the WebGPU EP into ONNX Runtime.
To keep the PR as simple as possible, it excludes the following:
- C API changes for WebGPU EP
- actual implementation of WebGPU EP. Currently in this PR, WebGPU is a
stub implementation that does not register any kernel.
- Python IO Binding update
- Node.js IO Binding update
This PR now contains only 43 file changes (while the working branch
contains 130+), which should make it easier to review.
Separate PRs will follow for each item listed above.
Current working branch: #21904
### Description
Serves as an example of how to build and run onnxruntime-gpu with the latest
software stack.
To build the docker image:
```
git clone https://github.com/microsoft/onnxruntime
cd onnxruntime/dockerfiles
docker build -t onnxruntime-cuda -f Dockerfile.cuda ..
```
To launch the docker image built in the previous step (and mount the code
directory to run the unit test below):
```
cd ..
docker run --rm -it --gpus all -v $PWD:/code onnxruntime-cuda /bin/bash
```
Then run the following inside the container to verify that the CUDA provider
works:
```
python /code/onnxruntime/test/python/onnxruntime_test_python_cudagraph.py
```
### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/22335
### Description
In GQA there was a memory issue, best described by @edgchen1
[here](https://github.com/microsoft/onnxruntime/issues/22252#issuecomment-2384559255):
> here's the problematic code:
>
>
d9de054eb5/onnxruntime/contrib_ops/cpu/bert/group_query_attention.cc (L149-L157)
>
> annotated:
>
> ```c++
> if (packed_qkv) {
>   // Q is an OrtValue declared in the enclosing scope.
>   OrtValue RotaryQKV;
>   Tensor::InitOrtValue(element_type, TensorShape({batch_size, num_heads_ + 2 * kv_num_heads_, sequence_length, head_size}), allocator, RotaryQKV);
>   // Save pointer to Q's data in q_input.
>   q_input = Q.Get<Tensor>().Data<T>();
>   k_input = q_input + num_heads_ * sequence_length * head_size;
>   q_rotary = RotaryQKV.GetMutable<Tensor>()->MutableData<T>();
>   k_rotary = q_rotary + num_heads_ * sequence_length * head_size;
>   // Overwrite Q with RotaryQKV (OrtValues contain shared_ptr to contained value).
>   // Now, q_input is pointing to freed memory.
>   Q = RotaryQKV;
> }
> ```
>
> later on, when we use `q_input`, there is a read access violation.
>
>
d9de054eb5/onnxruntime/contrib_ops/cpu/bert/group_query_attention.cc (L170-L172)
>
> this problem showed up when CPU allocator sharing between sessions was
enabled. in that case, the CPU allocator's arena was disabled. I suspect
that the default usage of the arena hid this issue.
>
> though I debugged into the first branch, this appears to be a problem
in both branches:
>
>
d9de054eb5/onnxruntime/contrib_ops/cpu/bert/group_query_attention.cc (L149-L168)
### Motivation and Context
Fixes a critical bug. The issue was reported in
https://github.com/microsoft/onnxruntime/issues/22252
### Description
Add ROCm EP option to benchmark.py script when using int8 quantization.
### Motivation and Context
Without this change, benchmarks with int8 quantization cannot be run with
the ROCm execution provider.
### Description
Specify type to fix warning
### Motivation and Context
### Description
Increase TensorRT tolerance from the default 1e-5 to 1e-3 after TRT 10.4.
### Motivation and Context
### Description
Add an interface to get config_options from onnxruntime.
### Motivation and Context
To support configuring session_options after the EP has been appended, the EP
side needs a way to get these configurations.
### Description
(1) Support ONNX data types in Python APIs:
* IOBinding.bind_input
* IOBinding.bind_output
* ortvalue_from_shape_and_type
(2) Add unit tests, which serve as examples of running BFloat16 or
Float8 models in Python.
Other minor changes:
(3) Replace the deprecated NP_TYPE_TO_TENSOR_TYPE with a helper API.
(4) Rename ortvalue_from_numpy_with_onnxtype to
ortvalue_from_numpy_with_onnx_type.
The integer values of ONNX element types can be found at
https://onnx.ai/onnx/api/mapping.html. Note that FLOAT4E2M1 is not
supported yet.
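As a quick reference, the element type integers for the types discussed here can be tabulated as below. This is a minimal sketch, not part of the PR: the values are taken from the ONNX `TensorProto.DataType` enum, and `ONNX_ELEMENT_TYPES`, `UNSUPPORTED`, and `check_supported` are hypothetical names used only for illustration.

```python
# Integer values from the ONNX TensorProto.DataType enum.
ONNX_ELEMENT_TYPES = {
    "BFLOAT16": 16,
    "FLOAT8E4M3FN": 17,
    "FLOAT8E4M3FNUZ": 18,
    "FLOAT8E5M2": 19,
    "FLOAT8E5M2FNUZ": 20,
    "UINT4": 21,
    "INT4": 22,
    "FLOAT4E2M1": 23,
}

# Per the description, FLOAT4E2M1 is not supported yet.
UNSUPPORTED = {"FLOAT4E2M1"}


def check_supported(type_name: str) -> int:
    """Return the ONNX element type integer, or raise if it cannot be used."""
    if type_name not in ONNX_ELEMENT_TYPES:
        raise KeyError(f"unknown ONNX element type: {type_name}")
    if type_name in UNSUPPORTED:
        raise ValueError(f"{type_name} is not supported yet")
    return ONNX_ELEMENT_TYPES[type_name]


bfloat16_type = check_supported("BFLOAT16")
print(bfloat16_type)
```

The returned integer is the kind of value that would be passed as the element type to the Python APIs listed in (1).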
### Motivation and Context
The current Python API does not support BFloat16 and Float8 (FLOAT8E4M3FN,
FLOAT8E4M3FNUZ, FLOAT8E5M2, FLOAT8E5M2FNUZ) types, or other new data
types such as INT4 and UINT4.
This change removes the limitation.
https://github.com/microsoft/onnxruntime/issues/13001
https://github.com/microsoft/onnxruntime/issues/20481
https://github.com/microsoft/onnxruntime/issues/20578
### Description
With the TensorRT 10.4 update, the name of the TensorRT Windows package changed.
### Motivation and Context
### Description
Add support for FP16 kernels in the XnnPack execution provider for
MaxPool operations.
Fixes:
[AB#50332](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/50332)
### Motivation and Context
The main purpose of this pull request is to add some common
variables/functions and set up a consistent style for adding FP16 kernels to
the XnnPack EP.
---------
### Description
- removed installing AppCenter + pipeline step that runs AppCenter
Espresso tests
- added script for running AppCenter tests
### Motivation and Context
App Center is being deprecated within the next year, and we have upcoming
Android work that depends on working E2E testing.
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
The purpose of the patch is primarily to save power, but it also has
nice perf benefits (mostly from allowing the system to better distribute
power to cores doing meaningful work).
Changes are twofold:
1) Decrease the WorkerLoop spin count dramatically, from ~10^6 to ~10^4.
In practice, if no new work has been added after ~10^4 spins,
it's unlikely that any new work is imminent, so sleep to
preserve power. This aligns more closely with upstream EigenV3.
2) Use exponential backoff for waiting on memory. This saves a bit
more power and, importantly, increases the time between iterations
in WorkerLoop to help accommodate the dramatically lowered spin
counts.
Since the tuning for both the iteration counts and the backoff counts is
dramatically different for hybrid/non-hybrid systems, this patch
templates the affected functions and dynamically chooses based on
`CPUIDInfo::IsHybrid()`. This seemed like the "lightest weight" way of
getting the change in, although it's likely we could incur less dynamic
overhead if we added the template argument to the entirety of
`ThreadPoolTempl`.
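The spin-then-sleep idea with exponential backoff can be sketched in Python. This is an illustration under stated assumptions, not the ONNX Runtime thread pool: `backoff_schedule`, `spin_for_work`, `max_spins`, and `max_backoff` are hypothetical names, and the empty inner loop stands in for a CPU pause hint.

```python
from typing import Callable, List, Tuple


def backoff_schedule(polls: int, max_backoff: int) -> List[int]:
    """Pause length after each poll: doubles every time, capped at max_backoff."""
    schedule = []
    backoff = 1
    for _ in range(polls):
        schedule.append(backoff)
        backoff = min(backoff * 2, max_backoff)
    return schedule


def spin_for_work(has_work: Callable[[], bool],
                  max_spins: int = 10_000,
                  max_backoff: int = 64) -> Tuple[bool, int]:
    """Poll `has_work` up to max_spins times with growing pauses in between.

    Returns (found, polls). When found is False, the caller would stop
    spinning and block (sleep) until it is woken up.
    """
    backoff = 1
    for poll in range(1, max_spins + 1):
        if has_work():
            return True, poll
        # Stand-in for a CPU pause/yield hint repeated `backoff` times.
        for _ in range(backoff):
            pass
        backoff = min(backoff * 2, max_backoff)
    return False, max_spins


print(backoff_schedule(6, 8))
print(spin_for_work(lambda: False, max_spins=100, max_backoff=4))
```

Because each pause grows until it reaches the cap, a lower spin count still covers a comparable waiting window while polling far less often.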
Measured performance on an [Intel Meteor Lake
CPU](https://www.intel.com/content/www/us/en/products/sku/237329/intel-core-ultra-7-processor-165u-12m-cache-up-to-4-90-ghz/specifications.html)
across a range of models.
Below are the results of 3 runs, with each metric being the
value-before-patch / value-after-patch (so for something like inference
time, lower is better).
<div align="center">
<table>
<tr>
<th>Session creation time cost</th>
<td>0.7179</td>
</tr>
<tr>
<th>First inference time cost</th>
<td>0.7156</td>
</tr>
<tr>
<th>Total inference time cost</th>
<td>1.0146</td>
</tr>
<tr>
<th>Total inference requests</th>
<td>0.8874</td>
</tr>
<tr>
<th>Average inference time cost</th>
<td>0.8800</td>
</tr>
<tr>
<th>Total inference run time</th>
<td>1.0146</td>
</tr>
<tr>
<th>Number of inferences per second</th>
<td>0.8955</td>
</tr>
<tr>
<th>Avg CPU usage</th>
<td>0.9462</td>
</tr>
<tr>
<th>Peak working set size</th>
<td>0.9922</td>
</tr>
<tr>
<th>Runs</th>
<td>1.1552</td>
</tr>
<tr>
<th>Min Latency</th>
<td>0.7283</td>
</tr>
<tr>
<th>Max Latency</th>
<td>0.9258</td>
</tr>
<tr>
<th>P50 Latency</th>
<td>0.9534</td>
</tr>
<tr>
<th>P90 Latency</th>
<td>0.9639</td>
</tr>
<tr>
<th>P95 Latency</th>
<td>0.9659</td>
</tr>
<tr>
<th>P99 Latency</th>
<td>0.9640</td>
</tr>
</table>
</div>
So the net result is a 1.16x improvement in throughput and between
1.08-1.37x improvement in latency.
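The summary figures can be reproduced from the table. The short check below assumes, as the numbers suggest, that the throughput figure is the `Runs` ratio and that the latency range comes from the reciprocals of the `Max Latency` and `Min Latency` ratios. That reading is an assumption, not something stated explicitly in the PR.

```python
# Ratios copied from the table.
runs_ratio = 1.1552
min_latency_ratio = 0.7283
max_latency_ratio = 0.9258

# Throughput gain read directly from the Runs ratio.
throughput_gain = round(runs_ratio, 2)
# Latency gains taken as reciprocals of the latency ratios.
latency_gain_low = round(1 / max_latency_ratio, 2)
latency_gain_high = round(1 / min_latency_ratio, 2)

print(f"throughput: {throughput_gain}x")
print(f"latency: {latency_gain_low}x to {latency_gain_high}x")
```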
### Description
Java parts of Multi-LoRA support - #22046.
### Motivation and Context
API equivalence with Python & C#.
---------
Co-authored-by: Dmitri Smirnov <dmitrism@microsoft.com>
- Add Java API for appending QNN EP
- Update Java unit test setup
- Fix issues with setting system properties for tests
- Unify Windows/non-Windows setup to simplify
This pull request introduces several enhancements to the benchmarking
process for the SAM2 model, including:
(1) Add profiling capabilities.
(2) Test torch compile modes (`none` disables compile and falls back to
eager mode).
(3) Update README for setting up the environment.
### Documentation Updates:
* README.md: Updated instructions to create separate conda environments
for GPU and CPU benchmarking, and detailed the parameters and outputs of
the benchmark script.
### Benchmark Script Enhancements:
* benchmark_sam2.py: Added optional parameters for enabling NVTX and
PyTorch profiling, and adjusted the initialization and execution flow to
incorporate these profiling options.
These changes enhance the flexibility and functionality of the
benchmarking process, making it easier to profile and benchmark the SAM2
model on different hardware configurations.
### Description
NS is no longer developed, and ORT doesn't use it for int4 inference
either. Remove it to clean up the code.
### Motivation and Context
### Description
This PR fixes an equation in the MatMulNBits op spec. The old formula is
stated as
```
[CeilDiv((N * n_blocks_per_col + 1) * bits, 8)]
```
but it should be stated as
```
[N * CeilDiv(n_blocks_per_col * bits, 8)]
```
or as
```
[N * FloorDiv((n_blocks_per_col + 1) * bits, 8)]
```
### Motivation and Context
For models such as ChatGLM where the column size is odd, the division
math can be off. For example:

With the old equation, the projections are calculated as follows.
```
# Down projection
B = 4,096 x 107 x 64
zero_points = 221,184
N = 4,096
n_blocks_per_col = 107
4,096 * CeilDiv((107 + 1) * 4, 8) = 4,096 * CeilDiv(108 * 4, 8) = 4,096 * 54 = 221,184
# Up projection
B = 13,696 x 32 x 64
zero_points = 219,136
N = 13,696
n_blocks_per_col = 32
13,696 * CeilDiv((32 + 1) * 4, 8) = 13,696 * CeilDiv(33 * 4, 8) = 13,696 * 17 = 232,832
```
With the new equation, the projections are calculated as follows.
```
# Down projection
B = 4,096 x 107 x 64
zero_points = 221,184
N = 4,096
n_blocks_per_col = 107
4,096 * CeilDiv(107 * 4, 8) = 4,096 * 54 = 221,184
# Up projection
B = 13,696 x 32 x 64
zero_points = 219,136
N = 13,696
n_blocks_per_col = 32
13,696 * CeilDiv(32 * 4, 8) = 13,696 * 16 = 219,136
```
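The arithmetic above can be checked with a short Python sketch. The function names are hypothetical; `zero_points_size_old` follows the old calculation as it is written out in the worked example, `zero_points_size_new` follows the `CeilDiv` form of the corrected formula, and `zero_points_size_floor` follows the `FloorDiv` alternative.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up."""
    return -(-a // b)


def zero_points_size_old(n: int, n_blocks_per_col: int, bits: int) -> int:
    """Old calculation from the worked example: N * CeilDiv((blocks + 1) * bits, 8)."""
    return n * ceil_div((n_blocks_per_col + 1) * bits, 8)


def zero_points_size_new(n: int, n_blocks_per_col: int, bits: int) -> int:
    """Corrected formula: N * CeilDiv(blocks * bits, 8)."""
    return n * ceil_div(n_blocks_per_col * bits, 8)


def zero_points_size_floor(n: int, n_blocks_per_col: int, bits: int) -> int:
    """Alternative corrected form: N * FloorDiv((blocks + 1) * bits, 8)."""
    return n * (((n_blocks_per_col + 1) * bits) // 8)


# (N, n_blocks_per_col, actual zero_points size) from the examples.
cases = {
    "down projection": (4096, 107, 221_184),
    "up projection": (13_696, 32, 219_136),
}

for name, (n, blocks, actual) in cases.items():
    old = zero_points_size_old(n, blocks, bits=4)
    new = zero_points_size_new(n, blocks, bits=4)
    alt = zero_points_size_floor(n, blocks, bits=4)
    print(f"{name}: old={old} new={new} floor_form={alt} actual={actual}")
```

For these 4-bit cases, both corrected forms match the actual `zero_points` sizes, while the old calculation overshoots for the up projection, where `n_blocks_per_col` is even.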
### Description
In macOS 15, apps running with CoreML will crash with an error message
like this one:
```
Terminating app due to uncaught exception 'NSGenericException', reason: 'Failed to set compute_device_types_mask E5RT: Cannot provide zero compute device types. (1)'
```
This can be easily seen when building ONNXRuntime from source and
running the unit tests. The fix was suggested in [this bug
report](https://forums.developer.apple.com/forums/thread/757040).
I've ported the change to ONNXRuntime and verified that:
* The issue is resolved in macOS 15 (all unit tests pass).
* The behaviour is unchanged in macOS 14.
### Motivation and Context
This fixes #22275, allowing apps using ONNXRuntime with CoreML to work
normally.
### Description
Fix syntax so usability checker works as expected.
### Motivation and Context
Currently, in debug mode, unit tests always download models to the local
file system, which is a bit annoying. This PR fixes this by adding a
specific option to enable model download.