### Description
Add support for FP16 kernels in the XnnPack execution provider for
MaxPool operations.
Fixes:
[AB#50332](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/50332)
### Motivation and Context
The major purpose of this pull request is to add some common
vars/functions and setup a consistent style for adding FP16 kernels in
XnnPack EP.
---------
### Description
- removed installing AppCenter + pipeline step that runs AppCenter
Espresso tests
- added script for running AppCenter tests
### Motivation and Context
App Center is getting deprecated in the next year + we have upcoming
Android work that depends on working E2E testing.
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
The purpose of the patch is primarily to save power, but it also has
nice perf benefits (mostly from allowing the system to better distribute
power to cores doing meaningful work).
Changes are twofold:
1) Decrease WorkerLoop spin count dramatically ~10^6 -> ~10^4. The
reality is after ~10^4 spins, if there hasn't been any new work
added its unlikely any new work is imminent so sleep to
preserve power. This aligns more closely with upstream EigenV3.
2) Use exponential backoff for waiting on memory. This saves a bit
more power, and important increases the time between iterations
in WorkerLoop to help accomidate the dramatically lowering spin
counts.
Since the tuning for both the iteration counts / backoff counts are
dramatically different for hybrid/non-hybrid systems, this patch
templates the affected functions and dynamically choses based on
`CPUIDInfo::IsHybrid()`. This seemed like the "lightest weight" way of
getting the change in, although its likely we could incur less dynamic
overhead if we added the template argument to the entirety of
`ThreadPoolTempl`.
Measured performance on an [Intel Meteor Lake
CPU](https://www.intel.com/content/www/us/en/products/sku/237329/intel-core-ultra-7-processor-165u-12m-cache-up-to-4-90-ghz/specifications.html)
across a range of models.
Below are the result of 3 runs with each metric being the
value-before-patch / value-after-patch (so for something like inference
time, lower is better).
<div align="center">
<table>
<tr>
<th>Session creation time cost</th>
<td>0.7179</td>
</tr>
<tr>
<th>First inference time cost</th>
<td>0.7156</td>
</tr>
<tr>
<th>Total inference time cost</th>
<td>1.0146</td>
</tr>
<tr>
<th>Total inference requests</th>
<td>0.8874</td>
</tr>
<tr>
<th>Average inference time cost</th>
<td>0.8800</td>
</tr>
<tr>
<th>Total inference run time</th>
<td>1.0146</td>
</tr>
<tr>
<th>Number of inferences per second</th>
<td>0.8955</td>
</tr>
<tr>
<th>Avg CPU usage</th>
<td>0.9462</td>
</tr>
<tr>
<th>Peak working set size</th>
<td>0.9922</td>
</tr>
<tr>
<th>Runs</th>
<td>1.1552</td>
</tr>
<tr>
<th>Min Latency</th>
<td>0.7283</td>
</tr>
<tr>
<th>Max Latency</th>
<td>0.9258</td>
</tr>
<tr>
<th>P50 Latency</th>
<td>0.9534</td>
</tr>
<tr>
<th>P90 Latency</th>
<td>0.9639</td>
</tr>
<tr>
<th>P95 Latency</th>
<td>0.9659</td>
</tr>
<tr>
<th>P99 Latency</th>
<td>0.9640</td>
</tr>
</table>
</div>
So the net result is a 1.16x improvement in throughput and between
1.08-1.37x improvement in latency.
### Description
Java parts of Multi-LoRA support - #22046.
### Motivation and Context
API equivalence with Python & C#.
---------
Co-authored-by: Dmitri Smirnov <dmitrism@microsoft.com>
- Add Java API for appending QNN EP
- Update Java unit test setup
- Fix issues with setting system properties for tests
- Unify Windows/non-Windows setup to simplify
This pull request introduces several enhancements to the benchmarking
process for the SAM2 model, including:
(1) Add profiling capabilities.
(2) test torch compile modes (none will disable compile and fallback to
eager mode)
(3) Update README for setting up the environment.
### Documentation Updates:
* README.md: Updated instructions to create separate conda environments
for GPU and CPU benchmarking, and detailed the parameters and outputs of
the benchmark script.
### Benchmark Script Enhancements:
* benchmark_sam2.py: Added optional parameters for enabling NVTX and
PyTorch profiling, and adjusted the initialization and execution flow to
incorporate these profiling options.
These changes enhance the flexibility and functionality of the
benchmarking process, making it easier to profile and benchmark the SAM2
model on different hardware configurations.
### Description
<!-- Describe your changes. -->
NS is not developed anymore and ORT doesn't use it for int4 inference
either. Remove it to clean up the code
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
This PR fixes an equation in the MatMulNBits op spec. The old formula is
stated as
```
[CeilDiv((N * n_blocks_per_col + 1) * bits, 8)]
```
but it should be stated as
```
[N * CeilDiv(n_blocks_per_col * bits, 8)]
```
or as
```
[N * FloorDiv((n_blocks_per_col + 1) * bits, 8)]
```
### Motivation and Context
For models such as ChatGLM where the column size is odd, the division
math can be off. For example:

With the old equation, the projections are calculated as follows.
```
# Down projection
B = 4,096 x 107 x 64
zero_points = 221,184
N = 4,096
n_blocks_per_col = 107
4,096 * CeilDiv((107 + 1) * 4, 8) = 4,096 * CeilDiv(108 * 4, 8) = 4,096 * 54 = 221,184
# Up projection
B = 13,696 x 32 x 64
zero_points = 219,136
N = 13,696
n_blocks_per_col = 32
13,696 * CeilDiv((32 + 1) * 4, 8) = 13,696 * CeilDiv(33 * 4, 8) = 13,696 * 17 = 232,832
```
With the new equation, the projections are calculated as follows.
```
# Down projection
B = 4,096 x 107 x 64
zero_points = 221,184
N = 4,096
n_blocks_per_col = 107
4,096 * CeilDiv(107 * 4, 8) = 4,096 * 54 = 221,184
# Up projection
B = 13,696 x 32 x 64
zero_points= 219,136
N = 13,696
n_blocks_per_col = 32
13,696 * CeilDiv(32 * 4, 8) = 13,696 * 16 = 219,136
```
### Description
In macOS 15, apps running with CoreML will crash with an error message
like this one:
```
Terminating app due to uncaught exception 'NSGenericException', reason: 'Failed to set compute_device_types_mask E5RT: Cannot provide zero compute device types. (1)'
```
This can be easily seen when building ONNXRuntime from source and
running the unit tests. The fix was suggested in [this bug
report](https://forums.developer.apple.com/forums/thread/757040).
I've ported the change to ONNXRuntime and verified that:
* The issue is resolved in macOS 15 (all unit tests pass).
* The behaviour is unchanged in macOS 14.
### Motivation and Context
This fixes#22275 allowing apps using ONNXRuntime with CoreML to work
normally.
### Description
<!-- Describe your changes. -->
Fix syntax so usability checker works as expected.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Currently in debug mode, unit test will always download models to local
file system, which is a bit annoying. This PR fixes this by adding a
specific option to enable model download.
In current implementation, axis in softmax has to be the last, which is
an obvious limitation. This PR removes this limitation and will fix
issues #20710 and #22176.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Replace gradle/wrapper-validation-action with
gradle/actions/wrapper-validation-action
### Motivation and Context
This is recommended by
https://github.com/gradle/wrapper-validation-action. This job uses
deprecated functionality from the 'gradle/wrapper-validation-action'
action.
### Description
To fix the build issues for AIX OS while using system installed
protobuf/onnx.
### Motivation and Context
Code changes in this PR contains:
1. Fix for below compilation issue.
```
collect2: fatal error: library liblibprotobuf-lite not found
compilation terminated.
```
2. Adding onnx library into dependency list for test applicaitons.
### Description
if the variable is 1, the job running on A100 in PR checks.
Fixes
[AB#50333](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/50333)
### Motivation and Context
We wish more big models which need to run on A100 can be tested in PR
checks, but Azure may decommission A100 agents without notifications
sometimes, which will block merging PRs.
This PR is an improvement of current workaround, making those jobs only
run main branch.
Once we find the A100 are all decommisioned by Azure, we could change
the UseA100 variable to 0 to disable the A100 jobs in PR checks
### Description
Support Float16 for CoreML MLProgram EP.
Operations:
"Add", "Mul", "Sub", "Div", "Pow", "Sqrt", "Reciprocal",
"Sigmoid", "Tanh", "Relu", "LeakyRelu", "Concat", "GridSample",
"GlobalAveragePool",
"Clip", "DepthToSpace", "Resize", "Slice", "Conv",
"ConvTranspose", "GlobalMaxPool", "Gemm", "MatMul",
"AveragePool", "MaxPool", "Reshape", "Split", "Transpose"
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Scott McKay <skottmckay@gmail.com>
While allowing axes in unsqueeze to be scalar, its shape couldn't be
always accessed like a vector. This PR fixes issue #22031 so that the
original model could run well.
### Description
Enables using the MLTensor to pass data between models.
### Motivation and Context
Using MLTensor instead of ArrayBuffers reduces the number of copies
between the CPU and devices as well as the renderer and GPU process in
Chromium.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
<!-- Describe your changes. -->
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
`If` nodes can have sequence outputs. Those nodes are mapped to the DML
EP to be able to keep the outputs on the GPU, but they actually execute
on the CPU by selecting either the `then` subgraph or the `else`
subgraph.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Update code regarding some QNN bug fixes:
1. QnnProfile_ExtendedEventData_t.version is not initialized in Qnn
2. Failed to finalize the graph for HardSigmoid with FP16 precision
### Description
<!-- Describe your changes. -->
Jar maven signing:
- GnuPG
- sha256.
Jar packages artifacts:
- onnxruntime-android-full-aar
- onnxruntime-java
- onnxruntime-java-gpu
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Previously, it is manually signed.
Goal: make it automatically.
(1) Fix a bug of parameters order.
(2) Update benchmark script:
* download test image if not exist
* combine multiple csv files into one file, and remove duplicated lines
(3) Add a section for benchmark in README.md
### Description
<!-- Describe your changes. -->
Increase the detox setup timeout to 4 minutes.
The iOS RN E2E tests are taking slightly around 2 mins to setup causing
flakiness.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Improve RN CI pass rate
### Description
TensorRT 10.4 is GA now, update to 10.4
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
* Add std::numeric_limits for MLFloat16 and BFloat16.
* Update some comments in csharp ORTFloat16.shared.cs.
* Add unit tests (including Clip)
Note that the canonical NaN is not consistent in C++ and C#. C# uses
negative quiet NaN as canonical NaN, while C++ uses positive quiet NaN.
The choice of CSharp Float16.NaN is to be consistent with
System.Half.NaN.
FP16 data returns from CUDA might have 7FFF as NaN; FP16 data from CPU
provider might have 0x7E00 as NaN. Anyway there is no consistent
canonical NaN in ORT right now. Because all these NaNs are aligned with
IEEE spec, there shall not an issue in downstream.
### Motivation and Context
std::numeric_limits is used in codebase but not defined for MLFloat16
and BFloat16. It causes some bugs like
https://github.com/microsoft/onnxruntime/issues/21957 introduced by
https://github.com/microsoft/onnxruntime/pull/21493.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Signed-off-by: Liqun Fu <liqfu@microsoft.com>
### Description
1. changing the emplace to [] that does have a difference, emplace will
only create a new entry if it doesn't already exist in the map
2. change the logic of the caching lookup to key off of input/output
names instead of ort raw ptrs.
3. changes OV tensor creation for CPU allocated input/output ORT
tensors. The CPU allocated input/output tensor path was re-allocating OV
tensors based on the ORT input/output tensors. So we'd get 2 copies: ORT
input/output tensor -> OV tensor (OVEP) -> NPU Tensor (NPU plugin).
---------
Co-authored-by: Eric Crawford <eric.r.crawford@intel.com>
This fixes#22152
### Description
Tensor.fromImage fails in a webworker context, because HTMLCanvasElement
does not exist:
> HTMLCanvasElement is not defined
### Motivation and Context
This fixes#22152
---------
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
### Description
Updates the TransposeOptimizer to also remove empty (DQ -> Q) sequences
that occur at a graph output. An empty DQ->Q sequence results from a
Transpose being optimized out.
Consider the following example model:

The TransposeOptimizer removes the final Transpose and leaves an empty
DQ->Q->output_0 sequence. This PR ensures that the final DQ->Q is also
removed.
### Motivation and Context
Models with quantized output can run on QNN EP. The inference latency of
a customer model is impacted by the unnecessary DQ->Q sequence at the
output.
---------
Co-authored-by: Scott McKay <skottmckay@gmail.com>
### Description
Condition for [BrowserStack support for open-source
projects](https://www.browserstack.com/open-source)
### Motivation and Context
- Considering using BrowserStack for our end-to-end tests for iOS and
Android
### Description
Previously, we only fused (DQ -> Q) into a QNN Convert if the
quantization types differed (e.g., converting uint8 to uint16). This PR
always fuses DQ -> Q regardless of the quantization type because a
single QNN Convert op is faster than two separate ops.
Example fusions:
- [CURRENTLY SUPPORTED] Convert uint8 to uint16:
- `uint8 -> DQ -> Q -> uint16` becomes `uint8 -> Convert -> uint16`
- [CURRENTLY SUPPORTED] Convert uint16 to uint8:
- `uint16 -> DQ -> Q -> uint8` becomes `uint16 -> Convert -> uint8`
- [NEW] Convert uint8 (zp0, scale0) to uint8 (zp1, scale1):
- `uint8(zp0/scale0) -> DQ -> Q -> uint8(zp1/scale1)` becomes
`uint8(zp0/scale0) -> Convert -> uint8(zp1/scale1)`
- [NEW] Convert uint16 (zp0, scale0) to uint16 (zp1, scale1):
- `uint16(zp0/scale0) -> DQ -> Q -> uint16(zp1/scale1)` becomes
`uint16(zp0/scale0) -> Convert -> uint16(zp1/scale1)`
### Motivation and Context
The Transpose optimizer will normally remove empty DQ->Q sequences if
the quantization params are equal. However, for cases in which the
quantization params are not equal, QNN EP should convert DQ->Q to a
single QNN Convert op for performance. This affects a customer model.
Add `MLFloat16` support for:
- `LayerNormalization`
- `SimplifiedLayerNormalization`
- `SkipLayerNormalization`
- `SkipSimplifiedLayerNormalization`
There are existing `LayerNormTest` unit tests that cover the `MLFloat16`
functionality for `LayerNormalization` once `MLFloat16` is registered
(for example
[`LayerNormTest.LayerNorm_Scale_Float16Input`](91c916f9c6/onnxruntime/test/contrib_ops/layer_norm_op_test.cc (L112))).
Similarly, there are unit tests such as
[`SkipLayerNormTest.SkipLayerNormBatch1_Float16`](91c916f9c6/onnxruntime/test/contrib_ops/skiplayernorm_op_test.cc (L255))
that cover MLFloat16 inputs for `SkipLayerNormalization`.
### Description
add OnRunStart() method for Vitis AI execution provider
### Motivation and Context
To dynamically obtain some runtime parameters during execution, use
run_options within the Vitis AI execution provider (EP).
update script which was using deprecated num_bindings to num_io_tensors
tested on an engine dumped by trtexec and loaded the engine using
onnxruntime-gpu 1.19.2 python package.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Your Name <you@example.com>