### Description
Replace gradle/wrapper-validation-action with
gradle/actions/wrapper-validation-action
### Motivation and Context
This is recommended by
https://github.com/gradle/wrapper-validation-action. This job uses
deprecated functionality from the 'gradle/wrapper-validation-action'
action.
### Description
Fix build issues on AIX when using system-installed
protobuf/onnx.
### Motivation and Context
This PR contains the following changes:
1. Fix for the compilation error below.
```
collect2: fatal error: library liblibprotobuf-lite not found
compilation terminated.
```
2. Add the onnx library to the dependency list for test applications.
### Description
If the UseA100 variable is 1, the A100 jobs run in PR checks.
Fixes
[AB#50333](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/50333)
### Motivation and Context
We want more large models that need to run on A100 to be tested in PR
checks, but Azure sometimes decommissions A100 agents without notice,
which blocks merging PRs.
This PR improves on the current workaround of running those jobs only
on the main branch.
Once we find that all A100 agents have been decommissioned by Azure, we can
change the UseA100 variable to 0 to disable the A100 jobs in PR checks.
### Description
Support Float16 for CoreML MLProgram EP.
Operations:
"Add", "Mul", "Sub", "Div", "Pow", "Sqrt", "Reciprocal",
"Sigmoid", "Tanh", "Relu", "LeakyRelu", "Concat", "GridSample",
"GlobalAveragePool",
"Clip", "DepthToSpace", "Resize", "Slice", "Conv",
"ConvTranspose", "GlobalMaxPool", "Gemm", "MatMul",
"AveragePool", "MaxPool", "Reshape", "Split", "Transpose"
### Motivation and Context
---------
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Although `axes` in Unsqueeze is allowed to be a scalar, its shape could not
always be accessed like a vector. This PR fixes issue #22031 so that the
original model runs correctly.
### Description
Enables using the MLTensor to pass data between models.
### Motivation and Context
Using MLTensor instead of ArrayBuffers reduces the number of copies
between the CPU and devices as well as the renderer and GPU process in
Chromium.
### Description
### Motivation and Context
`If` nodes can have sequence outputs. Those nodes are mapped to the DML
EP to be able to keep the outputs on the GPU, but they actually execute
on the CPU by selecting either the `then` subgraph or the `else`
subgraph.
### Description
### Motivation and Context
### Description
Update code regarding some QNN bug fixes:
1. QnnProfile_ExtendedEventData_t.version is not initialized in Qnn
2. Failed to finalize the graph for HardSigmoid with FP16 precision
### Description
Jar maven signing:
- GnuPG
- sha256.
Jar packages artifacts:
- onnxruntime-android-full-aar
- onnxruntime-java
- onnxruntime-java-gpu
### Motivation and Context
Previously, the packages were signed manually.
Goal: sign them automatically.
(1) Fix a parameter-order bug.
(2) Update the benchmark script:
* download the test image if it does not exist
* combine multiple csv files into one file and remove duplicate lines
(3) Add a benchmark section to README.md
### Description
Increase the detox setup timeout to 4 minutes.
The iOS RN E2E tests take around 2 minutes to set up, which causes
flakiness.
### Motivation and Context
Improve RN CI pass rate
### Description
TensorRT 10.4 is now GA; update to 10.4.
### Motivation and Context
### Description
* Add std::numeric_limits for MLFloat16 and BFloat16.
* Update some comments in csharp ORTFloat16.shared.cs.
* Add unit tests (including Clip)
Note that the canonical NaN is not consistent between C++ and C#. C# uses
a negative quiet NaN as the canonical NaN, while C++ uses a positive quiet NaN.
The C# Float16.NaN value was chosen to be consistent with
System.Half.NaN.
FP16 data returned from CUDA might use 0x7FFF as NaN; FP16 data from the CPU
provider might use 0x7E00 as NaN. In any case, there is no consistent
canonical NaN in ORT right now. Because all of these NaNs conform to the
IEEE spec, this should not cause issues downstream.
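As a minimal sketch of why the differing bit patterns are all valid NaNs, the snippet below decodes a few 16-bit patterns as IEEE 754 binary16 values using only the Python standard library. The helper names are hypothetical, and `0xFE00` is included only as an illustrative negative quiet NaN pattern; it is not taken from the ORT sources.

```python
import math
import struct


def half_from_bits(bits: int) -> float:
    """Reinterpret a 16-bit pattern as an IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def is_half_nan(bits: int) -> bool:
    """binary16 NaN: all five exponent bits set and a non-zero mantissa."""
    return (bits & 0x7C00) == 0x7C00 and (bits & 0x03FF) != 0


# 0x7E00 and 0x7FFF are the patterns mentioned above; 0xFE00 is an
# illustrative negative quiet NaN (assumption, for demonstration only).
for bits in (0x7E00, 0x7FFF, 0xFE00):
    value = half_from_bits(bits)
    print(f"0x{bits:04X}: bit check={is_half_nan(bits)}, isnan={math.isnan(value)}")
```

Code that tests for NaN through `isnan`-style checks treats all of these patterns the same way; only code that compares raw bits would notice the difference.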
### Motivation and Context
std::numeric_limits is used in codebase but not defined for MLFloat16
and BFloat16. It causes some bugs like
https://github.com/microsoft/onnxruntime/issues/21957 introduced by
https://github.com/microsoft/onnxruntime/pull/21493.
### Description
### Motivation and Context
Signed-off-by: Liqun Fu <liqfu@microsoft.com>
### Description
1. Change `emplace` to `[]`. This does make a difference: `emplace` will
only create a new entry if the key doesn't already exist in the map.
2. Change the caching lookup to key off of input/output
names instead of ORT raw pointers.
3. Change OV tensor creation for CPU-allocated input/output ORT
tensors. The CPU-allocated input/output tensor path was re-allocating OV
tensors based on the ORT input/output tensors, so we'd get 2 copies: ORT
input/output tensor -> OV tensor (OVEP) -> NPU Tensor (NPU plugin).
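The difference described in item 1 can be illustrated with a rough Python analogy (not the OVEP code): `dict.setdefault` behaves like `emplace` in that it leaves an existing entry untouched, while item assignment behaves like `map[key] = value` and always overwrites. The key and values here are hypothetical.

```python
# Hypothetical cache keyed by an input name, as in item 2.
cache = {"input_0": "old_tensor"}

# Like map.emplace(key, value): inserts only when the key is absent,
# so the stale entry survives.
cache.setdefault("input_0", "new_tensor")
after_emplace_like = cache["input_0"]

# Like map[key] = value: always replaces the existing entry.
cache["input_0"] = "new_tensor"
after_assign = cache["input_0"]

print(after_emplace_like, after_assign)
```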
---------
Co-authored-by: Eric Crawford <eric.r.crawford@intel.com>
This fixes #22152
### Description
Tensor.fromImage fails in a webworker context, because HTMLCanvasElement
does not exist:
> HTMLCanvasElement is not defined
### Motivation and Context
This fixes #22152
---------
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
### Description
Updates the TransposeOptimizer to also remove empty (DQ -> Q) sequences
that occur at a graph output. An empty DQ->Q sequence results from a
Transpose being optimized out.
Consider the following example model:

The TransposeOptimizer removes the final Transpose and leaves an empty
DQ->Q->output_0 sequence. This PR ensures that the final DQ->Q is also
removed.
### Motivation and Context
Models with quantized output can run on QNN EP. The inference latency of
a customer model is impacted by the unnecessary DQ->Q sequence at the
output.
---------
Co-authored-by: Scott McKay <skottmckay@gmail.com>
### Description
Condition for [BrowserStack support for open-source
projects](https://www.browserstack.com/open-source)
### Motivation and Context
- Considering using BrowserStack for our end-to-end tests for iOS and
Android
### Description
Previously, we only fused (DQ -> Q) into a QNN Convert if the
quantization types differed (e.g., converting uint8 to uint16). This PR
always fuses DQ -> Q regardless of the quantization type because a
single QNN Convert op is faster than two separate ops.
Example fusions:
- [CURRENTLY SUPPORTED] Convert uint8 to uint16:
  - `uint8 -> DQ -> Q -> uint16` becomes `uint8 -> Convert -> uint16`
- [CURRENTLY SUPPORTED] Convert uint16 to uint8:
  - `uint16 -> DQ -> Q -> uint8` becomes `uint16 -> Convert -> uint8`
- [NEW] Convert uint8 (zp0, scale0) to uint8 (zp1, scale1):
  - `uint8(zp0/scale0) -> DQ -> Q -> uint8(zp1/scale1)` becomes
    `uint8(zp0/scale0) -> Convert -> uint8(zp1/scale1)`
- [NEW] Convert uint16 (zp0, scale0) to uint16 (zp1, scale1):
  - `uint16(zp0/scale0) -> DQ -> Q -> uint16(zp1/scale1)` becomes
    `uint16(zp0/scale0) -> Convert -> uint16(zp1/scale1)`
### Motivation and Context
The Transpose optimizer will normally remove empty DQ->Q sequences if
the quantization params are equal. However, for cases in which the
quantization params are not equal, QNN EP should convert DQ->Q to a
single QNN Convert op for performance. This affects a customer model.
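To make the distinction concrete, here is a minimal sketch of what a DQ -> Q pair computes for a single uint8 value, following the ONNX DequantizeLinear/QuantizeLinear arithmetic. It does not show how the QNN Convert op is implemented; the function names and sample parameters are hypothetical.

```python
def dequantize(q: int, zero_point: int, scale: float) -> float:
    """DequantizeLinear for one element."""
    return (q - zero_point) * scale


def quantize(x: float, zero_point: int, scale: float,
             qmin: int = 0, qmax: int = 255) -> int:
    """QuantizeLinear for one element; round() rounds half to even."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dq_then_q(q: int, zp0: int, scale0: float, zp1: int, scale1: float) -> int:
    """The DQ -> Q pair collapsed into one requantization step."""
    return quantize(dequantize(q, zp0, scale0), zp1, scale1)


# Equal parameters: the pair is an identity, so it can simply be removed.
print(all(dq_then_q(q, 128, 0.02, 128, 0.02) == q for q in range(256)))

# Different parameters: the pair changes the stored value, so it must be
# kept in some form (for example, as a single conversion op).
print(dq_then_q(200, 128, 0.02, 0, 0.01))
```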
Add `MLFloat16` support for:
- `LayerNormalization`
- `SimplifiedLayerNormalization`
- `SkipLayerNormalization`
- `SkipSimplifiedLayerNormalization`
There are existing `LayerNormTest` unit tests that cover the `MLFloat16`
functionality for `LayerNormalization` once `MLFloat16` is registered
(for example
[`LayerNormTest.LayerNorm_Scale_Float16Input`](91c916f9c6/onnxruntime/test/contrib_ops/layer_norm_op_test.cc (L112))).
Similarly, there are unit tests such as
[`SkipLayerNormTest.SkipLayerNormBatch1_Float16`](91c916f9c6/onnxruntime/test/contrib_ops/skiplayernorm_op_test.cc (L255))
that cover MLFloat16 inputs for `SkipLayerNormalization`.
### Description
Add an OnRunStart() method for the Vitis AI execution provider.
### Motivation and Context
To dynamically obtain some runtime parameters during execution, use
run_options within the Vitis AI execution provider (EP).
Update the script, which was using the deprecated num_bindings, to use
num_io_tensors. Tested on an engine dumped by trtexec and loaded using the
onnxruntime-gpu 1.19.2 Python package.
### Description
### Motivation and Context
---------
Co-authored-by: Your Name <you@example.com>
This makes min and max with NaN for either operand always return NaN for
float16 data, matching the behaviour of float and double.
The behaviour for floats and doubles was previously fixed for the CPU
provider in #21492 and the CUDA provider in #19984, but these PRs didn't
fix the behaviour for float16 due to tests causing asan errors. The
memory access violations with float16 data have now been fixed in
#22135, so this PR is a follow up to make float16 min and max behave the
same as float and double for both the CPU and CUDA providers now that we
can add tests for this.
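A minimal sketch of the intended semantics, in plain Python rather than the CPU or CUDA kernels: if either operand is NaN, the result is NaN. The function names are hypothetical.

```python
import math


def nan_propagating_min(a: float, b: float) -> float:
    """Return NaN if either operand is NaN, otherwise the smaller value."""
    if math.isnan(a) or math.isnan(b):
        return math.nan
    return a if a < b else b


def nan_propagating_max(a: float, b: float) -> float:
    """Return NaN if either operand is NaN, otherwise the larger value."""
    if math.isnan(a) or math.isnan(b):
        return math.nan
    return a if a > b else b


print(nan_propagating_min(1.0, math.nan))
print(nan_propagating_max(math.nan, 2.0))
print(nan_propagating_min(1.0, 2.0))
```

A comparison-only implementation gives order-dependent results because every comparison with NaN is false, which is why the explicit NaN check is needed.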
### Motivation and Context
Relevant previous issues (not float16 specific):
* #21455
* https://github.com/onnx/onnx/issues/6003
### Description
Following from #16578 and #16835 this migrates over
`OnnxTensor.createTensor(<array>)` to first instantiate a
`java.nio.Buffer` and then copy the array into that buffer in Java
before creating the tensor. It also changes the `OnnxTensor.getValue()`
method which returns a multidimensional array so it does the array
construction and value copy in Java. This allows the removal of some
unpleasant recursive C code which repeatedly calls into the JVM to
traverse Java's arrays. The equivalent Java code is still unpleasant and
recursive, but it's easier to reason about and memory safe. As a bonus,
more `OnnxTensor`s are now backed by buffers which allow users to pin
memory and reduce allocations by reusing them for same sized inputs.
Some of the JNI code which parses Java arrays still exists as it's used
by `OnnxMap`, removing that will be the target of a future refactor.
Strings are still processed in JNI as it is easier to work with String
tensors and UTF-8 arrays in C.
### Motivation and Context
Minimizing the amount of JNI code makes it easier to maintain and using
buffers in preference to arrays allows for fewer allocations.
### Description
Add handling of a missing optional axes input to the ROCm reduction ops.
Matches CUDA EP change from #22149
### Motivation and Context
Fix pipeline.
### Description
* Add lintrunner to requirements-lintrunner.txt
* Lock lintrunner and lintrunner-adapter version
* Update documentation
### Motivation and Context
The documentation is out of date.
Composable Kernel build fails under ROCm 6.2.
This PR patches Composable Kernel the same way as
https://github.com/ROCm/composable_kernel/pull/1346
* fix buffer resource to match "s" constraint
* add missing memory clobber
### Description
InstanceNormalization computes `y = scale * (x - mean) /
sqrt(variance + epsilon) + B`, where mean and variance are computed per
instance per channel. Computing the mean and variance per channel is a
reduction, which suits the NCHW layout because adjacent threads can
access contiguous data in GPU memory.
This PR optimizes both NHWC and NCHW InstanceNormalization. To
calculate the mean and variance efficiently, we make sure the
input is NCHW instead of NHWC, then use shared memory to perform the
reduction that produces `channel_scale` and `channel_shift`.
With this PR, `channel_scale` and `channel_shift` are computed the same
way for NHWC and NCHW InstanceNormalization, and their overall performance
is now very close.
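As a rough sketch of the arithmetic (not the GPU kernels), the formula can be folded into one multiply-add per element once the per-channel statistics are known. The definitions of `channel_scale` and `channel_shift` below are an assumption about what those names mean, and the default epsilon is illustrative.

```python
import math


def channel_scale_shift(values, scale, bias, epsilon=1e-5):
    """values: all H*W elements of one (instance, channel) pair.

    Assumed definitions:
      channel_scale = scale / sqrt(variance + epsilon)
      channel_shift = bias - mean * channel_scale
    """
    n = len(values)
    mean = sum(values) / n                                  # reduction 1
    variance = sum((v - mean) ** 2 for v in values) / n     # reduction 2
    channel_scale = scale / math.sqrt(variance + epsilon)
    channel_shift = bias - mean * channel_scale
    return channel_scale, channel_shift


def instance_norm_channel(values, scale, bias, epsilon=1e-5):
    """Apply y = x * channel_scale + channel_shift to one channel."""
    cs, sh = channel_scale_shift(values, scale, bias, epsilon)
    return [v * cs + sh for v in values]


print(instance_norm_channel([1.0, 2.0, 3.0, 4.0], scale=1.0, bias=0.0))
```

Under these definitions, `x * channel_scale + channel_shift` is algebraically the same as `scale * (x - mean) / sqrt(variance + epsilon) + B`, so the elementwise pass no longer depends on the layout used for the reduction.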
Below data comes from SD Turbo profiling results.
Before (InstanceNormalization overall time: 140.84 ms)

Kernel | Time (ms)
-- | --
InstanceNormalization\|InstanceNormComputeMean | 129.70
InstanceNormalization\|InstanceNormalizationNHWC | 10.55
InstanceNormalization\|InstanceNormComputeChannelScaleShift | 0.59

After (InstanceNormalization overall time: 59.44 ms)

Kernel | Time (ms)
-- | --
InstanceNormalization\|InstanceNormComputeChannelScaleShift | 28.57
InstanceNormalization\|TransposeShared | 20.19
InstanceNormalization\|InstanceNormalizationNHWC | 10.68
### Description
Specify the paths of `ar`, `ld` and `libtool` when building the Apple
framework.
### Motivation and Context
Sometimes non-system executables come before the system-provided
ones in the search path. This PR is intended to prevent that.
### Description
Fix an issue where QNN models shared from another session also use the
session logger of the producer session, which causes confusion. Make the
QNN model compute function use the session logger of the current session.
### Description
* Add MultiHeadAttention fusion for SAM2.
* Add LayerNormalization fusion for NCHW format by inserting Transpose
from NCHW to NHWC before layer normalization, and add another Transpose
after layer norm to convert NHWC back to NCHW. Hopefully, those extra
Transpose nodes will be removed when prefer_nhwc is enabled later.
* Add a condition that the input must be 3D when fusing SkipLayerNorm.
* Update convert_to_onnx.py to add `--optimize` and `--use_gpu` options
to output optimized onnx model for CPU/CUDA eps.
* Add an option `--dtype fp16|fp32` in convert_to_onnx.py to support
converting optimized model to float16.
* Update the demo to use the optimized onnx models.
### Motivation and Context
To support optimization of SAM2 for CPU/CUDA eps that is exported in
https://github.com/microsoft/onnxruntime/pull/22119
### Description
When K == 0, output an MxN matrix filled with the bias if present, or
with zeros otherwise.
This brings Gemm in line with MatMul behavior, especially when Gemm is used
to fuse MatMul with Add.
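A minimal pure-Python sketch of this behavior follows. It is a simplification, not the ORT kernel: `alpha`, `beta`, transposition and bias broadcasting are omitted, and the function name is hypothetical. With K == 0 the inner sum runs over nothing, so each output element is just the bias term or zero.

```python
def gemm(a, b, m, n, k, bias=None):
    """a is M x K and b is K x N (nested lists); bias is an optional M x N."""
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            # An empty range (k == 0) makes this sum 0.
            acc = sum(a[i][p] * b[p][j] for p in range(k))
            out[i][j] = acc + (bias[i][j] if bias is not None else 0.0)
    return out


# K == 0: A is 2 x 0 and B is 0 x 3.
empty_a = [[], []]
empty_b = []
print(gemm(empty_a, empty_b, 2, 3, 0))
print(gemm(empty_a, empty_b, 2, 3, 0, bias=[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]))
```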
### Motivation and Context
* Comply with numpy spec of MatMul
* Address a case when empty initializers are used for computation.