### Description
The current Linux CI pipeline was broken due to missing parameters in the
`py-packaging-linux-test-cpu.yml` template.
### Motivation and Context
Fix Linux CI pipeline
### Description
Update the stable diffusion benchmark:
(1) allow I/O binding for optimum;
(2) do not use num_images_per_prompt across all engines, for a fair
comparison.
Example of running the optimum benchmark on stable diffusion 1.5:
```
git clone https://github.com/tianleiwu/optimum
cd optimum
git checkout tlwu/diffusers-io-binding
pip install -e .
pip install -U onnxruntime-gpu
git clone https://github.com/microsoft/onnxruntime
cd onnxruntime/onnxruntime/python/tools/transformers/models/stable_diffusion
git checkout tlwu/benchmark_sd_optimum_io_binding
pip install -r requirements/cuda12/requirements.txt
optimum-cli export onnx --model runwayml/stable-diffusion-v1-5 --task text-to-image ./sd_onnx_fp32
python optimize_pipeline.py -i ./sd_onnx_fp32 -o ./sd_onnx_fp16 --float16
python benchmark.py -e optimum -r cuda -v 1.5 -p ./sd_onnx_fp16
python benchmark.py -e optimum -r cuda -v 1.5 -p ./sd_onnx_fp16 --use_io_binding
```
Example output on H100_80GB_HBM3: 572 ms with I/O binding; 588 ms without
I/O binding. I/O binding saves 16 ms, or 2.7%.
### Motivation and Context
Optimum is working on enabling I/O binding:
https://github.com/huggingface/optimum/pull/2056. This change helps test
the impact of I/O binding on the performance of stable diffusion.
### Description
Enable the ConvReplaceWithQLinear graph optimization when using the ACL
execution provider.
### Motivation and Context
Fixes an issue where quantized Conv nodes followed by ReLU don't get
converted to QLinearConv, so ACL sees the weights as mutable and
therefore cannot run the Conv node.
Signed-off-by: Michael Tyler <michael.tyler@arm.com>
### Description
Making `-p` optional in the Linux python CUDA package pipeline.
### Motivation and Context
The Linux stage of the Python-CUDA-Packaging-Pipeline has failed since
the merge of #22773.
### Description
This PR fixes the spelling of the GRU operator's key in the map in the
`GetSupportedNodes` function (Gru -> GRU) and removes the data type check
for the fifth input (sequence_lens) of the GRU operator.
PTAL, thanks!
Add new provider option `trt_op_types_to_exclude`:
- Users can provide a list of op types to be excluded from running on TRT,
e.g. `trt_op_types_to_exclude="MaxPool"`.
There is a known performance issue with the DDS ops (NonMaxSuppression,
NonZero and RoiAlign) in TRT versions 10.0 to 10.7, so the TRT EP excludes
DDS ops from running on TRT by default; users can override the default
with an empty string to include all ops (see the usage sketch below).
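As a rough illustration, here is how the option could be passed through
the standard ONNX Runtime Python API (the option name comes from this PR;
the model path is a placeholder):
```python
import onnxruntime as ort

# Exclude MaxPool from TRT; an empty string would re-include the DDS ops.
providers = [
    ("TensorrtExecutionProvider", {"trt_op_types_to_exclude": "MaxPool"}),
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]
session = ort.InferenceSession("model.onnx", providers=providers)
```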
The performance cost of falling back to the CPU EP is high for several
resampling nodes and causes multiple partitions in SD Turbo and the VAE
decoder. Since asymmetric mode with nearest-to-floor and integer scales
is identical to half_pixel anyway, keep these nodes on the WebNN EP.
### Description
A break-down PR of https://github.com/microsoft/onnxruntime/pull/22651.
Adds fp16 kernels.
WebNN doesn't provide a dedicated op for LRN, so emulate it in the WebNN
EP with a chain of WebNN ops (see the sketch below):
pow -> transpose -> pad -> averagePool -> transpose -> mul -> add -> pow -> div
@Honry @fdwr PTAL, thanks!
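For intuition, here is a minimal NumPy sketch of that emulation, assuming
NCHW input and standard ONNX LRN semantics
(`y = x / (bias + alpha/size * square_sum)^beta`); the function and
variable names are illustrative, not the EP's actual code:
```python
import numpy as np

def lrn_emulated(x, size=5, alpha=1e-4, beta=0.75, bias=1.0):
    """Emulate LRN for an NCHW input with the op chain above."""
    n, c, h, w = x.shape
    sq = np.power(x, 2.0)                       # pow: square the input
    t = np.transpose(sq, (0, 2, 3, 1))          # transpose: NCHW -> NHWC
    half = (size - 1) // 2
    p = np.pad(t, ((0, 0), (0, 0), (0, 0), (half, size - 1 - half)))  # pad channels
    # averagePool: mean over a window of `size` along the (former channel) axis
    pooled = np.stack([p[..., i:i + size].mean(axis=-1) for i in range(c)], axis=-1)
    back = np.transpose(pooled, (0, 3, 1, 2))   # transpose: NHWC -> NCHW
    denom = np.power(bias + alpha * back, beta) # mul -> add -> pow
    return x / denom                            # div
```
Note that `alpha * mean` equals `alpha/size * square_sum`, so the
averagePool output only needs a multiply by alpha rather than by
alpha/size.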
### Description
This PR registers the following opset 21 operators:
- Identity-21
- QLinearMatMul-21
`Module.jsepRegisterMLConstant` will be shortened by the Closure Compiler
in the official release build, which would cause an undefined error.
Fix it by using `Module['jsepRegisterMLConstant']`; bracket access keeps
the property name from being renamed.
### Description
Fixes a unit test that would fail intermittently due to an existing bug
with Pad (reflect mode). When the number of padded values is >= the
inner dimension size, the ORT Pad implementation accesses invalid
memory. This PR makes the number of padding values less than the inner
dimension size to avoid triggering the bug.
### Motivation and Context
See related issues:
- https://github.com/microsoft/onnxruntime/issues/8265
- https://github.com/microsoft/onnxruntime/issues/11828
- https://github.com/microsoft/onnxruntime/issues/20801
Here's a valgrind trace obtained on a Linux machine (with
`sess_options.enable_cpu_mem_arena = False`)
```
==864228== Invalid read of size 4
==864228== at 0x2716272A: void onnxruntime::PadInnermostAxis<unsigned int>(unsigned int*, unsigned int*, long, unsigned long) (pad.cc:370)
==864228== by 0x2715D213: onnxruntime::common::Status onnxruntime::PadImpl<unsigned int>(onnxruntime::OpKernelContext*, absl::lts_20240722::InlinedVector<long, 10ul, std::allocator<long> > const&, absl::lts_20240722::InlinedVector<long, 10ul, std::allocator<long> > const&, onnxruntime::Mode const&, unsigned int) (pad.cc:551)
==864228== by 0x2715B2BB: onnxruntime::Pad::Compute(onnxruntime::OpKernelContext*) const (pad.cc:725)
==864228== by 0x276FF6A7: onnxruntime::ExecuteKernel(onnxruntime::StreamExecutionContext&, unsigned long, unsigned long, bool const&, onnxruntime::SessionScope&) (sequential_executor.cc:484)
==864228== by 0x276F4A04: onnxruntime::LaunchKernelStep::Execute(onnxruntime::StreamExecutionContext&, unsigned long, onnxruntime::SessionScope&, bool const&, bool&) (execution_steps.cc:73)
...
```
The above is obtained with the basic Pad(reflect) example on the [ONNX
Pad operator spec
page](https://onnx.ai/onnx/operators/onnx__Pad.html#summary):
```python
data = [
    [1.0, 1.2],
    [2.3, 3.4],
    [4.5, 5.7],
]
pads = [0, 2, 0, 0]
mode = 'reflect'

# Expected output by ONNX spec
expected_output = [
    [1.0, 1.2, 1.0, 1.2],
    [2.3, 3.4, 2.3, 3.4],
    [4.5, 5.7, 4.5, 5.7],
]

# Bugged output from onnxruntime has invalid/uninitialized data for the
# first element in the inner dimension; invalid data may be 0.0, inf, nan, etc.
ort_output = [
    [inf, 1.2, 1.0, 1.2],
    [inf, 3.4, 2.3, 3.4],
    [inf, 5.7, 4.5, 5.7],
]
```
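For reference, NumPy's reflect padding reproduces the ONNX-spec expected
output above (a quick independent check, not ORT code):
```python
import numpy as np

data = np.array([[1.0, 1.2], [2.3, 3.4], [4.5, 5.7]])
# ONNX pads = [0, 2, 0, 0] means 2 leading values on axis 1,
# i.e. ((0, 0), (2, 0)) in NumPy's pad-width order.
out = np.pad(data, ((0, 0), (2, 0)), mode="reflect")
print(out)  # [[1.  1.2 1.  1.2] [2.3 3.4 2.3 3.4] [4.5 5.7 4.5 5.7]]
```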
The previous PR was reverted because it caused the whole model to fall
back when output shape info was missing. This PR fixes the issue by
removing redundant fallbacks.
### Description
Fixes command for building Linux python packages by preventing an empty
`-p` command-line option from being passed to a subsequent build script:
1f3b675453/tools/ci_build/github/linux/run_python_dockerbuild.sh (L37)
### Motivation and Context
A recent
[PR](https://github.com/microsoft/onnxruntime/pull/22773) introduced a
new optional command-line option (`-p`) for passing custom python exe
paths. We need to check whether the option is empty before forwarding it
to a separate build script.
### Description
This PR fixes the warning `LegacyKeyValueFormat: "ENV key=value" should be
used instead of legacy "ENV key value" format` in all Dockerfiles (e.g.,
`ENV FOO=bar` instead of `ENV FOO bar`).
### Description
Fixes #22512. MatMul and Add could be fused into a single Gemm even if
tensor dimensions were > 2. This PR excludes those cases.
### Motivation and Context
ORT crashes on valid models due to that unexpected fusion.
### Description
Support setting EP dynamic options in OVEP.
### Motivation and Context
Related to https://github.com/microsoft/onnxruntime/pull/22282
---------
Co-authored-by: Javier E. Martinez <javier.e.martinez@intel.com>
For per-axis quantization/dequantization, WebNN requires the scale and
zero_point inputs to be broadcastable. The axis should be used to reshape
these two inputs (see the sketch below).
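A minimal NumPy sketch of the idea, with illustrative names (this is not
the WebNN EP code): for per-axis quantization on a given axis, the 1-D
scale and zero_point must be reshaped so they broadcast against the input.
```python
import numpy as np

def reshape_for_broadcast(param_1d, input_rank, axis):
    """Reshape a 1-D per-axis scale or zero_point so it broadcasts."""
    shape = [1] * input_rank
    shape[axis] = param_1d.shape[0]
    return param_1d.reshape(shape)

x = np.ones((2, 3, 4, 4), dtype=np.float32)
scale = np.array([0.1, 0.2, 0.3], dtype=np.float32)     # per-channel, axis=1
scale_b = reshape_for_broadcast(scale, x.ndim, axis=1)  # shape (1, 3, 1, 1)
dequantized = x * scale_b                               # now broadcastable
```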
### Description
[VitisAI] Cache node subgraph when necessary
---------
Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>
Co-authored-by: zhenzew <zhenzew@amd.com>
### Description
1. Add an XNNPack build on Linux ARM64.
2. Build only one Python wheel for PR builds.
[AB#49763](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/49763)
### Motivation and Context
Why add an XNNPack build on Linux ARM64 rather than Windows ARM64?
Because KleidiAI doesn't support Windows:
```
IF(XNNPACK_TARGET_PROCESSOR STREQUAL "arm64" AND XNNPACK_ENABLE_ARM_I8MM AND NOT CMAKE_C_COMPILER_ID STREQUAL "MSVC")
  IF (XNNPACK_ENABLE_KLEIDIAI)
    MESSAGE(STATUS "Enabling KleidiAI for Arm64")
  ENDIF()
ELSE()
  SET(XNNPACK_ENABLE_KLEIDIAI OFF)
ENDIF()
```
---------
### Description
Ignore all whitespace lint messages for cpplint. Remove redundant
configs in dml/.
### Motivation and Context
They are handled automatically by clang-format and create too much noise
in the PR files tab.
### Description
Adds `reduce_range` option to `get_qdq_config()`
### Motivation and Context
Makes it easier to set this option when calling `get_qdq_config()`.
Otherwise, the user has to set the option manually (see the sketch below).
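A hypothetical usage sketch (the model paths and `data_reader` are
placeholders; `reduce_range` is the option added by this PR):
```python
from onnxruntime.quantization import get_qdq_config, quantize

# data_reader is assumed to be a CalibrationDataReader over representative inputs.
qdq_config = get_qdq_config(
    "model_fp32.onnx",
    data_reader,
    reduce_range=True,  # new option: use a reduced (7-bit) quantization range
)
quantize("model_fp32.onnx", "model_qdq.onnx", qdq_config)
```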
### Description
This PR makes the MatMul shaders independent of the inputs' broadcasting
pattern: they depend only on the input ranks, with shapes provided in
uniforms. This fixes an issue where the shader code differed for
different broadcasting patterns but had an identical cache key, which
resulted in wrong cache hits (see the conceptual sketch below).
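Conceptually (an illustrative sketch, not the actual WebGPU EP code), a
shader cache key must capture everything that changes the generated
shader source; once shapes come in via uniforms, only the ranks belong in
the key:
```python
def matmul_shader_cache_key(a_rank: int, b_rank: int) -> str:
    # Shapes (and hence the broadcasting pattern) are supplied at runtime
    # through uniforms, so they no longer affect the shader source and
    # must not be part of the cache key.
    return f"MatMul|rank_a={a_rank}|rank_b={b_rank}"
```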
### Description
Fix a build error seen with GCC 11 when building at Homebrew on our
Linux x86_64 Ubuntu 22.04 CI (GitHub Actions runner).
### Motivation and Context
When building latest v1.20.0 at Homebrew
(https://github.com/Homebrew/homebrew-core/pull/196547), we hit a build
failure with GCC 11:
```
[ 65%] Building CXX object CMakeFiles/onnxruntime_optimizer.dir/tmp/onnxruntime-20241103-6403-lh3bwj/onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc.o
/home/linuxbrew/.linuxbrew/Homebrew/Library/Homebrew/shims/linux/super/g++-11 -DCPUINFO_SUPPORTED_PLATFORM=1 -DEIGEN_MPL2_ONLY -DEIGEN_USE_THREADS -DENABLE_CPU_FP16_TRAINING_OPS -DHAS_STRING_VIEW=1 -DNSYNC_ATOMIC_CPP11 -DONLY_C_LOCALE=0 -DONNX_ML=1 -DONNX_NAMESPACE=onnx -DORT_ENABLE_STREAM -DORT_NO_RTTI -DPLATFORM_POSIX -DPROTOBUF_USE_DLLS -D_GNU_SOURCE -I/tmp/onnxruntime-20241103-6403-lh3bwj/build/_deps/utf8_range-src -I/tmp/onnxruntime-20241103-6403-lh3bwj/include/onnxruntime -I/tmp/onnxruntime-20241103-6403-lh3bwj/include/onnxruntime/core/session -I/tmp/onnxruntime-20241103-6403-lh3bwj/build/_deps/pytorch_cpuinfo-src/include -I/tmp/onnxruntime-20241103-6403-lh3bwj/build -I/tmp/onnxruntime-20241103-6403-lh3bwj/onnxruntime -I/tmp/onnxruntime-20241103-6403-lh3bwj/build/_deps/onnx-src -I/tmp/onnxruntime-20241103-6403-lh3bwj/build/_deps/onnx-build -ffunction-sections -fdata-sections -Wno-restrict -DCPUINFO_SUPPORTED -O3 -DNDEBUG -fPIC -fno-rtti -Wall -Wextra -Wno-deprecated-copy -Wno-tautological-pointer-compare -Wno-nonnull-compare -Wno-ambiguous-reversed-operator -Wno-deprecated-anon-enum-enum-conversion -Wno-undefined-var-template -Wno-deprecated-builtins -Wshorten-64-to-32 -Werror -MD -MT CMakeFiles/onnxruntime_optimizer.dir/tmp/onnxruntime-20241103-6403-lh3bwj/onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc.o -MF CMakeFiles/onnxruntime_optimizer.dir/tmp/onnxruntime-20241103-6403-lh3bwj/onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc.o.d -o CMakeFiles/onnxruntime_optimizer.dir/tmp/onnxruntime-20241103-6403-lh3bwj/onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc.o -c /tmp/onnxruntime-20241103-6403-lh3bwj/onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc
/tmp/onnxruntime-20241103-6403-lh3bwj/onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc: In function ‘void onnx_transpose_optimization::Permute1DConstant(onnx_transpose_optimization::api::GraphRef&, onnx_transpose_optimization::api::NodeRef&, onnx_transpose_optimization::api::TensorRef&, size_t, std::string_view, const std::vector<long int>&)’:
/tmp/onnxruntime-20241103-6403-lh3bwj/onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc:1114:10: error: ‘memcpy’ is not a member of ‘std’; did you mean ‘wmemcpy’?
1114 | std::memcpy(dst, src, bytes_per_val);
| ^~~~~~
| wmemcpy
```
It is possible this error may not occur on different GCC versions if
`cstring` has been indirectly included by another header.
### Description
This PR sets the default python to 3.10 everywhere except
tools/ci_build/github/azure-pipelines/bigmodels-ci-pipeline.yml. This is
needed because we are no longer using python 3.8.
This PR excludes changes for the Big Models CI because it will require
additional changes, which will be tracked in USER STORY 52729.
### Description
With recent changes, the build error below is seen under AIX.
```
ld: 0706-012 The -p flag is not recognized.
ld: 0706-012 The -a flag is not recognized.
ld: 0706-012 The -t flag is not recognized.
ld: 0706-012 The -h flag is not recognized.
ld: 0706-012 The -= flag is not recognized.
ld: 0706-012 The -$ flag is not recognized.
ld: 0706-012 The -$ flag is not recognized.
ld: 0706-012 The -O flag is not recognized.
ld: 0706-027 The -R IGIN flag is ignored.
collect2: error: ld returned 255 exit status
```
### Motivation and Context
The AIX linker doesn't support the -rpath option, so this option is
blocked under AIX.
### Description
Skip `MatMulIntegerToFloat` fusion for the DML EP in cases where the
model uses quantization before `MatMulInteger`. This is mainly done to
be resource efficient, and we have better `MatMulInteger` metacommand
coverage, which computes in the int data type.
This CL makes the WebGPU backend support subgroup features, allowing
subgroup optimizations in the future.
### Description
With this CL, the WebGPU backend will create devices with the subgroups
and subgroups-f16 features (both under origin trial in Chrome) or the
chromium-experimental-subgroups feature enabled whenever available.
### Motivation and Context
This CL allows WebGPU operator shaders to use subgroup optimizations in
the future, which might yield significant speedups.
### Description
* Update CI with TRT 10.6
* Update oss parser to [10.6-GA-ORT-DDS](https://github.com/onnx/onnx-tensorrt/tree/10.6-GA-ORT-DDS) and update dependency version
* Update Py-cuda11 CI to use TRT 10.6
### Motivation and Context
(A third PR will follow to further reduce trt_version hardcoding.)
### Description
Updates python quantization tool:
- Ensures QDQ Pad has equal quantization parameters across input and
output for certain Pad configurations.
- Ensures QDQ Slice always has equal quantization parameters across
input and output.
- Fixes a bug when Softmax is _excluded_ from quantization.
### Motivation and Context
QDQ Pad and Slice have lower latency on QNN EP when their quantization
parameters are equal.
### Description
- Changes the E2E iOS tests to run in BrowserStack instead of App Center
- Steps for running locally can be found in the OneNote
### Motivation and Context
- Follow-up of #22117
- App Center (the previous platform for running E2E mobile tests) is
getting deprecated in 2025
### Misc info
Additional build steps were required to get the necessary testing
artifacts for BrowserStack. App Center consumed an entire folder, while
BrowserStack requests the following:
1. a ZIP file of all the tests
2. an IPA file of the test app
#### Flow
Here is a rough outline of what is happening in the pipeline:
1. The build_and_assemble_apple_pods.py script builds the relevant
frameworks (currently, this means packages for iOS and Mac).
2. The test_apple_packages.py script installs the necessary CocoaPods for
later steps.
3. The Xcode build-for-testing task builds the iOS target for the test
app.
4. Now that the test app and the tests have been built, we can zip them,
creating the tests .zip file.
5. To create the IPA file, we need to create a .plist XML file, which is
generated by the generate_plist.py script.
   - Attempts to use the Xcode@5 task to automatically generate the plist
file failed.
   - Also, building for testing generates some plist files -- these cannot
be used to export an IPA file.
6. We run the Xcode task to build an .xcarchive file, which is required
for creating an IPA file.
7. We use xcodebuild in a script step to build an IPA file with the
xcarchive and plist files from the last two steps.
8. Finally, we can run the tests using the BrowserStack script.
---------
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
Fixes a scenario in which a bias input quantized to int32 has a scale
that is too small. A bias with a scale smaller than a certain threshold
overflows the int32 range when quantized, which significantly decreases
accuracy.
Credit to @yihonglyu for finding this issue and the fix.
### Motivation and Context
Consider the following Convolution with very small weights and a
constant bias input of `[5, -4.5]`.

The QDQ quantizer first computes the following quantization scales for
`input_0` and `weight`:
- `input_0`: scale=0.5
- `weight`: scale=7.843e-11 **[really small]**
The QDQ quantizer then computes the bias input's scale as follows:
```
bias_scale = input_0_scale * weight_0_scale = 0.5 * 7.843e-11 = 3.9215686274509805e-11
```
This `bias_scale` is too small. Before this PR, the QDQ quantizer would
quantize the f32 bias with this `bias_scale`:
```
bias_quant = round(bias_f32 / bias_scale) = round([5.0/bias_scale, -4.5/bias_scale]) = [127500000000, -114750000000]
```
These quantized bias values exceed the range of int32, and so are
clipped to [int32.min(), int32.max()], which is very inaccurate.
#### New approach
This PR increases the `weight_0_scale` by the necessary amount to ensure
that `bias_scale` (which equals `weight_0_scale * input_0_scale`) is
appropriate for the int32 quantization type.
The smallest valid bias scale is given by the normal scale formula:
`bias_smallest_valid_scale = (bias_f32_max - bias_f32_min) / (int32_max - int32_min)`
Then, we compute the candidate bias scale:
`bias_scale_candidate = input_0_scale * weight_0_scale`
If the candidate scale is smaller than the smallest valid scale, we
increase the `weight_0_scale` by the necessary ratio:
```python
if bias_scale_candidate < bias_smallest_valid_scale:
    ratio = bias_smallest_valid_scale / bias_scale_candidate
    weight_0_scale = ratio * weight_0_scale
```
Then, we recompute the final bias scale:
```python
bias_scale = input_0_scale * weight_0_scale
```
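Putting the steps together, here is a runnable sketch with this example's
numbers (variable names are illustrative, not the quantizer's internals):
```python
import numpy as np

int32_min, int32_max = np.iinfo(np.int32).min, np.iinfo(np.int32).max
input_0_scale = 0.5
weight_0_scale = 7.843e-11  # the really small weight scale from the example
bias_f32 = np.array([5.0, -4.5])

# Smallest bias scale for which the bias range still fits into int32.
bias_smallest_valid_scale = (bias_f32.max() - bias_f32.min()) / (int32_max - int32_min)

bias_scale_candidate = input_0_scale * weight_0_scale
if bias_scale_candidate < bias_smallest_valid_scale:
    # Widen the weight scale so the derived bias scale becomes valid.
    ratio = bias_smallest_valid_scale / bias_scale_candidate
    weight_0_scale = ratio * weight_0_scale

bias_scale = input_0_scale * weight_0_scale
print(bias_scale >= bias_smallest_valid_scale)  # True: quantized bias fits int32
```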
#### Impact on accuracy
Here's the above model's quantized output compared to the f32
(ground-truth) output.
- Before PR:
- f32 model output[0]: **5.0f**
- qdq model output[0]: **0.075**
- SNR: 0.1369 (higher is better)
- After PR:
- f32 model output[0]: **5.0f**
- qdq model output[0]: **4.992**
- SNR: 55.656 (higher is better)
### Description
Introduces the `get_qdq_config()` function to get a quantization
configuration for a full integer QDQ model. This function provides an
easier way of specifying commonly used options and sets convenient
defaults. Specifically:
- Instead of requiring the user to pass a dictionary of `extra_options`,
the new interface adds function parameters for common settings:
  - All calibrator settings
  - Whether activations/weights are symmetric
  - Whether to keep or fuse relu/clip into Q
  - Minimum real range for quantization
  - Dictionary of tensor quantization overrides.
- Automatically scans the input floating-point model and fills out the
operator types to quantize. Otherwise, only a limited number of operator
types would be quantized by default.
- Detects if the input model uses external data. If so, ensures that the
generated QDQ model also uses external data.
- Detects if the model will use newly introduced quantization types
(int4/int16) with an older opset. If so, forces the use of the
`com.microsoft` domain for Q/DQ ops, which support all types.
- Automatically enables the "extra option" called
`ForceQuantizeNoInputCheck` to ensure data movement operators (e.g.,
Transpose) are always quantized.
- User can pass a function to indicate which nodes to exclude from
quantization.
- The user can still pass their own `extra_options` to override any of
the above if necessary.
```python
from onnxruntime.quantization import (
    CalibrationMethod,
    QuantType,
    get_qdq_config,
    quantize,
)

# Get QDQ configuration
qdq_config = get_qdq_config(
    float_model,
    data_reader,
    calibrate_method=CalibrationMethod.Percentile,
    calibrate_args={"percentile": 99.98},  # Converted to extra_options
    activation_type=QuantType.QUInt8,
    weight_type=QuantType.QInt8,
    per_channel=True,
    nodes_to_exclude=["Mul"],  # Could also be a function. Ex: `lambda model, node: node.op_type == "Softmax"`
    # Other options converted to extra_options:
    min_real_range=0.0001,
    keep_removable_activations=True,
    activation_symmetric=True,
    weight_symmetric=True,
)

# Quantize model
quantize(float_model_path, qdq_model_path, qdq_config)
```
### Motivation and Context
Need a version of `get_qnn_qdq_config()` that is not EP-specific.