### Description
Move C# doc Github Action to Windows machines, to avoid having
dependency on Mono which I think is getting deprecated.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
- Updates pipelines to use QNN SDK 2.28.2.241116.
- Re-enable LayerNormalization unit tests that failed with accuracy
errors with the previous QNN SDK (2.28.0).
- Update QNN EP to no longer provide a dummy bias for LayerNorm if the
QNN SDK version is >= 2.28.0.
### Motivation and Context
Use the latest QNN SDK. This version improves inference latency for
certain customer models.
### Description
<!-- Describe your changes. -->
- Improved Transpose around QLinearSoftmax in Level 3 NHWC Transformer.
- Removed redundant code HandleQLinearConcat, HandleQLinearBinaryOp.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
By merging and eliminating redundant transpose , the Image Segmentation
i8 model (MobileNetv2 + DeepLabv3) achieves a 2.34X speedup.
### Description
Now, ENABLE_CUDA_NHWC_OPS is enabled by default.
It adds a new chance to create cuda provider while both cuda/dml are
enabled
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
A breakdown PR of https://github.com/microsoft/onnxruntime/pull/22651
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Fixes build failure for the cuda minimal build
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
[This change](https://github.com/microsoft/onnxruntime/pull/19470) in
1.20 is causing build failures for the cuda minimal build.
Essentially, some cudnn logic was not guarded by the `USE_CUDA_MINIMAL`.
Also the build is looking for cudnn while in the cuda minimal build it
shouldn't depend on it, resulting in linking error.
cc @gedoensmax @chilo-ms
### Description
OVEP development changes for ORT 1.21 Release
### Motivation and Context
Has critical bug fixes
Support for concurrency execution of models is enabled
Support for OV 2024.5
Memory optimizations for NPU platform
---------
Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>
Co-authored-by: Ankit Maheshkar <ankit.maheshkar@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel.com>
Co-authored-by: saurabhkale17 <saurabh1.kale@intel.com>
Co-authored-by: TejalKhade28 <tejal.khade@intel.com>
Co-authored-by: Javier E. Martinez <javier.e.martinez@intel.com>
### Description
This change is to update the Gradle version within java project to 8.7,
it also upgrades the JAVA to 17. Gradle version from react-native was
also updated to 7.5 to make it compatible with changes from the Java
directory. However, the target java version remains the same. Java
version from these will be upgraded in a separated PR.
This is spited from #22206
### Motivation and Context
This is the first step to upgrade the react native version.
### Description
Allows the QDQ quantizer to handle input models that already have some
pre-quantized weights. In this case, the qdq quantizer will properly
skip/handle the pre-quantized weights.
Also handles an operator (e.g., Conv) with a pre-quantized weight and a
float bias. The tool will read the pre-quantized weight's quantization
scale to compute the bias's scale (`bias_scale = input_scale *
weight_scale`).
Input model (pre-quantized Conv weight):

Output QDQ model (everything is quantized):

### Motivation and Context
Customers may use external tools to quantize some weights (e.g., int4
for Conv/MatMul). The qdq quantizer should still be able to quantize the
rest of the model (float weights and activations) in this case.
### Description
<!-- Describe your changes. -->
It seems after CI updated to py310, numpy got updated to 2.0 and sympy
1.2 failed to cast float numpy array.
Pointing sympy to 1.13 when py>=3.9 and re-enable unit test
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Error: Linux CPU
CI
### Description
A break-down PR of https://github.com/microsoft/onnxruntime/pull/22651
Op API change only.
- add template to functions and classes that support fp32 and fp16
- rename functions, classes and files that support fp32 and fp16 from
SQNBxxx to QNBxxx
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
This PR registers GroupNormalization for opset 21
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Current linux-ci-pipeline was broken due to missing parameters from
`py-packaging-linux-test-cpu.yml` template
### Motivation and Context
Fix Linux CI pipeline
### Description
Update stable diffusion benchmark:
(1) allow IO binding for optimum.
(2) do not use num_images_per_prompt across all engines for fair
comparison.
Example to run benchmark of optimum on stable diffusion 1.5:
```
git clone https://github.com/tianleiwu/optimum
cd optimum
git checkout tlwu/diffusers-io-binding
pip install -e .
pip install -U onnxruntime-gpu
git clone https://github.com/microsoft/onnxruntime
cd onnxruntime/onnxruntime/python/tools/transformers/models/stable_diffusion
git checkout tlwu/benchmark_sd_optimum_io_binding
pip install -r requirements/cuda12/requirements.txt
optimum-cli export onnx --model runwayml/stable-diffusion-v1-5 --task text-to-image ./sd_onnx_fp32
python optimize_pipeline.py -i ./sd_onnx_fp32 -o ./sd_onnx_fp16 --float16
python benchmark.py -e optimum -r cuda -v 1.5 -p ./sd_onnx_fp16
python benchmark.py -e optimum -r cuda -v 1.5 -p ./sd_onnx_fp16 --use_io_binding
```
Example output in H100_80GB_HBM3: 572 ms with IO Binding; 588 ms without
IO Binding; IO binding gains 16ms, or 2.7%,
### Motivation and Context
Optimum is working on enabling I/O binding:
https://github.com/huggingface/optimum/pull/2056. This could help
testing the impact of I/O binding on the performance of the stable
diffusion.
### Description
Enable the ConvReplaceWithQLinear graph optimization when using the ACL
execution provider.
### Motivation and Context
Fixes an issue where quantized Conv nodes followed by ReLU don't get
converted to QLinearConv, so ACL sees the weights as mutable and
therefore cannot run the Conv node.
Signed-off-by: Michael Tyler <michael.tyler@arm.com>
### Description
Making ::p optional in the Linux python CUDA package pipeline
### Motivation and Context
Linux stage from Python-CUDA-Packaging-Pipeline has failed since merge
of #22773
### Description
This PR fixes the spelling of the key value of the GRU operator in the
map in the `GetSupportedNodes` function (Gru -> GRU) and removes the
data type check for the fifth input (sequence_lens) of the GRU operator.
PTAL, thanks!
Add new provider option `trt_op_types_to_exclude`:
- User can provide op type list to be excluded from running on TRT
- e.g. `trt_op_types_to_exclude="MaxPool"`
There is a known performance issue with the DDS ops (NonMaxSuppression,
NonZero and RoiAlign) from TRT versions 10.0 to 10.7. TRT EP excludes
DDS ops from running on TRT by default, user can override default value
with empty string to include all ops.
The performance cost of falling back to the CPU EP is high for several
resampling nodes and causes multiple partitions in SD Turbo and VAE
decoder. Since the asymmetric mode with nearest to floor and integer
scales is identical to half_pixel anyway, stick with the WebNN EP.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
A break down PR of https://github.com/microsoft/onnxruntime/pull/22651
Add fp16 kernels.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
WebNN doesn't provide dedicate op for LRN, use a couple of WebNN ops to
emulate it in WebNN EP:
pow -> transpose -> pad -> averagePool -> transpose -> mul -> add -> pow
-> div
@Honry @fdwr PTAL, thanks!
### Description
This PR registers the following opset 21 operators:
Idenity-21
OlieanrMatmul-21
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
`Module.jsepRegisterMLConstant` will be shorten by Closure Compiler in
offical release, this would cause undefined error.
Fix it by using `Module['jsepRegisterMLConstant']`.
### Description
Fixes a unit test that would fail intermittently due to an existing bug
with Pad (reflect mode). When the number of padded values is >= the
inner dimension size, the ORT Pad implementation accesses invalid
memory. This PR makes the number of padding values less than the inner
dimension size to avoid triggering the bug.
### Motivation and Context
See related issues:
https://github.com/microsoft/onnxruntime/issues/8265https://github.com/microsoft/onnxruntime/issues/11828https://github.com/microsoft/onnxruntime/issues/20801
Here's a valgrind trace obtained on a Linux machine (with
`sess_options.enable_cpu_mem_arena = False`)
```
==864228== Invalid read of size 4
==864228== at 0x2716272A: void onnxruntime::PadInnermostAxis<unsigned int>(unsigned int*, unsigned int*, long, unsigned long) (pad.cc:370)
==864228== by 0x2715D213: onnxruntime::common::Status onnxruntime::PadImpl<unsigned int>(onnxruntime::OpKernelContext*, absl::lts_20240722::InlinedVector<long, 10ul, std::allocator<long> > const&, absl::lts_20240722::InlinedVector<long, 10ul, std::allocator<long> > const&, onnxruntime::Mode const&, unsigned int) (pad.cc:551)
==864228== by 0x2715B2BB: onnxruntime::Pad::Compute(onnxruntime::OpKernelContext*) const (pad.cc:725)
==864228== by 0x276FF6A7: onnxruntime::ExecuteKernel(onnxruntime::StreamExecutionContext&, unsigned long, unsigned long, bool const&, onnxruntime::SessionScope&) (sequential_executor.cc:484)
==864228== by 0x276F4A04: onnxruntime::LaunchKernelStep::Execute(onnxruntime::StreamExecutionContext&, unsigned long, onnxruntime::SessionScope&, bool const&, bool&) (execution_steps.cc:73)
...
```
The above is obtained with the basic Pad(reflect) example on the [ONNX
Pad operator spec
page](https://onnx.ai/onnx/operators/onnx__Pad.html#summary):
```python
data = [
[1.0, 1.2],
[2.3, 3.4],
[4.5, 5.7],
]
pads = [0, 2, 0, 0]
mode = 'reflect'
# Expected output by ONNX spec
expected_output = [
[1.0, 1.2, 1.0, 1.2],
[2.3, 3.4, 2.3, 3.4],
[4.5, 5.7, 4.5, 5.7],
]
# Bugged output from onnxruntime has invalid/uninitialized data for the first element in the inner dimension
# invalid data may be 0.0, inf, nan, etc.
ort_output = [
[inf, 1.2, 1.0, 1.2],
[inf, 3.4, 2.3, 3.4],
[inf, 5.7, 4.5, 5.7],
]
```
The previous PR was reverted because it causes the whole model to
fallback when there is output shape info missing. This PR fixes the
issue by removing redundant fallbacks.
### Description
Fixes command for building Linux python packages by preventing an empty
`-p` command-line option from being passed to a subsequent build script:
1f3b675453/tools/ci_build/github/linux/run_python_dockerbuild.sh (L37)
### Motivation and Context
A recent [PR
](https://github.com/microsoft/onnxruntime/pull/22773)introduced a new
optional command-line option (`-p`) to pass custom python exe paths. We
need to check if the option is empty before forwarding the option to a
separate build script.
### Description
This PR Fix warning - `LegacyKeyValueFormat: "ENV key=value" should be
used instead of legacy "ENV key value" format` from all Dockerfile
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Fixes#22512, MatMul, Add can be fused into a single Gemm even if
tensors dimensions are > 2. The PR excludes that cases.
### Motivation and Context
ORT crashes on valid models due to that unexpected fusion.
### Description
Support to set EPdynamic options in OVEP
### Motivation and Context
relate to https://github.com/microsoft/onnxruntime/pull/22282
---------
Co-authored-by: Javier E. Martinez <javier.e.martinez@intel.com>