Commit graph

11989 commits

Author SHA1 Message Date
Chi Lo
fa4cbcd36b
[TensorRT EP] Add new provider option to exclude nodes from running on TRT (#22681)
Add new provider option `trt_op_types_to_exclude`:
- User can provide op type list to be excluded from running on TRT
- e.g. `trt_op_types_to_exclude="MaxPool"`

There is a known performance issue with the DDS ops (NonMaxSuppression,
NonZero and RoiAlign) from TRT versions 10.0 to 10.7. TRT EP excludes
DDS ops from running on TRT by default, user can override default value
with empty string to include all ops.
2024-11-13 11:34:43 -08:00
shiyi
3adcf4d714
[WebNN] Remove validation for coordinate_transformation_mode (#22811)
The performance cost of falling back to the CPU EP is high for several
resampling nodes and causes multiple partitions in SD Turbo and VAE
decoder. Since the asymmetric mode with nearest to floor and integer
scales is identical to half_pixel anyway, stick with the WebNN EP.
2024-11-13 11:12:00 -08:00
Xu Xing
ff57ac4f3d
[js/webgpu] Add scatterND (#22755)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-11-13 09:13:00 -08:00
liqun Fu
bc2b1b5e37
Fix issue #22796 - a typo: (__GNUC__ > 9) -> (__GNUC__ > 10) (#22807)
### Description
fix #22796 
Signed-off-by: liqunfu <liqun.fu@microsoft.com>
2024-11-12 18:56:35 -08:00
Xiang Zhang
69a36eb231
Revert Implement DML copy for Lora Adapters (#22814)
Revert https://github.com/microsoft/onnxruntime/pull/22396
2024-11-12 17:45:59 -05:00
Jing Fang
7fa69461fd
[ARM] MatMulNBits FP16 support - kernels only (#22806)
### Description
A break down PR of https://github.com/microsoft/onnxruntime/pull/22651
Add fp16 kernels.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-11-12 14:28:47 -08:00
Jiajia Qin
7e0dd9d433
[js/webgpu] Optimize Expand (#22752)
Use components = 4 if possible.

llama3.2-1B becomes 20 tokens/s from 18 tokens/s on my iGPUs.
2024-11-12 12:37:19 -08:00
Jiajia Qin
05c8dc9d1c
[js/webgpu] Optimize ConvTranspose (#22774)
BUG #22031 

The overall time of ConvTranspose in Demucs model becomes 517.41 ms from
1415.65 ms on my iGPUs.
2024-11-12 12:37:07 -08:00
Bin Miao
67f5be0da2
[WebNN EP] Support LRN operator (#22775)
WebNN doesn't provide dedicate op for LRN, use a couple of WebNN ops to
emulate it in WebNN EP:
pow -> transpose -> pad -> averagePool -> transpose -> mul -> add -> pow
-> div
@Honry @fdwr PTAL, thanks!
2024-11-12 11:53:52 -08:00
junchao-zhao
fd5b1a18ee
Fix LARCH64 compile error (#22759)
### Description

Currently loongarch has not implemented AIsSigned qgemm, so I added
bypass for it
2024-11-12 11:47:43 -08:00
Jian Chen
75a44582ba
Update all JDK version to 17 (#22786) 2024-11-12 11:42:18 -08:00
Ted Themistokleous
2b0f3435d2
[MIGraphX EP] Add support for Gelu, BiasGelu, FastGelu operators (#22808)
### Description
Adds support for different flavours of gelu already supported in
MIGraphX
2024-11-12 11:04:15 -08:00
dtang317
9836ef1c89
register Identity and QLinearMatmul for opset21 (#22804)
### Description
This PR registers the following opset 21 operators:

Idenity-21
OlieanrMatmul-21



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-11-12 09:36:19 -08:00
amarin16
f0ac5e0d3d
Update skip layer norm (#22719)
Update the `SkipLayerNorm` implementation to address issues.
2024-11-12 07:01:30 -08:00
Wanming Lin
cdc8db9984
[WebNN] Fixed WebNN Module undefined issue (#22795)
`Module.jsepRegisterMLConstant` will be shorten by Closure Compiler in
offical release, this would cause undefined error.

Fix it by using `Module['jsepRegisterMLConstant']`.
2024-11-11 21:31:24 -08:00
Adrian Lizarraga
0ad44d0f79
[Quant Tool] Flaky test due to Pad reflect bug (#22798)
### Description
Fixes a unit test that would fail intermittently due to an existing bug
with Pad (reflect mode). When the number of padded values is >= the
inner dimension size, the ORT Pad implementation accesses invalid
memory. This PR makes the number of padding values less than the inner
dimension size to avoid triggering the bug.


### Motivation and Context
See related issues:
https://github.com/microsoft/onnxruntime/issues/8265
https://github.com/microsoft/onnxruntime/issues/11828
https://github.com/microsoft/onnxruntime/issues/20801

Here's a valgrind trace obtained on a Linux machine (with
`sess_options.enable_cpu_mem_arena = False`)
```
==864228== Invalid read of size 4
==864228==    at 0x2716272A: void onnxruntime::PadInnermostAxis<unsigned int>(unsigned int*, unsigned int*, long, unsigned long) (pad.cc:370)
==864228==    by 0x2715D213: onnxruntime::common::Status onnxruntime::PadImpl<unsigned int>(onnxruntime::OpKernelContext*, absl::lts_20240722::InlinedVector<long, 10ul, std::allocator<long> > const&, absl::lts_20240722::InlinedVector<long, 10ul, std::allocator<long> > const&, onnxruntime::Mode const&, unsigned int) (pad.cc:551)
==864228==    by 0x2715B2BB: onnxruntime::Pad::Compute(onnxruntime::OpKernelContext*) const (pad.cc:725)
==864228==    by 0x276FF6A7: onnxruntime::ExecuteKernel(onnxruntime::StreamExecutionContext&, unsigned long, unsigned long, bool const&, onnxruntime::SessionScope&) (sequential_executor.cc:484)
==864228==    by 0x276F4A04: onnxruntime::LaunchKernelStep::Execute(onnxruntime::StreamExecutionContext&, unsigned long, onnxruntime::SessionScope&, bool const&, bool&) (execution_steps.cc:73)
...
```

The above is obtained with the basic Pad(reflect) example on the [ONNX
Pad operator spec
page](https://onnx.ai/onnx/operators/onnx__Pad.html#summary):

```python
data = [
    [1.0, 1.2],
    [2.3, 3.4],
    [4.5, 5.7],
]

pads = [0, 2, 0, 0]

mode = 'reflect'

# Expected output by ONNX spec
expected_output = [
    [1.0, 1.2, 1.0, 1.2],
    [2.3, 3.4, 2.3, 3.4],
    [4.5, 5.7, 4.5, 5.7],
]

# Bugged output from onnxruntime has invalid/uninitialized data for the first element in the inner dimension
# invalid data may be 0.0, inf, nan, etc.
ort_output = [
    [inf, 1.2, 1.0, 1.2],
    [inf, 3.4, 2.3, 3.4],
    [inf, 5.7, 4.5, 5.7],
]
```
2024-11-11 19:49:27 -08:00
shiyi
f7d1f0fc5e
Reland "[WebNN] Fallback the node when its output doesn't have shape info" (#22685)
The previous PR was reverted because it causes the whole model to
fallback when there is output shape info missing. This PR fixes the
issue by removing redundant fallbacks.
2024-11-11 16:30:10 -08:00
Adrian Lizarraga
b1e0930eab
Fix build for linux python wheel (#22801)
### Description
Fixes command for building Linux python packages by preventing an empty
`-p` command-line option from being passed to a subsequent build script:
1f3b675453/tools/ci_build/github/linux/run_python_dockerbuild.sh (L37)



### Motivation and Context
A recent [PR
](https://github.com/microsoft/onnxruntime/pull/22773)introduced a new
optional command-line option (`-p`) to pass custom python exe paths. We
need to check if the option is empty before forwarding the option to a
separate build script.
2024-11-11 15:20:07 -08:00
Jian Chen
885a7acd45
Fix warning - LegacyKeyValueFormat: "ENV key=value" should be used instead of legacy "ENV key value" format (#22800)
### Description
This PR Fix warning - `LegacyKeyValueFormat: "ENV key=value" should be
used instead of legacy "ENV key value" format` from all Dockerfile



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-11-11 13:05:34 -08:00
Xavier Dupré
1f3b675453
Fix MatMulBnFusion to exclude cases when tensors are not 2D tensors (#22762)
### Description
Fixes #22512, MatMul, Add can be fused into a single Gemm even if
tensors dimensions are > 2. The PR excludes that cases.



### Motivation and Context
ORT crashes on valid models due to that unexpected fusion.
2024-11-11 19:48:25 +01:00
Dmitri Smirnov
c5276ac448
Revert "enable serialize prepacked weights into data file (#22256)" (#22788)
This reverts commit c5b6be045f.

### Description
Revert

### Motivation and Context
This needs simpler and more robust approach
2024-11-11 09:59:05 -08:00
sheetalarkadam
e8f1d73b0b
Add Android QNN Browserstack test (#22434)
Add Android QNN Browserstack test



### Motivation and Context
Real device test in CI
2024-11-10 16:10:29 -08:00
Preetha Veeramalai
c9ed016b12
OVEP Dynamic WorkloadType support (#22779)
### Description
Support to set EPdynamic options in OVEP

### Motivation and Context
relate to https://github.com/microsoft/onnxruntime/pull/22282

---------

Co-authored-by: Javier E. Martinez <javier.e.martinez@intel.com>
2024-11-09 23:26:29 -08:00
shiyi
63cb53257b
[WebNN] Support steps >= 1 for slice operator (#22708)
Co-authored-by: Wanming Lin <wanming.lin@intel.com>
2024-11-09 18:20:52 -08:00
Wanming Lin
b9b1a0353a
[WebNN] QDQ's axis should be used for broadcasting (#22721)
For per-axis quantization/dequantization, WebNN requires the scale and
zero_point inputs to be broadcastable. Axis should be used for reshape
these two inputs.
2024-11-09 18:19:46 -08:00
zz002
d3ad76b2cf
[VitisAI] Cache node subgraph when necessary (#22073)
### Description
<!-- Describe your changes. -->

[VitisAI] Cache node subgraph when necessary

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>
Co-authored-by: zhenzew <zhenzew@amd.com>
2024-11-08 23:17:16 -08:00
Yi Zhang
ef281f850a
Add XNNPack build on Linux ARM64 and improve Linux CPU (#22773)
### Description
1. Add XNNPack build on Linux ARM64
2. Build only one python wheel for PR request.

[AB#49763](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/49763)



### Motivation and Context
Why I add xnnpack build on Linux ARM64  rather than Windows ARM64.
Becuase KleidiAI  doesn't support Windows

```
IF(XNNPACK_TARGET_PROCESSOR STREQUAL "arm64" AND XNNPACK_ENABLE_ARM_I8MM AND NOT CMAKE_C_COMPILER_ID STREQUAL "MSVC")
  IF (XNNPACK_ENABLE_KLEIDIAI)
    MESSAGE(STATUS "Enabling KleidiAI for Arm64")
  ENDIF()
ELSE()
  SET(XNNPACK_ENABLE_KLEIDIAI OFF)
ENDIF()
```

---------
2024-11-09 11:26:19 +08:00
Justin Chu
a8539ec7d1
Ignore all whitespace lint messages for cpplint (#22781)
### Description

Ignore all whitespace lint messages for cpplint. Remove redundant
configs in dml/.

### Motivation and Context

They are handled automatically by clang-format and creates too much
noise in the PR files tab.
2024-11-08 14:31:28 -08:00
Adrian Lizarraga
020d52d92c
[Quant Tool] Add reduce_range option to get_qdq_config() (#22782)
### Description
Adds `reduce_range` option to `get_qdq_config()`



### Motivation and Context
Make it easier to set this option when calling get_qdq_config().
Otherwise, user has to set the option manually.
2024-11-08 14:04:11 -08:00
xhcao
b5ee4ac760
[js/webgpu] support GridSample operator (#22652)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-11-08 11:02:36 -08:00
jzm-intel
d9b91682f1
WebGPU JSEP: Make shader code not depend on input broadcasting patterns (#22536)
This PR make MatMul shaders not depend on inputs broadcasting pattern,
but only depend on input ranks and their shape provided in uniform. This
change fix the issue that currently shaders code are different for
different broadcasting, but have identical cache key and results in
wrong cache hit.
2024-11-08 11:00:51 -08:00
Michael Cho
4d614e15bd
Fix build with GCC 11 (#22770)
### Description

Fix a build error seen with GCC 11 when building at Homebrew on our
Linux x86_64 Ubuntu 22.04 CI (GitHub action runner).


### Motivation and Context

When building latest v1.20.0 at Homebrew
(https://github.com/Homebrew/homebrew-core/pull/196547), we hit a build
failure with GCC 11:
```
 [ 65%] Building CXX object CMakeFiles/onnxruntime_optimizer.dir/tmp/onnxruntime-20241103-6403-lh3bwj/onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc.o
  /home/linuxbrew/.linuxbrew/Homebrew/Library/Homebrew/shims/linux/super/g++-11 -DCPUINFO_SUPPORTED_PLATFORM=1 -DEIGEN_MPL2_ONLY -DEIGEN_USE_THREADS -DENABLE_CPU_FP16_TRAINING_OPS -DHAS_STRING_VIEW=1 -DNSYNC_ATOMIC_CPP11 -DONLY_C_LOCALE=0 -DONNX_ML=1 -DONNX_NAMESPACE=onnx -DORT_ENABLE_STREAM -DORT_NO_RTTI -DPLATFORM_POSIX -DPROTOBUF_USE_DLLS -D_GNU_SOURCE -I/tmp/onnxruntime-20241103-6403-lh3bwj/build/_deps/utf8_range-src -I/tmp/onnxruntime-20241103-6403-lh3bwj/include/onnxruntime -I/tmp/onnxruntime-20241103-6403-lh3bwj/include/onnxruntime/core/session -I/tmp/onnxruntime-20241103-6403-lh3bwj/build/_deps/pytorch_cpuinfo-src/include -I/tmp/onnxruntime-20241103-6403-lh3bwj/build -I/tmp/onnxruntime-20241103-6403-lh3bwj/onnxruntime -I/tmp/onnxruntime-20241103-6403-lh3bwj/build/_deps/onnx-src -I/tmp/onnxruntime-20241103-6403-lh3bwj/build/_deps/onnx-build -ffunction-sections -fdata-sections -Wno-restrict  -DCPUINFO_SUPPORTED -O3 -DNDEBUG -fPIC -fno-rtti -Wall -Wextra -Wno-deprecated-copy -Wno-tautological-pointer-compare -Wno-nonnull-compare -Wno-ambiguous-reversed-operator -Wno-deprecated-anon-enum-enum-conversion -Wno-undefined-var-template -Wno-deprecated-builtins -Wshorten-64-to-32 -Werror -MD -MT CMakeFiles/onnxruntime_optimizer.dir/tmp/onnxruntime-20241103-6403-lh3bwj/onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc.o -MF CMakeFiles/onnxruntime_optimizer.dir/tmp/onnxruntime-20241103-6403-lh3bwj/onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc.o.d -o CMakeFiles/onnxruntime_optimizer.dir/tmp/onnxruntime-20241103-6403-lh3bwj/onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc.o -c /tmp/onnxruntime-20241103-6403-lh3bwj/onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc
  /tmp/onnxruntime-20241103-6403-lh3bwj/onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc: In function ‘void onnx_transpose_optimization::Permute1DConstant(onnx_transpose_optimization::api::GraphRef&, onnx_transpose_optimization::api::NodeRef&, onnx_transpose_optimization::api::TensorRef&, size_t, std::string_view, const std::vector<long int>&)’:
  /tmp/onnxruntime-20241103-6403-lh3bwj/onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc:1114:10: error: ‘memcpy’ is not a member of ‘std’; did you mean ‘wmemcpy’?
   1114 |     std::memcpy(dst, src, bytes_per_val);
        |          ^~~~~~
        |          wmemcpy
```

It is possible this error may not occur on different GCC versions if
`cstring` has been indirectly included by another header.
2024-11-07 21:04:57 -08:00
Jian Chen
e7987a6b0b
Replace reference to python 3.8 with python 3.10 (#22692)
### Description
This PR will set default python to 3.10 except
tools/ci_build/github/azure-pipelines/bigmodels-ci-pipeline.yml. This is
needed because we are no longer using python 3.8

This PR excludes changes for Big Models CI, because it will require
additional changes. Which will be track in
USER STORY 52729



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-11-07 16:51:40 -08:00
Ranjit Ranjan
193671295e
[AIX] Fix for AIX build break (#22745)
### Description
With recent changes, below build error is found under AIX. 

```
ld: 0706-012 The -p flag is not recognized.
ld: 0706-012 The -a flag is not recognized.
ld: 0706-012 The -t flag is not recognized.
ld: 0706-012 The -h flag is not recognized.
ld: 0706-012 The -= flag is not recognized.
ld: 0706-012 The -$ flag is not recognized.
ld: 0706-012 The -$ flag is not recognized.
ld: 0706-012 The -O flag is not recognized.
ld: 0706-027 The -R IGIN flag is ignored.

collect2: error: ld returned 255 exit status
```

### Motivation and Context
AIX linker doesn't support -rpath option , so blocking this option under
AIX.
2024-11-07 13:22:22 -08:00
raoanag
f16036b6f5
[DML EP] Prefer MatMulInteger over MatMulIntegerToFloat in case of (#22469)
### Description
Skip `MatMulIntegerToFloat` fusion in case of DML EP for cases where
model uses Quantization before `MatMulInteger`. This is mainly done to
be resource efficient, and we have better `MatMulInteger` Metacommand
coverage which computes in int data type



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-11-07 10:02:01 -08:00
Yulong Wang
a436b3af1a
[webgpu] fix indices type when it's 4D (#22758)
### Description

Fix indices type from `array<u32, 4>` to `vec4<u32>` when the variable
is 4D.
2024-11-07 08:10:05 -08:00
jzm-intel
6a295eb75b
[JS/WebGPU] Creating devices with subgroup features enabled if possible (#21833)
This CL make WebGPU backend support subgroup features and thus allow
using subgroup optimizations in the future.

### Description
With this CL WebGPU backends will create devices with subgroups and
subgroups-f16 features (both are under origin trial in Chrome) or
chromium-experimental-subgroups feature enabled whenever available.

### Motivation and Context
This CL would allow WebGPU operator shaders to use subgroup
optimizations in the future, and might get some significant speedup with
these optimization.
2024-11-07 02:13:40 -08:00
Yifan Li
3b7a6eba69
[TensorRT EP] support TensorRT 10.6-GA (#22644)
### Description
<!-- Describe your changes. -->
* Update CI with TRT 10.6
* Update oss parser to [10.6-GA-ORT-DDS
](https://github.com/onnx/onnx-tensorrt/tree/10.6-GA-ORT-DDS) and update
dependency version
* Update Py-cuda11 CI to use trt10.6


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
(There will be 3rd PR to further reduce trt_version hardcoding)
2024-11-06 14:33:46 -08:00
Adrian Lizarraga
aa0cf1c5e1
[Quant Tool] Update QDQ Pad, Slice, Softmax (#22676)
### Description
Updates python quantization tool:
- Ensures QDQ Pad has equal quantization parameters across input and
output for certain Pad configurations.
- Ensures QDQ Slice always has equal quantization parameters across
input and output.
- Fixes bug when Softmax is _excluded_ from quantization.


### Motivation and Context
QDQ Pad and Slice have lower latency on QNN EP when their quantization
parameters are equal.
2024-11-06 14:06:29 -08:00
Caroline Zhu
0221693e43
[Mobile] Add E2E BrowserStack tests for iOS tests (#22610)
### Description
- Changes running the E2E iOS tests from running in App Center to
running in BrowserStack
- Steps for running locally can be found in the OneNote

### Motivation and Context
- Follow-up of #22117 
- App Center (the previous platform for running E2E mobile tests) is
getting deprecated in 2025

### Misc info
Additional build steps were required to get the necessary testing
artifacts for BrowserStack. App Center consumed an entire folder, while
BrowserStack requests the following:
1. a ZIP file of all the tests
2. an IPA file of the test app

#### Flow
Here is a rough outline of what is happening in the pipeline:
1. The build_and_assemble_apple_pods.py script builds the relevant
frameworks (currently, this means packages for iOS and Mac)
4. The test_apple_packages.py script installs the necessary cocoapods
for later steps
5. XCode task to build for testing builds the iOS target for the test
app
6. Now that the test app and the tests have been built, we can zip them,
creating the tests .zip file
7. To create the IPA file, we need to create a .plist XML file which is
generated by the generate_plist.py script.
- Attempts to use the Xcode@5 task to automatically generate the plist
file failed.
- Also, building for testing generates some plist files -- these cannot
be used to export an IPA file.
8. We run the Xcode task to build an .xcarchive file, which is required
for creating an IPA file.
9. We use xcodebuild in a script step to build an IPA file with the
xcarchive and plist files from the last two steps.
10. Finally, we can run the tests using the BrowserStack script.

---------

Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2024-11-06 11:22:29 -08:00
Adrian Lizarraga
4f6993d567
[Quant Tool] Prevent int32 quantized bias from clipping by adjusting the weight's scale (#22020)
### Description
Fixes scenario in which a bias input quantized to int32 has a scale that
is too small. A bias with a scale that is smaller than a certain
threshold will overflow the range of an `int32` when quantized, which
significantly decreases accuracy.

Credit to @yihonglyu for finding out about this issue and the fix.

### Motivation and Context
Consider the following Convolution with very small weights and a
constant bias input of `[5, -4.5]`.

![image](https://github.com/user-attachments/assets/4bde2bd9-892f-4ae9-887b-61a6668779a1)

The QDQ quantizer first computes the following quantization scale for
`input_0` and `weight`:
- `input_0`: scale=0.5
- `weight`: scale=7.843e-10 **[really small]**

The QDQ quantizer then computes the bias input's scale as follows:
```
bias_scale = input_0_scale * weight_0_scale = 0.5 * 7.843e-10 = 3.9215686274509805e-11
```

This `bias_scale` is too small. Before this PR, the QDQ quantizer would
quantize the f32 bias with this `bias_scale`:
```
bias_quant = round(bias_f32 / bias_scale) =  round([5.0/bias_scale, -4.5/bias_scale]) = [127500000000, -114750000000]
```
These quantized bias values exceed the range of int32, and so are
clipped to [int32.min(), int32.max()], which is very inaccurate.

#### New approach
This PR increases the `weight_0_scale` by the necessary amount to ensure
that `bias_scale` (which equals `weight_0_scale * input_0_scale`) is
appropriate for the int32 quantization type.

The smallest valid bias scale is given by the normal scale formula: 
`bias_smallest_valid_scale = (bias_f32_max - bias_f32_min) / (int32_max
- int32_min)`

Then, we compute the candidate bias scale:
`bias_scale_candidate = input_0_scale * weight_0_scale`

If the candidate scale is smaller than the smallest valid scale, we
increase the `weight_0_scale` by the necessary ratio:
```python
if bias_scale_candidate < bias_smallest_valid_scale:
    ratio = bias_smallest_valid_scale / bias_scale_candidate
    weight_0_scale = ratio * weight_0_scale
```

Then, we recompute the final bias scale:
```python
bias_scale = input_0_scale * weight_0_scale
```

#### Impact on accuracy
Here's the above model's quantized output compared to the f32
(ground-truth) output.
- Before PR: 
  - f32 model output[0]: **5.0f**
  - qdq model output[0]: **0.075**
  - SNR: 0.1369 (higher is better)
- After PR:
  - f32 model output[0]: **5.0f**
  - qdq model output[0]: **4.992**
  - SNR: 55.656 (higher is better)
2024-11-06 10:44:54 -08:00
Adrian Lizarraga
2c1b17ce98
[Quant Tool] Introduce get_qdq_config() helper to get QDQ configurations (#22677)
### Description
Introduces the `get_qdq_config()` function to get a quantization
configuration for a full integer QDQ model. This function provides an
easier way of specifying commonly used options and sets convenient
defaults. Specifically:

- Instead of requiring the user to pass a dictionary of `extra_options`,
the new interface adds function parameters for common settings:
  - All calibrator settings
  - Whether activations/weights are symmetric
  - Whether to keep or fuse relu/clip into Q
  - Minimum real range for quantization
  - Dictionary of tensor quantization overrides.
- Automatically scans the input floating-point model and fills out the
operator types to quantize. Otherwise, only a limited number of operator
types would be quantized by default.
- Detects if the input model uses external data. If so, ensures that the
generated QDQ model also uses external data.
- Detects if the model will use newly introduced quantization types
(int4/int16) with an older opset. If so, forces the use of the
`com.microsoft` domain for Q/DQ ops, which support all types.
- Automatically enables the "extra option" called
`ForceQuantizeNoInputCheck` to ensure data movement operators (e.g.,
Transpose) are always quantized.
- User can pass a function to indicate which nodes to exclude from
quantization.
- The user can still pass their own `extra_options` to override any of
the above if necessary.
 
```python
from onnxruntime.quantization import get_int_qdq_config, quantize # , ...

# Get QDQ configuration
qdq_config = get_int_qdq_config(
    float_model,
    data_reader,
    calibrate_method=CalibrationMethod.Percentile,
    calibrate_args={"percentile": 99.98},  # Converted to extra_options
    activation_type=QuantType.QUInt8,
    weight_type=QuantType.QInt8,
    per_channel=True,
    nodes_to_exclude=["Mul"], # Could also be a function. Ex: `lambda model, node: node.op_type == "Softmax"`

    # Other options converted to extra_options:
    min_real_range=0.0001,
    keep_removable_activations=True,
    activation_symmetric=True,
    weight_symmetric=True,
)

# Quantize model
quantize(float_model_path, qdq_model_path, qdq_config)
```
### Motivation and Context
Need a version of `get_qnn_qdq_config()` that is not EP-specific.
2024-11-06 10:27:02 -08:00
Tianlei Wu
72186bbb71
[CUDA] Build nhwc ops by default (#22648)
### Description

* Build cuda nhwc ops by default.
* Deprecate `--enable_cuda_nhwc_ops` in build.py and add
`--disable_cuda_nhwc_ops` option

Note that it requires cuDNN 9.x. If you build with cuDNN 8, NHWC ops
will be disabled automatically.

### Motivation and Context

In general, NHWC is faster than NCHW for convolution in Nvidia GPUs with
Tensor Cores, and this could improve performance for vision models.

This is the first step to prefer NHWC for CUDA in 1.21 release. Next
step is to do some tests on popular vision models. If it help in most
models and devices, set `prefer_nhwc=1` as default cuda provider option.
2024-11-06 09:54:55 -08:00
Tianlei Wu
ba22d7879a
[CUDA/ROCm] Conditionally support ArgMax and ArgMin for opset 12 and above (#22713)
### Description
Based on https://github.com/microsoft/onnxruntime/pull/9700, and extend
it to ArgMin as well.

This pull request introduces several enhancements and fixes related to
the `ArgMax` and `ArgMin` operators in the CUDA execution provider. The
changes ensure proper handling of these operators across different
versions and improve kernel registration and fallback mechanisms.

Key changes include:

#### Enhancements to `ArgMax` and `ArgMin` Operators:

* Added new kernel class registrations for `ArgMax` and `ArgMin` for
different data types and versions in
`onnxruntime/core/providers/cuda/cuda_execution_provider.cc`.
[[1]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R966-R972)
[[2]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R1209-R1215)
[[3]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R1657-R1659)
[[4]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285L1825-L1827)
[[5]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R1933-R1939)
[[6]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R2174-R2180)

* Introduced `ArgMaxOrArgMinNeedFallbackToCPU` function to handle
fallback to CPU when the `select_last_index` attribute is set to 1, as
CUDA does not support this attribute.
[[1]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R2597-R2622)
[[2]](diffhunk://#diff-57ba769b54dce57acd89df47140ede5f29ea670d61176096076701912d573285R2672-R2674)

#### Macro and Kernel Registration Improvements:

* Replaced `REGISTER_KERNEL_UNTIL_VERSIONED_TYPED` with
`REGISTER_KERNEL_VERSIONED_RANGE_TYPED` and
`REGISTER_KERNEL_VERSIONED_SINCE_TYPED` macros for better version
handling.
[[1]](diffhunk://#diff-ee5316fc3898058f70e942d9a84de36be4c7da09f144633a2504236430d5d033L19-R29)
[[2]](diffhunk://#diff-ee5316fc3898058f70e942d9a84de36be4c7da09f144633a2504236430d5d033L40-R46)

* Updated kernel registration for `ArgMax` and `ArgMin` to use the new
macros, ensuring proper version handling and support for different data
types.

#### Safety Checks:

* Added safety checks in the `ArgMax` and `ArgMin` classes to ensure
`select_last_index` is not set to 1, as it is not supported on CUDA.
[[1]](diffhunk://#diff-8ab09fef1f4a12cbf3b3432e509f8f1ef561e83c72778a0e047780060aeef6efL91-R99)
[[2]](diffhunk://#diff-8ab09fef1f4a12cbf3b3432e509f8f1ef561e83c72778a0e047780060aeef6efL101-R117)

#### Testing Enhancements:

* Added new tests for `ArgMax` and `ArgMin` operators to verify behavior
when `select_last_index` is set to 0, ensuring compatibility with both
CPU and CUDA execution providers.
[[1]](diffhunk://#diff-77affe1b70d1a9d38c2485f7c6b16ef2b6b541ed94dd727bc9b286f068f1481aR3340-R3360)
[[2]](diffhunk://#diff-77affe1b70d1a9d38c2485f7c6b16ef2b6b541ed94dd727bc9b286f068f1481aR3679-R3699)

### Motivation and Context
Improve CUDA kernel coverage for stable diffusion model and hence
improve its performance on CUDA
2024-11-06 09:54:32 -08:00
Tianlei Wu
d993ec313f
[CUDA] Fix NumericLimits (#22738)
### Description
* Fix `NumericLimits<float>` that used infinity as max, which is not
consistent with `std::numeric_limits<float>::max()`
In Windows, (float)(1e+300) is used for INFINITY, which causes compiler
error in Visual Studio 2022 v17.12 Preview 5.
* Rename `NumericLimits<T>::Min` to Lowest to be consistent with
std::numeric_limits
* Fix topk implementation: use `NumericLimits<CudaT>` instead of
`NumericLimits<T>` in kernel. That could avoid defining a confusing
defintion of `NumericLimits<MLFloat16>` that returns half instead of
MLFloat16.
* Use CUDART_MAX_NORMAL_FP16 if possible. It sets bits value directly,
which is faster than converting float to half.

Note that NumericLimits does not support __nv_bfloat16 and _nv_fp8_e4m3
and __nv_fp8_e5m2 right now.

### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/22728
2024-11-06 09:53:49 -08:00
Enrico Galli
1cb5ceedf3
[WebNN EP] Fix issues with MLTensor caching (#22701)
This PR fixes a bug that occurs when searching for compatible `MLTensor`
in the cache. We were missing checking the number of dimensions in the
shape. This would mean that a cached buffer of shape `[1]` could match
for `[1, 1, 256, 256]`.

This PR also adds better handling when attempting to force an `MLTensor`
to a different shape.
2024-11-06 09:17:11 -08:00
Yang Gu
811231e418
[js/webgpu] Destroy staging buffers aggressively during weights uploading (#22726)
In current implementation, all the staging buffers for weights uploading
are destroyed after first batch of kernel execution. It requires a lot
of memory as all the staging buffers couldn't be reused. It also hurts
the startup time (weights uploading only happens in session creation),
as weights uploading is delayed to a very late time.
This PR uses a very aggressive way to submit queue and destroy staging
buffers, so that the related GPU memory could be reused as much as
possible, though the real situation depends on the WebGPU and driver
implementation. The aggressive queue submission also moves GPU
operations to a very early time, which helps the startup time.
Some buffer uploading benchmarks are composed to compare multiple
solutions, regarding to the memory and time consumption. Benchmarks can
be found at
https://github.com/webatintel/webbench/blob/master/webgpu/buffer-upload.html,
while detailed test data can be found at

https://docs.google.com/document/d/1KgygOkb9ZNzkgzQ_tWOGlEI9ScmMBHDjDojjPFLmVXU/edit.
I also tested phi3.5 on 2 machines, first inference time improved from
5141ms to 3579ms and from 4327ms to 2947ms separately.
2024-11-06 08:55:15 -08:00
Edward Chen
742a0d30be
[C# MauiModelTester] Fix icon name in Info.plist (#21666)
Fix icon name in Info.plist. It now matches the icon at `csharp/tools/MauiModelTester/Resources/AppIcon/onnxruntime_icon.png`.
2024-11-05 16:55:38 -08:00
Jian Chen
deee48002c
Enable CUDA Python Test (#22717)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-11-05 16:26:50 -08:00
Yulong Wang
0371e92419
[webgpu] change default validation mode (#22730)
### Description

Change default validation mode in Release build from "wgpuOnly" to
"basic"
2024-11-05 16:18:05 -08:00