Commit graph

11997 commits

Author SHA1 Message Date
Jian Chen
5659d055ee
Fix Linux CI pipeline where ep was not provided for py-packaging-linux-test-cpu.yml (#22828)
### Description
The current linux-ci-pipeline was broken because required parameters were
not provided to the `py-packaging-linux-test-cpu.yml` template


### Motivation and Context
Fix Linux CI pipeline
2024-11-14 09:41:37 -08:00
Tianlei Wu
09c98433e7
[CUDA] stable diffusion benchmark allows IO binding for optimum (#22834)
### Description

Update stable diffusion benchmark:
(1) allow IO binding for optimum.
(2) do not use num_images_per_prompt across all engines for fair
comparison.

Example: run the optimum benchmark on Stable Diffusion 1.5:
```
git clone https://github.com/tianleiwu/optimum
cd optimum
git checkout tlwu/diffusers-io-binding
pip install -e .

pip install -U onnxruntime-gpu
git clone https://github.com/microsoft/onnxruntime
cd onnxruntime/onnxruntime/python/tools/transformers/models/stable_diffusion
git checkout tlwu/benchmark_sd_optimum_io_binding
pip install -r requirements/cuda12/requirements.txt

optimum-cli export onnx --model runwayml/stable-diffusion-v1-5  --task text-to-image ./sd_onnx_fp32

python optimize_pipeline.py -i ./sd_onnx_fp32 -o ./sd_onnx_fp16 --float16
python benchmark.py -e optimum -r cuda -v 1.5 -p ./sd_onnx_fp16
python benchmark.py -e optimum -r cuda -v 1.5 -p ./sd_onnx_fp16 --use_io_binding
```

Example output on H100_80GB_HBM3: 572 ms with I/O binding versus 588 ms
without; I/O binding saves 16 ms, or about 2.7%.

### Motivation and Context

Optimum is working on enabling I/O binding:
https://github.com/huggingface/optimum/pull/2056. This could help test
the impact of I/O binding on stable diffusion performance.
2024-11-14 00:09:07 -08:00
Michael Tyler
dd99e34d66
Enable ConvReplaceWithQLinear when using ACL (#22823)
### Description
Enable the ConvReplaceWithQLinear graph optimization when using the ACL
execution provider.



### Motivation and Context
Fixes an issue where quantized Conv nodes followed by ReLU don't get
converted to QLinearConv, so ACL sees the weights as mutable and
therefore cannot run the Conv node.

Signed-off-by: Michael Tyler <michael.tyler@arm.com>
2024-11-13 21:44:50 -08:00
Wanming Lin
82681205e4
[WebNN] Fix MLTensorUsage is undefined issue (#22831)
`MLTensorUsage` has been removed from Chromium:
https://chromium-review.googlesource.com/c/chromium/src/+/6015318. We
still need compatibility with older Chrome versions, so it is simply
left `undefined` on the latest Chrome versions.
2024-11-13 20:22:22 -08:00
Jian Chen
f423b737a9
Fix Linux python CUDA package pipeline (#22803)
### Description
Making the `-p` option optional in the Linux python CUDA package pipeline



### Motivation and Context
The Linux stage of the Python-CUDA-Packaging-Pipeline has failed since
the merge of #22773
2024-11-13 14:20:21 -08:00
microsoft-github-policy-service[bot]
6d7603f054
Auto-generated baselines by 1ES Pipeline Templates (#22817) 2024-11-13 13:50:52 -08:00
Bin Miao
a15381d7fc
[WebNN EP] Fix issues of GRU operator (#22123)
### Description
This PR fixes the spelling of the key value of the GRU operator in the
map in the `GetSupportedNodes` function (Gru -> GRU) and removes the
data type check for the fifth input (sequence_lens) of the GRU operator.

PTAL, thanks!
2024-11-13 13:34:34 -08:00
Hector Li
a9b62fa8da
Keep the model metadata on the generated EP context model (#22825)
### Description
Keep the model metadata on the generated EP context model
2024-11-13 11:52:21 -08:00
Chi Lo
fa4cbcd36b
[TensorRT EP] Add new provider option to exclude nodes from running on TRT (#22681)
Add new provider option `trt_op_types_to_exclude`:
- User can provide op type list to be excluded from running on TRT
- e.g. `trt_op_types_to_exclude="MaxPool"`

There is a known performance issue with the DDS ops (NonMaxSuppression,
NonZero and RoiAlign) in TRT versions 10.0 to 10.7. The TRT EP excludes
DDS ops from running on TRT by default; users can override the default
with an empty string to include all ops.
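As an illustrative sketch (not from the PR; the model path and EP list are placeholders), the new option can be passed through the Python API as a provider-options dictionary:

```python
# Sketch: pass trt_op_types_to_exclude as a TensorRT EP provider option.
# The default below matches the excluded DDS ops; pass "" to include all ops.
providers = [
    ("TensorrtExecutionProvider", {
        "trt_op_types_to_exclude": "NonMaxSuppression,NonZero,RoiAlign",
    }),
    "CPUExecutionProvider",
]

# Session creation requires a TensorRT-enabled build and a model file:
# import onnxruntime as ort
# sess = ort.InferenceSession("model.onnx", providers=providers)
print(providers[0][1]["trt_op_types_to_exclude"])
```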
2024-11-13 11:34:43 -08:00
shiyi
3adcf4d714
[WebNN] Remove validation for coordinate_transformation_mode (#22811)
The performance cost of falling back to the CPU EP is high for several
resampling nodes and causes multiple partitions in SD Turbo and VAE
decoder. Since the asymmetric mode with nearest to floor and integer
scales is identical to half_pixel anyway, stick with the WebNN EP.
2024-11-13 11:12:00 -08:00
Xu Xing
ff57ac4f3d
[js/webgpu] Add scatterND (#22755)
2024-11-13 09:13:00 -08:00
liqun Fu
bc2b1b5e37
Fix issue #22796 - a typo: (__GNUC__ > 9) -> (__GNUC__ > 10) (#22807)
### Description
fix #22796 
Signed-off-by: liqunfu <liqun.fu@microsoft.com>
2024-11-12 18:56:35 -08:00
Xiang Zhang
69a36eb231
Revert Implement DML copy for Lora Adapters (#22814)
Revert https://github.com/microsoft/onnxruntime/pull/22396
2024-11-12 17:45:59 -05:00
Jing Fang
7fa69461fd
[ARM] MatMulNBits FP16 support - kernels only (#22806)
### Description
A break down PR of https://github.com/microsoft/onnxruntime/pull/22651
Add fp16 kernels.



2024-11-12 14:28:47 -08:00
Jiajia Qin
7e0dd9d433
[js/webgpu] Optimize Expand (#22752)
Use components = 4 if possible.

llama3.2-1B becomes 20 tokens/s from 18 tokens/s on my iGPUs.
2024-11-12 12:37:19 -08:00
Jiajia Qin
05c8dc9d1c
[js/webgpu] Optimize ConvTranspose (#22774)
BUG #22031 

The overall time of ConvTranspose in Demucs model becomes 517.41 ms from
1415.65 ms on my iGPUs.
2024-11-12 12:37:07 -08:00
Bin Miao
67f5be0da2
[WebNN EP] Support LRN operator (#22775)
WebNN doesn't provide a dedicated op for LRN, so a chain of WebNN ops
emulates it in the WebNN EP:
pow -> transpose -> pad -> averagePool -> transpose -> mul -> add -> pow
-> div
@Honry @fdwr PTAL, thanks!
2024-11-12 11:53:52 -08:00
junchao-zhao
fd5b1a18ee
Fix LARCH64 compile error (#22759)
### Description

LoongArch has not yet implemented the AIsSigned qgemm path, so this adds
a bypass for it
2024-11-12 11:47:43 -08:00
Jian Chen
75a44582ba
Update all JDK version to 17 (#22786) 2024-11-12 11:42:18 -08:00
Ted Themistokleous
2b0f3435d2
[MIGraphX EP] Add support for Gelu, BiasGelu, FastGelu operators (#22808)
### Description
Adds support for different flavours of gelu already supported in
MIGraphX
2024-11-12 11:04:15 -08:00
dtang317
9836ef1c89
register Identity and QLinearMatmul for opset21 (#22804)
### Description
This PR registers the following opset 21 operators:

Identity-21
QLinearMatMul-21



2024-11-12 09:36:19 -08:00
amarin16
f0ac5e0d3d
Update skip layer norm (#22719)
Update the `SkipLayerNorm` implementation to address issues.
2024-11-12 07:01:30 -08:00
Wanming Lin
cdc8db9984
[WebNN] Fixed WebNN Module undefined issue (#22795)
`Module.jsepRegisterMLConstant` will be shortened by the Closure
Compiler in official release builds, which would cause an undefined
error.

Fix it by using `Module['jsepRegisterMLConstant']`.
2024-11-11 21:31:24 -08:00
Adrian Lizarraga
0ad44d0f79
[Quant Tool] Flaky test due to Pad reflect bug (#22798)
### Description
Fixes a unit test that would fail intermittently due to an existing bug
with Pad (reflect mode). When the number of padded values is >= the
inner dimension size, the ORT Pad implementation accesses invalid
memory. This PR makes the number of padding values less than the inner
dimension size to avoid triggering the bug.


### Motivation and Context
See related issues:
https://github.com/microsoft/onnxruntime/issues/8265
https://github.com/microsoft/onnxruntime/issues/11828
https://github.com/microsoft/onnxruntime/issues/20801

Here's a valgrind trace obtained on a Linux machine (with
`sess_options.enable_cpu_mem_arena = False`)
```
==864228== Invalid read of size 4
==864228==    at 0x2716272A: void onnxruntime::PadInnermostAxis<unsigned int>(unsigned int*, unsigned int*, long, unsigned long) (pad.cc:370)
==864228==    by 0x2715D213: onnxruntime::common::Status onnxruntime::PadImpl<unsigned int>(onnxruntime::OpKernelContext*, absl::lts_20240722::InlinedVector<long, 10ul, std::allocator<long> > const&, absl::lts_20240722::InlinedVector<long, 10ul, std::allocator<long> > const&, onnxruntime::Mode const&, unsigned int) (pad.cc:551)
==864228==    by 0x2715B2BB: onnxruntime::Pad::Compute(onnxruntime::OpKernelContext*) const (pad.cc:725)
==864228==    by 0x276FF6A7: onnxruntime::ExecuteKernel(onnxruntime::StreamExecutionContext&, unsigned long, unsigned long, bool const&, onnxruntime::SessionScope&) (sequential_executor.cc:484)
==864228==    by 0x276F4A04: onnxruntime::LaunchKernelStep::Execute(onnxruntime::StreamExecutionContext&, unsigned long, onnxruntime::SessionScope&, bool const&, bool&) (execution_steps.cc:73)
...
```

The above is obtained with the basic Pad(reflect) example on the [ONNX
Pad operator spec
page](https://onnx.ai/onnx/operators/onnx__Pad.html#summary):

```python
data = [
    [1.0, 1.2],
    [2.3, 3.4],
    [4.5, 5.7],
]

pads = [0, 2, 0, 0]

mode = 'reflect'

# Expected output by ONNX spec
expected_output = [
    [1.0, 1.2, 1.0, 1.2],
    [2.3, 3.4, 2.3, 3.4],
    [4.5, 5.7, 4.5, 5.7],
]

# Bugged output from onnxruntime has invalid/uninitialized data for the first element in the inner dimension
# invalid data may be 0.0, inf, nan, etc.
ort_output = [
    [inf, 1.2, 1.0, 1.2],
    [inf, 3.4, 2.3, 3.4],
    [inf, 5.7, 4.5, 5.7],
]
```
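For reference, the spec-compliant reflect behavior for this example can be sketched in pure Python (helper names are ours, not ORT's):

```python
def reflect_index(i, n):
    # Fold an out-of-range index back into [0, n-1] by mirroring around
    # the edges without repeating them (ONNX 'reflect' mode).
    period = 2 * (n - 1)
    i = abs(i) % period
    return period - i if i >= n else i

def pad_reflect_inner(data, before, after):
    # Reflect-pad the innermost axis of a 2-D list.
    n = len(data[0])
    return [[row[reflect_index(i, n)] for i in range(-before, n + after)]
            for row in data]

data = [[1.0, 1.2], [2.3, 3.4], [4.5, 5.7]]
out = pad_reflect_inner(data, 2, 0)
# out matches expected_output above: [[1.0, 1.2, 1.0, 1.2], ...]
```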
2024-11-11 19:49:27 -08:00
shiyi
f7d1f0fc5e
Reland "[WebNN] Fallback the node when its output doesn't have shape info" (#22685)
The previous PR was reverted because it caused the whole model to fall
back when output shape info was missing. This PR fixes the issue by
removing redundant fallbacks.
2024-11-11 16:30:10 -08:00
Adrian Lizarraga
b1e0930eab
Fix build for linux python wheel (#22801)
### Description
Fixes command for building Linux python packages by preventing an empty
`-p` command-line option from being passed to a subsequent build script:
1f3b675453/tools/ci_build/github/linux/run_python_dockerbuild.sh (L37)



### Motivation and Context
A recent
[PR](https://github.com/microsoft/onnxruntime/pull/22773) introduced a
new optional command-line option (`-p`) to pass custom python exe paths.
We need to check whether the option is empty before forwarding it to a
separate build script.
2024-11-11 15:20:07 -08:00
Jian Chen
885a7acd45
Fix warning - LegacyKeyValueFormat: "ENV key=value" should be used instead of legacy "ENV key value" format (#22800)
### Description
This PR fixes the warning `LegacyKeyValueFormat: "ENV key=value" should
be used instead of legacy "ENV key value" format` in all Dockerfiles
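For illustration (the variable name is hypothetical), the change is of this shape:

```dockerfile
# Legacy space-separated form that triggers the warning:
#   ENV MY_VAR value
# Preferred key=value form:
ENV MY_VAR=value
```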



2024-11-11 13:05:34 -08:00
Xavier Dupré
1f3b675453
Fix MatMulBnFusion to exclude cases when tensors are not 2D tensors (#22762)
### Description
Fixes #22512: MatMul and Add could be fused into a single Gemm even when
tensor dimensions are > 2. This PR excludes those cases.



### Motivation and Context
ORT crashes on valid models due to that unexpected fusion.
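A small shape-only sketch (our helper names) of why the fusion is restricted: Gemm is strictly 2-D, while MatMul broadcasts leading batch dimensions, so a >2-D MatMul result has no Gemm equivalent:

```python
def matmul_out_shape(a, b):
    # Simplified ONNX MatMul shape rule: one side carries the batch dims.
    assert a[-1] == b[-2], "inner dimensions must match"
    batch = a[:-2] or b[:-2]
    return list(batch) + [a[-2], b[-1]]

def gemm_compatible(a, b):
    # Gemm only accepts 2-D inputs.
    return len(a) == 2 and len(b) == 2

print(matmul_out_shape([2, 3, 4], [4, 5]))  # [2, 3, 5]: a 3-D result
print(gemm_compatible([2, 3, 4], [4, 5]))   # False: must not fuse into Gemm
print(gemm_compatible([3, 4], [4, 5]))      # True: fusion is valid here
```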
2024-11-11 19:48:25 +01:00
Dmitri Smirnov
c5276ac448
Revert "enable serialize prepacked weights into data file (#22256)" (#22788)
This reverts commit c5b6be045f.

### Description
Revert

### Motivation and Context
This needs a simpler and more robust approach
2024-11-11 09:59:05 -08:00
sheetalarkadam
e8f1d73b0b
Add Android QNN Browserstack test (#22434)
Add Android QNN Browserstack test



### Motivation and Context
Real device test in CI
2024-11-10 16:10:29 -08:00
Preetha Veeramalai
c9ed016b12
OVEP Dynamic WorkloadType support (#22779)
### Description
Support setting EP dynamic options in OVEP

### Motivation and Context
Related to https://github.com/microsoft/onnxruntime/pull/22282

---------

Co-authored-by: Javier E. Martinez <javier.e.martinez@intel.com>
2024-11-09 23:26:29 -08:00
shiyi
63cb53257b
[WebNN] Support steps >= 1 for slice operator (#22708)
Co-authored-by: Wanming Lin <wanming.lin@intel.com>
2024-11-09 18:20:52 -08:00
Wanming Lin
b9b1a0353a
[WebNN] QDQ's axis should be used for broadcasting (#22721)
For per-axis quantization/dequantization, WebNN requires the scale and
zero_point inputs to be broadcastable. The axis should be used to
reshape these two inputs.
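A minimal sketch (our helper, illustrative only) of the reshape this implies: a per-axis scale/zero_point of length C is reshaped so it broadcasts against the data tensor, e.g. `[C]` -> `[1, C, 1, 1]` for axis 1 of a 4-D input:

```python
def broadcast_shape_for_axis(input_rank, axis, channels):
    # Put the channel count at `axis` and 1 everywhere else so the
    # 1-D scale/zero_point broadcasts against the data tensor.
    return [channels if i == axis else 1 for i in range(input_rank)]

print(broadcast_shape_for_axis(4, 1, 32))  # [1, 32, 1, 1]
```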
2024-11-09 18:19:46 -08:00
zz002
d3ad76b2cf
[VitisAI] Cache node subgraph when necessary (#22073)
### Description

[VitisAI] Cache node subgraph when necessary


---------

Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>
Co-authored-by: zhenzew <zhenzew@amd.com>
2024-11-08 23:17:16 -08:00
Yi Zhang
ef281f850a
Add XNNPack build on Linux ARM64 and improve Linux CPU (#22773)
### Description
1. Add XNNPack build on Linux ARM64
2. Build only one python wheel for PR request.

[AB#49763](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/49763)



### Motivation and Context
XNNPack is built on Linux ARM64 rather than Windows ARM64 because
KleidiAI doesn't support Windows:

```
IF(XNNPACK_TARGET_PROCESSOR STREQUAL "arm64" AND XNNPACK_ENABLE_ARM_I8MM AND NOT CMAKE_C_COMPILER_ID STREQUAL "MSVC")
  IF (XNNPACK_ENABLE_KLEIDIAI)
    MESSAGE(STATUS "Enabling KleidiAI for Arm64")
  ENDIF()
ELSE()
  SET(XNNPACK_ENABLE_KLEIDIAI OFF)
ENDIF()
```

---------
2024-11-09 11:26:19 +08:00
Justin Chu
a8539ec7d1
Ignore all whitespace lint messages for cpplint (#22781)
### Description

Ignore all whitespace lint messages for cpplint. Remove redundant
configs in dml/.

### Motivation and Context

They are handled automatically by clang-format and create too much
noise in the PR files tab.
2024-11-08 14:31:28 -08:00
Adrian Lizarraga
020d52d92c
[Quant Tool] Add reduce_range option to get_qdq_config() (#22782)
### Description
Adds `reduce_range` option to `get_qdq_config()`



### Motivation and Context
Makes it easier to set this option when calling `get_qdq_config()`.
Otherwise, the user has to set the option manually.
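As a rough sketch of what the option means (our helper; the tool's exact endpoints may differ): `reduce_range` quantizes to a 7-bit range instead of the full 8-bit range:

```python
def qrange(signed, reduce_range):
    # Integer range used for quantization; reduce_range drops one bit.
    bits = 7 if reduce_range else 8
    if signed:
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1

print(qrange(signed=True, reduce_range=False))  # (-128, 127)
print(qrange(signed=True, reduce_range=True))   # (-64, 63)
```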
2024-11-08 14:04:11 -08:00
xhcao
b5ee4ac760
[js/webgpu] support GridSample operator (#22652)
2024-11-08 11:02:36 -08:00
jzm-intel
d9b91682f1
WebGPU JSEP: Make shader code not depend on input broadcasting patterns (#22536)
This PR makes MatMul shaders depend only on input ranks and the shapes
provided in uniforms, not on the inputs' broadcasting pattern. This
fixes an issue where shader code differed across broadcasting patterns
while having identical cache keys, resulting in wrong cache hits.
2024-11-08 11:00:51 -08:00
Michael Cho
4d614e15bd
Fix build with GCC 11 (#22770)
### Description

Fix a build error seen with GCC 11 when building at Homebrew on our
Linux x86_64 Ubuntu 22.04 CI (GitHub action runner).


### Motivation and Context

When building latest v1.20.0 at Homebrew
(https://github.com/Homebrew/homebrew-core/pull/196547), we hit a build
failure with GCC 11:
```
 [ 65%] Building CXX object CMakeFiles/onnxruntime_optimizer.dir/tmp/onnxruntime-20241103-6403-lh3bwj/onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc.o
  /home/linuxbrew/.linuxbrew/Homebrew/Library/Homebrew/shims/linux/super/g++-11 -DCPUINFO_SUPPORTED_PLATFORM=1 -DEIGEN_MPL2_ONLY -DEIGEN_USE_THREADS -DENABLE_CPU_FP16_TRAINING_OPS -DHAS_STRING_VIEW=1 -DNSYNC_ATOMIC_CPP11 -DONLY_C_LOCALE=0 -DONNX_ML=1 -DONNX_NAMESPACE=onnx -DORT_ENABLE_STREAM -DORT_NO_RTTI -DPLATFORM_POSIX -DPROTOBUF_USE_DLLS -D_GNU_SOURCE -I/tmp/onnxruntime-20241103-6403-lh3bwj/build/_deps/utf8_range-src -I/tmp/onnxruntime-20241103-6403-lh3bwj/include/onnxruntime -I/tmp/onnxruntime-20241103-6403-lh3bwj/include/onnxruntime/core/session -I/tmp/onnxruntime-20241103-6403-lh3bwj/build/_deps/pytorch_cpuinfo-src/include -I/tmp/onnxruntime-20241103-6403-lh3bwj/build -I/tmp/onnxruntime-20241103-6403-lh3bwj/onnxruntime -I/tmp/onnxruntime-20241103-6403-lh3bwj/build/_deps/onnx-src -I/tmp/onnxruntime-20241103-6403-lh3bwj/build/_deps/onnx-build -ffunction-sections -fdata-sections -Wno-restrict  -DCPUINFO_SUPPORTED -O3 -DNDEBUG -fPIC -fno-rtti -Wall -Wextra -Wno-deprecated-copy -Wno-tautological-pointer-compare -Wno-nonnull-compare -Wno-ambiguous-reversed-operator -Wno-deprecated-anon-enum-enum-conversion -Wno-undefined-var-template -Wno-deprecated-builtins -Wshorten-64-to-32 -Werror -MD -MT CMakeFiles/onnxruntime_optimizer.dir/tmp/onnxruntime-20241103-6403-lh3bwj/onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc.o -MF CMakeFiles/onnxruntime_optimizer.dir/tmp/onnxruntime-20241103-6403-lh3bwj/onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc.o.d -o CMakeFiles/onnxruntime_optimizer.dir/tmp/onnxruntime-20241103-6403-lh3bwj/onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc.o -c /tmp/onnxruntime-20241103-6403-lh3bwj/onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc
  /tmp/onnxruntime-20241103-6403-lh3bwj/onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc: In function ‘void onnx_transpose_optimization::Permute1DConstant(onnx_transpose_optimization::api::GraphRef&, onnx_transpose_optimization::api::NodeRef&, onnx_transpose_optimization::api::TensorRef&, size_t, std::string_view, const std::vector<long int>&)’:
  /tmp/onnxruntime-20241103-6403-lh3bwj/onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc:1114:10: error: ‘memcpy’ is not a member of ‘std’; did you mean ‘wmemcpy’?
   1114 |     std::memcpy(dst, src, bytes_per_val);
        |          ^~~~~~
        |          wmemcpy
```

It is possible this error may not occur on different GCC versions if
`cstring` has been indirectly included by another header.
2024-11-07 21:04:57 -08:00
Jian Chen
e7987a6b0b
Replace reference to python 3.8 with python 3.10 (#22692)
### Description
This PR sets the default python to 3.10 everywhere except
tools/ci_build/github/azure-pipelines/bigmodels-ci-pipeline.yml. This is
needed because we are no longer using python 3.8.

This PR excludes changes for the Big Models CI because it will require
additional changes, which will be tracked in
USER STORY 52729



2024-11-07 16:51:40 -08:00
Ranjit Ranjan
193671295e
[AIX] Fix for AIX build break (#22745)
### Description
With recent changes, the build error below occurs on AIX.

```
ld: 0706-012 The -p flag is not recognized.
ld: 0706-012 The -a flag is not recognized.
ld: 0706-012 The -t flag is not recognized.
ld: 0706-012 The -h flag is not recognized.
ld: 0706-012 The -= flag is not recognized.
ld: 0706-012 The -$ flag is not recognized.
ld: 0706-012 The -$ flag is not recognized.
ld: 0706-012 The -O flag is not recognized.
ld: 0706-027 The -R IGIN flag is ignored.

collect2: error: ld returned 255 exit status
```

### Motivation and Context
The AIX linker doesn't support the -rpath option, so this option is
blocked on AIX.
2024-11-07 13:22:22 -08:00
raoanag
f16036b6f5
[DML EP] Prefer MatMulInteger over MatMulIntegerToFloat in case of (#22469)
### Description
Skip `MatMulIntegerToFloat` fusion on the DML EP for cases where the
model uses quantization before `MatMulInteger`. This is mainly done to
be resource efficient, and the `MatMulInteger` metacommand has better
coverage, computing in an integer data type.



2024-11-07 10:02:01 -08:00
Yulong Wang
a436b3af1a
[webgpu] fix indices type when it's 4D (#22758)
### Description

Fix indices type from `array<u32, 4>` to `vec4<u32>` when the variable
is 4D.
2024-11-07 08:10:05 -08:00
jzm-intel
6a295eb75b
[JS/WebGPU] Creating devices with subgroup features enabled if possible (#21833)
This CL makes the WebGPU backend support subgroup features and thus
allows using subgroup optimizations in the future.

### Description
With this CL WebGPU backends will create devices with subgroups and
subgroups-f16 features (both are under origin trial in Chrome) or
chromium-experimental-subgroups feature enabled whenever available.

### Motivation and Context
This CL would allow WebGPU operator shaders to use subgroup
optimizations in the future, and might achieve significant speedups
with these optimizations.
2024-11-07 02:13:40 -08:00
Yifan Li
3b7a6eba69
[TensorRT EP] support TensorRT 10.6-GA (#22644)
### Description
* Update CI with TRT 10.6
* Update oss parser to [10.6-GA-ORT-DDS
](https://github.com/onnx/onnx-tensorrt/tree/10.6-GA-ORT-DDS) and update
dependency version
* Update Py-cuda11 CI to use trt10.6


### Motivation and Context
(There will be a 3rd PR to further reduce trt_version hardcoding.)
2024-11-06 14:33:46 -08:00
Adrian Lizarraga
aa0cf1c5e1
[Quant Tool] Update QDQ Pad, Slice, Softmax (#22676)
### Description
Updates python quantization tool:
- Ensures QDQ Pad has equal quantization parameters across input and
output for certain Pad configurations.
- Ensures QDQ Slice always has equal quantization parameters across
input and output.
- Fixes bug when Softmax is _excluded_ from quantization.


### Motivation and Context
QDQ Pad and Slice have lower latency on QNN EP when their quantization
parameters are equal.
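A toy sketch (our helpers, not the tool's code) of why equal parameters help for Slice: the op only moves values, so with a shared (scale, zero_point) the quantized slice needs no requantization step:

```python
def q(x, scale, zp):
    # Quantize to the int8 range.
    return max(-128, min(127, round(x / scale) + zp))

def dq(xq, scale, zp):
    return (xq - zp) * scale

scale, zp = 0.5, 0
data = [1.0, 2.0, 3.0, 4.0]
qdata = [q(v, scale, zp) for v in data]

# Slice in the quantized domain, then dequantize with the *same* params:
sliced = [dq(v, scale, zp) for v in qdata[1:3]]
print(sliced)  # [2.0, 3.0] -- exact, no extra quantization error
```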
2024-11-06 14:06:29 -08:00
Caroline Zhu
0221693e43
[Mobile] Add E2E BrowserStack tests for iOS tests (#22610)
### Description
- Changes running the E2E iOS tests from running in App Center to
running in BrowserStack
- Steps for running locally can be found in the OneNote

### Motivation and Context
- Follow-up of #22117 
- App Center (the previous platform for running E2E mobile tests) is
getting deprecated in 2025

### Misc info
Additional build steps were required to get the necessary testing
artifacts for BrowserStack. App Center consumed an entire folder, while
BrowserStack requests the following:
1. a ZIP file of all the tests
2. an IPA file of the test app

#### Flow
Here is a rough outline of what is happening in the pipeline:
1. The build_and_assemble_apple_pods.py script builds the relevant
frameworks (currently, this means packages for iOS and Mac)
2. The test_apple_packages.py script installs the necessary cocoapods
for later steps
3. The Xcode build-for-testing task builds the iOS target for the test
app
4. Now that the test app and the tests have been built, we can zip them,
creating the tests .zip file
5. To create the IPA file, we need to create a .plist XML file which is
generated by the generate_plist.py script.
- Attempts to use the Xcode@5 task to automatically generate the plist
file failed.
- Also, building for testing generates some plist files -- these cannot
be used to export an IPA file.
6. We run the Xcode task to build an .xcarchive file, which is required
for creating an IPA file.
7. We use xcodebuild in a script step to build an IPA file with the
xcarchive and plist files from the last two steps.
8. Finally, we can run the tests using the BrowserStack script.

---------

Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2024-11-06 11:22:29 -08:00
Adrian Lizarraga
4f6993d567
[Quant Tool] Prevent int32 quantized bias from clipping by adjusting the weight's scale (#22020)
### Description
Fixes scenario in which a bias input quantized to int32 has a scale that
is too small. A bias with a scale that is smaller than a certain
threshold will overflow the range of an `int32` when quantized, which
significantly decreases accuracy.

Credit to @yihonglyu for finding out about this issue and the fix.

### Motivation and Context
Consider the following Convolution with very small weights and a
constant bias input of `[5, -4.5]`.

![image](https://github.com/user-attachments/assets/4bde2bd9-892f-4ae9-887b-61a6668779a1)

The QDQ quantizer first computes the following quantization scale for
`input_0` and `weight`:
- `input_0`: scale=0.5
- `weight`: scale=7.843e-10 **[really small]**

The QDQ quantizer then computes the bias input's scale as follows:
```
bias_scale = input_0_scale * weight_0_scale = 0.5 * 7.843e-10 = 3.9215686274509805e-11
```

This `bias_scale` is too small. Before this PR, the QDQ quantizer would
quantize the f32 bias with this `bias_scale`:
```
bias_quant = round(bias_f32 / bias_scale) =  round([5.0/bias_scale, -4.5/bias_scale]) = [127500000000, -114750000000]
```
These quantized bias values exceed the range of int32, and so are
clipped to [int32.min(), int32.max()], which is very inaccurate.

#### New approach
This PR increases the `weight_0_scale` by the necessary amount to ensure
that `bias_scale` (which equals `weight_0_scale * input_0_scale`) is
appropriate for the int32 quantization type.

The smallest valid bias scale is given by the normal scale formula: 
`bias_smallest_valid_scale = (bias_f32_max - bias_f32_min) / (int32_max
- int32_min)`

Then, we compute the candidate bias scale:
`bias_scale_candidate = input_0_scale * weight_0_scale`

If the candidate scale is smaller than the smallest valid scale, we
increase the `weight_0_scale` by the necessary ratio:
```python
if bias_scale_candidate < bias_smallest_valid_scale:
    ratio = bias_smallest_valid_scale / bias_scale_candidate
    weight_0_scale = ratio * weight_0_scale
```

Then, we recompute the final bias scale:
```python
bias_scale = input_0_scale * weight_0_scale
```
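Putting the steps above together with the example's numbers (a sketch; variable names follow the text):

```python
import math

input_0_scale = 0.5
weight_0_scale = 7.843e-10
bias_f32_min, bias_f32_max = -4.5, 5.0
int32_min, int32_max = -(2 ** 31), 2 ** 31 - 1

# Before the fix: the candidate scale quantizes the bias far outside int32.
bias_scale_candidate = input_0_scale * weight_0_scale
overflowing = round(bias_f32_max / bias_scale_candidate)  # ~127_500_000_000
assert overflowing > int32_max

# After the fix: bump weight_0_scale so the bias scale is at least the
# smallest valid scale for int32.
bias_smallest_valid_scale = (bias_f32_max - bias_f32_min) / (int32_max - int32_min)
if bias_scale_candidate < bias_smallest_valid_scale:
    ratio = bias_smallest_valid_scale / bias_scale_candidate
    weight_0_scale = ratio * weight_0_scale

bias_scale = input_0_scale * weight_0_scale
assert math.isclose(bias_scale, bias_smallest_valid_scale)
```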

#### Impact on accuracy
Here's the above model's quantized output compared to the f32
(ground-truth) output.
- Before PR: 
  - f32 model output[0]: **5.0f**
  - qdq model output[0]: **0.075**
  - SNR: 0.1369 (higher is better)
- After PR:
  - f32 model output[0]: **5.0f**
  - qdq model output[0]: **4.992**
  - SNR: 55.656 (higher is better)
2024-11-06 10:44:54 -08:00
Adrian Lizarraga
2c1b17ce98
[Quant Tool] Introduce get_qdq_config() helper to get QDQ configurations (#22677)
### Description
Introduces the `get_qdq_config()` function to get a quantization
configuration for a full integer QDQ model. This function provides an
easier way of specifying commonly used options and sets convenient
defaults. Specifically:

- Instead of requiring the user to pass a dictionary of `extra_options`,
the new interface adds function parameters for common settings:
  - All calibrator settings
  - Whether activations/weights are symmetric
  - Whether to keep or fuse relu/clip into Q
  - Minimum real range for quantization
  - Dictionary of tensor quantization overrides.
- Automatically scans the input floating-point model and fills out the
operator types to quantize. Otherwise, only a limited number of operator
types would be quantized by default.
- Detects if the input model uses external data. If so, ensures that the
generated QDQ model also uses external data.
- Detects if the model will use newly introduced quantization types
(int4/int16) with an older opset. If so, forces the use of the
`com.microsoft` domain for Q/DQ ops, which support all types.
- Automatically enables the "extra option" called
`ForceQuantizeNoInputCheck` to ensure data movement operators (e.g.,
Transpose) are always quantized.
- User can pass a function to indicate which nodes to exclude from
quantization.
- The user can still pass their own `extra_options` to override any of
the above if necessary.
 
```python
from onnxruntime.quantization import get_qdq_config, quantize # , ...

# Get QDQ configuration
qdq_config = get_qdq_config(
    float_model,
    data_reader,
    calibrate_method=CalibrationMethod.Percentile,
    calibrate_args={"percentile": 99.98},  # Converted to extra_options
    activation_type=QuantType.QUInt8,
    weight_type=QuantType.QInt8,
    per_channel=True,
    nodes_to_exclude=["Mul"], # Could also be a function. Ex: `lambda model, node: node.op_type == "Softmax"`

    # Other options converted to extra_options:
    min_real_range=0.0001,
    keep_removable_activations=True,
    activation_symmetric=True,
    weight_symmetric=True,
)

# Quantize model
quantize(float_model_path, qdq_model_path, qdq_config)
```
### Motivation and Context
Need a version of `get_qnn_qdq_config()` that is not EP-specific.
2024-11-06 10:27:02 -08:00