Commit graph

11559 commits

Author SHA1 Message Date
xhcao
3bfb5e4f62
[js/webgpu] support float16 for Clip (#21584)
2024-08-28 13:19:20 -07:00
Wanming Lin
59114227fd
[WebNN EP] Remove NHWC preferred layout (#21570)
The WebNN CPU backend now supports the NCHW layout in Chromium, so we
can drop the NHWC preferred layout for the CPU backend in the WebNN EP to
simplify the code.
2024-08-28 13:17:34 -07:00
Ye Wang
bf8855ba3c
Support Smooth Softmax in fmha (#21885)
### Description
refer to https://github.com/microsoft/onnxruntime/pull/21867

---------

Co-authored-by: Your Name <you@example.com>
2024-08-28 09:29:33 -07:00
AlbertGuan9527
ef073fd8f4
Add session and run option workload_type for applications to set efficient mode. (#21781)
### Description
This PR adds the session and run option `workload_type`. This option is the
knob for applications to enable or disable the processor performance
efficient mode.



### Motivation and Context
The efficient mode is co-engineered with processor vendors to allow
applications to voluntarily be serviced at a more energy-efficient
performance level. This functionality can be used by long-running,
latency-insensitive applications to reduce energy consumption.
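As a rough illustration, here is a minimal Python sketch of how an application might set such an option. The config entry keys and values below (`"session.workload_type"`, `"run.workload_type"`, `"Efficient"`, `"Default"`) and the model path are assumptions for illustration, not necessarily the exact names added by this PR.

```python
# Hedged sketch: key/value names are assumed, not taken verbatim from this PR.
import onnxruntime as ort

so = ort.SessionOptions()
so.add_session_config_entry("session.workload_type", "Efficient")  # assumed key/value
sess = ort.InferenceSession("model.onnx", sess_options=so)         # hypothetical model path

ro = ort.RunOptions()
ro.add_run_config_entry("run.workload_type", "Default")            # assumed key/value; override per run
```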
2024-08-28 08:17:01 -07:00
Jian Chen
e95277484e
Adding $(Build.SourcesDirectory)s to the ignoreDirectories (#21878) 2024-08-27 19:56:48 -07:00
George Wu
23f3912334
support both qnn x64 and arm64ec stages in py packaging pipeline (#21880)
Both arm64ec and x64 packages are needed:
x64 is needed for offline context binary generation,
and arm64ec is needed for interop with Python packages that don't have
prebuilt arm64 packages and only have x64.
2024-08-27 15:07:30 -07:00
Yulong Wang
d2a1b7a353
Introduce custom external data loader (#21634)
### Description

This PR introduces support for custom external data loaders. An EP can
register a custom external data loader to override the default behavior,
making it possible to upload initializers directly to the GPU.



### Motivation and Context

- In ONNX Runtime Web, WebAssembly uses a 32-bit pointer type
(`sizeof(size_t)==4`), which means there is a 4GB hard limit on the
maximum memory. As ONNX models get larger, this becomes a blocker
for supporting medium-sized language models.

- ORT runs out of memory because the current code always loads data into
CPU memory, including the .onnx file (protobuf) and external data
file(s). However, when using a GPU EP, the big data does not need to be kept
on the CPU, because all ORT does is load the data into
memory, upload it to the GPU, and then release it.

- Some platforms offer developers a way to upload data directly to the
GPU. For example, WebGPU allows uploading from any ArrayBuffer (it can
be a side buffer that does not count toward the 4GB limit) directly to the
GPU. This helps reduce CPU memory usage significantly.

### Design

Class `ExternalDataLoader` and `ExternalDataLoaderManager` are
introduced. They are similar to `DataTransfer` and
`DataTransferManager`. `InferenceSession` owns the manager object, and
`SessionState` keeps a reference to it.

Added a new method `GetExternalDataLoader` in `IExecutionProvider`. An
EP can override the method to register an instance of custom external
data loader.

The key function in an `ExternalDataLoader` class is the `LoadTensor` method:

```c++
  // the tensor is pre-created using the TensorProto info of the initializer and the MemoryInfo (from allocation plan).
  virtual common::Status LoadTensor(const Env& env,
                                    const std::filesystem::path& data_file_path,
                                    FileOffsetType data_offset,
                                    SafeInt<size_t> data_length,
                                    Tensor& tensor) const;
```

This function can be registered by an EP and, going through a few layers,
is eventually reached from `DeserializeTensorProto()` in the finalizing stage
of session initialization, where initializer tensors are created. The
behavior is changed to first look for a registered external data loader
that can handle the current memory info. If one is available, it is used;
otherwise the old code path is followed.
2024-08-27 12:18:52 -07:00
Caroline Zhu
b7f09d4c27
Increase timeout for orttraining-linux-gpu pipeline (#21844)
### Description
Increase timeout to 160 minutes

### Motivation and Context
- Recent runs of orttraining-linux-gpu pipeline have been timing out
2024-08-27 11:47:12 -07:00
Jian Chen
7f851f4e61
Removing docker_base_image parameter and variables (#21864)
### Description
Removing the `docker_base_image` parameter and variables from the CUDA
packaging pipeline.



### Motivation and Context
Since the docker image is hard-coded in

`onnxruntime/tools/ci_build/github/linux/docker/inference/x86_64/default/cuda12/Dockerfile`

and

`onnxruntime/tools/ci_build/github/linux/docker/inference/x86_64/default/cuda11/Dockerfile`,

this parameter and variable are no longer needed.
2024-08-27 10:36:17 -07:00
Ye Wang
1d059b8702
Phi3 MoE cuda kernel (#21819)

---------

Co-authored-by: Your Name <you@example.com>
2024-08-27 09:21:30 -07:00
Jiajia Qin
252222034f
[js/webgpu] Support Reshape/Shape 21+ on jsep (#21871)
### Description
#21618

With this PR, the cross-device copying (`MemcpyToHost`) can be removed
entirely for the `wav2vec2` model, and the overall time drops from 604 ms
to 48 ms.

2024-08-27 09:02:39 -07:00
mcollinswisc
5d54dc1462
Drop QDQ around more nodes (#21376)
### Description

Extends the Drop QDQ optimization to remove DequantizeLinear and
QuantizeLinear nodes from around operators:

- Flatten
- Expand
- Tile
- Slice
- GatherElements
- ReduceMin
- ReduceMax

### Motivation and Context

To reduce floating-point conversions in quantized inference. This is mainly
motivated by the Flatten case, since that shows up in graphs
exported from PyTorch to ONNX, but to make the change complete it is
extended to a larger set of ops for which this optimization is valid.

https://github.com/microsoft/onnxruntime/issues/21375

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2024-08-27 16:54:37 +10:00
Tianlei Wu
6e57576988
Support Smooth Softmax in GroupQueryAttention (#21867)
### Description

Softmax (formula 1) is like the following:
```math
y_{i} = \frac{exp(x_{i})}{\sum_{i} exp(x_{i})}
```
After applying softmax, each element will be in the range of $(0, 1)$,
and the elements will add up to 1, so that they can be interpreted as
probabilities.

However, in language models, softmax has two issues:
* When all elements are -inf (for example, a whole row is masked when a
query token is padding), the result is not defined, since exp(-inf)=0 and
a divide-by-zero is encountered in the above formula.
* Why do we need to normalize in a way that treats each query word as
equally important (each row sums to 1)?

**Smooth Softmax** (formula 2) is a modified version that introduces a
smooth factor like the following:
```math
s_{i} = \frac{exp(x_{i})}{1+ \sum_{i} exp(x_{i})}
```

This formula could tackle the above two issues:
* It handles the special case where all elements are -inf: the
result $s_{i}$ is 0 for every element in that case.
* The sum of all elements $\sum_{i}{s_{i}} = \frac{\sum_{i}{exp(x_{i})}}{1+
\sum_{i} exp(x_{i})}$ is in the range of (0, 1), so we can train
the model to assign different importance to different query words.

Since the exponential is prone to overflow or underflow, formula 3 can be
used to get a stable result:
```math
s_{i} = \frac{exp(x_{i} + c)}{exp(c)+ \sum_{i} exp(x_{i} +c)}
```
In theory, c can be any value. In practice, the choice of the constant c should
avoid $exp(c)$ and $exp(x_{i} +c)$ overflowing (or underflowing) at the same
time. A reasonable choice is formula 4:
```math
c=-\max_{i} \{ x_i \}
```
or apply the constraint c <= 0 as in the following formula 5:

```math
c=-\max(0, \max_{i} \{ x_i \})
```
The latter (formula 5) ensures that $s_{i}$ falls back to formula
2 when all elements are negative.

For the CPU provider, smooth softmax is implemented in MLAS. The CPU
implementation uses formula 5.

@wangyems implemented the smooth softmax in flash attention for CUDA,
which requires an Ampere or newer GPU. The implementation of smooth softmax
in flash attention uses formula 4.
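As an illustration of the math above, a small NumPy sketch of formula 5 (a sketch of the formula only, not the MLAS or flash attention code):

```python
import numpy as np

def smooth_softmax(x: np.ndarray) -> np.ndarray:
    # Formula 5: c = -max(0, max_i x_i); the denominator gains an extra exp(c) term.
    c = -max(0.0, float(np.max(x)))
    e = np.exp(x + c)
    return e / (np.exp(c) + np.sum(e))

print(smooth_softmax(np.array([1.0, 2.0, 3.0])))  # elements sum to slightly less than 1
print(smooth_softmax(np.full(4, -np.inf)))        # all zeros, no divide-by-zero
```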

---------

Co-authored-by: Ye Wang
2024-08-26 23:13:15 -07:00
Yulong Wang
99bc45dcbd
[js] add big data file to formatter ignore list (#21767)
### Description

Add the big data file `web/test/data/ops/pad-big.jsonc` to the formatter
ignore list. This file slows down the formatter quite a lot locally.
2024-08-26 22:08:26 -07:00
zz002
422e6e6fb0
[VitisAI] add OpSchema, VitisAI use IKernelLookup to check supported ops, VitisAI def_builder adds TypeConstraint related processing (#21688)
### Description
1. add OpSchema
2. VitisAI use IKernelLookup to check supported ops
3. VitisAI def_builder adds TypeConstraint related processing


---------

Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>
2024-08-26 21:16:44 -07:00
Satya Kumar Jandhyala
af18824f43
[JS/WebGPU] Add GatherBlockQuantized op support (#21734)
### Description
Add GatherBlockQuantized operator to JSEP.



### Motivation and Context
Gemma model requires this.
2024-08-26 14:46:04 -07:00
Tianlei Wu
ad382120fe
[CUDA] enable causal in MultiHeadAttention (#21852)
### Description
Enable causal in MultiHeadAttention cuda operator.

All formats (Q_K_V_BSNH_BSNH_BSNH, Q_K_V_BSNH_BNSH_BNSH, Q_KV_BSNH_BSN2H
and QKV_BSN3H) support causal for now. Internally, causal will be
dispatched to the flash attention, efficient attention, or unfused attention
kernel.

### Motivation and Context
Currently, MultiHeadAttention has causal enabled in the CPU EP but not in the
CUDA EP. This can cause issues in ONNX conversion, where some models can
run on CPU but not on CUDA. Enabling causal in CUDA reduces the
difference between the CPU and CUDA support matrices.
2024-08-26 13:34:55 -07:00
Xu Xing
d9c57ac7db
[js/webgpu] Enable pad f16 uniform (#21691)

---------

Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
2024-08-26 07:58:48 -07:00
Yi Zhang
2877de73e1
sign native dll with correct cert (#21854)
### Description
Fixed #21775



### Motivation and Context
The DLLs should be signed with Keycode CP-230012.
The default is the test code signing.
2024-08-26 16:46:19 +08:00
Caroline Zhu
983c4d57a4
Fix typo for react native pipeline (#21845)
### Description
fix typo

### Motivation and Context
[RN pipeline
failing](https://dev.azure.com/onnxruntime/onnxruntime/_build?definitionId=188&_a=summary)
since #21578 with this error:

![image](https://github.com/user-attachments/assets/75e5b968-572f-42cc-9816-7940de464cfa)
2024-08-26 12:05:11 +10:00
Ted Themistokleous
9a70475622
[MIGraphX EP Support]Remove default noopt for Migraphx EP in Benchmark.py (#21843)
…ripts (#58)

### Description
Removes the heavy-handed no-opt setting applied to all MIGraphX runs in the
benchmark.py scripts.


### Motivation and Context
We find this hurts performance when all optimizations are removed. Let the
fine-tuning occur at the script level instead of a blanket NoOPT being
selected.

Co-authored-by: Ted Themistokleous <tedthemistokleous@amd.com>
2024-08-24 22:01:08 -07:00
Jiajia Qin
87165b92e9
[js/webgpu] optimize MatmulNBits (#21747)
### Description
See a 2x speedup for phi3 on the integrated Intel GPU with this
optimization.

The optimization mainly stores input A's data in local variables
instead of loading it from global memory each time it is multiplied
with B's data.

2024-08-23 16:36:00 -07:00
duanshengliu
4af6291841
Refine op_types_to_quantize argument handling in matmul_4bits_quantizer.py (#21815)
### Description

Refine `op_types_to_quantize` argument handling in
matmul_4bits_quantizer.py

### Motivation and Context
The default `op_types_to_quantize` value `"MatMul"` causes
`tuple(args.op_types_to_quantize)` to become `('M', 'a', 't', 'M', 'u',
'l')`, which is not expected.
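A small sketch of the underlying Python behavior and one way an argument definition can avoid it (illustrative only; the quantizer's actual argument handling may differ):

```python
import argparse

# tuple() over a plain string splits it into characters, which is the reported problem:
print(tuple("MatMul"))  # ('M', 'a', 't', 'M', 'u', 'l')

# Declaring the argument with nargs="+" keeps the default as a list of op type names,
# so tuple(...) yields ('MatMul',) as intended.
parser = argparse.ArgumentParser()
parser.add_argument("--op_types_to_quantize", nargs="+", default=["MatMul"])
args = parser.parse_args([])
print(tuple(args.op_types_to_quantize))  # ('MatMul',)
```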
2024-08-23 13:45:06 -07:00
Sheil Kumar
44dcc3aafd
Replace "DML CPU" Allocator with onnxruntime::CpuAllocator (#21818)
### Description
Replace "DML CPU" Allocator with onnxruntime::CpuAllocator

### Motivation and Context
This allocator is being ignored by ORTExtensions, which causes CPU memory
to be treated as non-CPU memory and crashes in SentencepieceTokenizer.

In general, it seems this allocator is not used and can be handled
just fine by the default allocator.

---------

Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
2024-08-23 10:35:57 -07:00
Edward Chen
5726318ec0
[CoreML EP] Fix ArgMaxOpBuilder::AddToModelBuilderImpl() nullptr Node access. (#21797) 2024-08-23 10:19:53 -07:00
Frank Dong
4c4ae1e490
enable large initializer offset align for save external data in ORT (#21604)
### Description
Addresses issue #21524.
Enable offset alignment for models saved in the external data format.

The Python data converter fix is here: https://github.com/onnx/onnx/pull/6248

2024-08-22 23:29:14 -07:00
Jiajia Qin
27a6890529
[js/webgpu] Optimize conv1d by conv2d (#19388)
### Description

Optimize conv1d by routing it through the conv2d path so it can use
conv2d's optimizations.

The whisper-tiny-encoder model goes from 532.28 ms to 158.66 ms: Conv
goes to Conv2DMatMul (8 ms) instead of GroupedConv (382 ms).
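The general idea, illustrated with a small PyTorch sketch rather than the WebGPU EP code: a 1-D convolution is equivalent to a 2-D convolution with a unit height dimension, which is what lets conv1d reuse the conv2d path.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 8, 100)   # (batch, channels, length)
w = torch.randn(16, 8, 3)    # (out_channels, in_channels, kernel)

y1 = F.conv1d(x, w)                                        # direct 1-D convolution
y2 = F.conv2d(x.unsqueeze(2), w.unsqueeze(2)).squeeze(2)   # reshape to (N, C, 1, L), conv2d, reshape back

print(torch.allclose(y1, y2, atol=1e-5))  # True: both paths compute the same result
```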

Old profiling result:
Kernel | Time (ms) | Percentage (%)
-- | -- | --
Conv\|GroupedConv | 382.99 | 71.95
MatMul | 126.16 | 23.70
Softmax | 7.01 | 1.32
Transpose | 4.59 | 0.86
Add | 4.39 | 0.82
Mul | 2.36 | 0.44
Div | 1.44 | 0.27
ReduceMean\|ReduceMeanShared | 1.25 | 0.23
Erf | 0.85 | 0.16
Sub | 0.72 | 0.14
Pow | 0.46 | 0.09
Sqrt | 0.07 | 0.01
Sum | 532.28 |  

New profiling result with this PR:

Kernel | Time (ms) | Percentage (%)
-- | -- | --
MatMul | 127.07 | 80.09
Conv\|Conv2DMatMul | 8.00 | 5.04
Softmax | 6.95 | 4.38
Transpose | 4.65 | 2.93
Add | 4.26 | 2.68
Mul | 2.56 | 1.61
Div | 1.51 | 0.95
ReduceMean\|ReduceMeanShared | 1.31 | 0.83
Erf | 0.85 | 0.54
Sub | 0.79 | 0.50
Pow | 0.46 | 0.29
Conv\|Transpose | 0.26 | 0.17
Sqrt | 0.00 | 0.00
Sum | 158.66 |  

---------

Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
2024-08-22 22:56:07 -07:00
Preetha Veeramalai
0368dd4ea4
Ovep 1.19 bug fix 2 (#21829)
### Description
Handles bug fix for EPCtx file path assertions.

---------

Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com>
Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>
2024-08-22 19:36:22 -07:00
Yueqing Zhang
37a7dd7d63
[VitisAI] optimize model clone (#21706)
### Description
Optimize the memory consumption of model_clone, which is a crucial part
of our model preparation.


### Motivation and Context
This is crucial for meeting the requirements of Microsoft's 8.15
release.

---------

Co-authored-by: Yueqing Zhang <yueqingz@amd.com>
Co-authored-by: Chunye Wang <chunywan@amd.com>
2024-08-22 13:28:31 -07:00
Guenther Schmuelling
ba7baae994
Revert "Upgrade emsdk from 3.1.59 to 3.1.62" (#21817)
Reverts microsoft/onnxruntime#21421

Users are seeing Chrome memory grow to 16GB before it crashes:
https://github.com/microsoft/onnxruntime/issues/21810

Revert for now so we have time to debug.
2024-08-22 11:21:00 -07:00
Adrian Lizarraga
514b4699b4
[QNN EP] Apply workaround for Conv validation bug when bias input is implicit (#21764)
### Description
- Adds a dummy bias of all zeros when translating a Conv without an
explicit bias input. This is a workaround for a QNN validation issue
that fails when the optional bias input is not provided.
- Corrects logic for unpacking of **non-zero int4** zero-points. The bug
does not impact models because we currently only support int4
zero-points equal to 0 (symmetric quant). But this would become an issue
in the future if/when QNN supports non-zero int4 zero-points (so it is good to
fix now); see the unpacking sketch below.



### Motivation and Context
Support Conv operators without a bias input on QNN EP with the latest
QNN SDK.
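Regarding the int4 zero-point fix mentioned above, a small NumPy sketch of unpacking two signed int4 values per byte. The low-nibble-first packing convention is an assumption here; this illustrates the kind of sign-extension logic involved, not the EP's actual code.

```python
import numpy as np

def unpack_int4(packed: np.ndarray, count: int) -> np.ndarray:
    """Unpack `count` signed int4 values stored two per byte (assumed low nibble first)."""
    unpacked = np.empty(count, dtype=np.int8)
    for i in range(count):
        byte = int(packed[i // 2])
        nibble = byte & 0x0F if i % 2 == 0 else (byte >> 4) & 0x0F
        # sign-extend the 4-bit value into the int8 range [-8, 7]
        unpacked[i] = nibble - 16 if nibble >= 8 else nibble
    return unpacked

# example: bytes 0xF1 and 0x07 hold the values [1, -1, 7, 0]
print(unpack_int4(np.array([0xF1, 0x07], dtype=np.uint8), 4))  # [ 1 -1  7  0]
```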
2024-08-22 10:38:03 -07:00
Jian Chen
6c1a3f85a6
Do not allow clearing Android logs if the emulator is not running (#21578)
### Description
Do not allow clearing Android logs if the emulator is not running



### Motivation and Context
Previously, the Clearing Android logs step got stuck until the pipeline
timeout if one of the previous steps failed.
2024-08-22 10:18:01 -07:00
Chen Feiyue
ff3e8b02c3
[VSINPU]Update vsinpu patches (#21402)
### Description
- update patches for accuracy modification && local result recording
2024-08-21 23:58:56 -07:00
Yueqing Zhang
3ff8ca29e5
[VitisAI] remove wrong error msg, required by Microsoft (#21715)
### Description
Remove legacy code and an incorrect error message.


### Motivation and Context
This is required by Microsoft to remove an unwanted error message, and is
required for the 8.15 release.

Co-authored-by: Yueqing Zhang <yueqingz@amd.com>
2024-08-21 21:10:28 -07:00
Tianlei Wu
25d7a4fa08
[CUDA] Update benchmark_mha.py to capture debug info to identify sdpa kernel (#21804)
Use debug info to identify the SDPA kernel actually used, and show it in the
output of benchmark_mha.py. This updated benchmark script was used to
get the benchmark results in
https://github.com/microsoft/onnxruntime/pull/21629.
(1) Change the debug info output format to emit text like SdpaKernel=*.
(2) Add a step to capture stdout from the onnxruntime session, and use a
regular expression to parse SdpaKernel=* from the captured text.

Other minor changes:
(1) Set different default repeats during benchmark: 100 for CPU; and
10000 for CUDA.
(2) Fix PrintTensorByDims used in console dumper: if it is not enabled,
do not dump tensor.
(3) Update some comments

### Motivation and Context

Sometimes we fall back from the requested sdpa_kernel. This could confuse users
unless we can tell which exact kernel is used in the benchmark.
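A minimal sketch of the parsing step described in (2). How stdout is captured and the exact kernel-name text are script-specific, so treat the details below as assumptions.

```python
import re

# Suppose `captured` holds text captured from the onnxruntime session's stdout
# while debug output is enabled (the capture mechanism itself is script-specific).
captured = "... SdpaKernel=FLASH_ATTENTION ..."  # illustrative sample text

match = re.search(r"SdpaKernel=(\w+)", captured)
sdpa_kernel = match.group(1) if match else "unknown"
print(sdpa_kernel)  # e.g. FLASH_ATTENTION
```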
2024-08-21 17:30:16 -07:00
Tianlei Wu
44a3923ba5
run sparse attention test sequentially (#21808)
### Description

For some reason, running SparseAttention tests in parallel causes random
failures in the CI pipeline, possibly due to running out of memory when too
many tests run in parallel.

This change runs those tests sequentially.
2024-08-21 17:24:58 -07:00
Jake Mathern
c0b68e77af
Fix warnings (#21809)
### Description
Minor changes to resolve some warnings in ORT

### Motivation and Context
Binskim for WindowsAI (which consumes ORT) treats warnings as errors,
and has hit these warnings.
As a security requirement, warnings like "signed/unsigned mismatch" must
be resolved.
2024-08-21 14:23:37 -07:00
Edward Chen
fb9ce18e88
Add K=0 check to MatMul<float>::Compute() specialization. (#21803)
Add K=0 check to `MatMul<float>::Compute()` specialization.
Add unit test to cover both primary template and float specialization.
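For reference, a small NumPy sketch of the expected semantics when K=0: the shared dimension is empty, so the product is an empty sum and the result is a correctly shaped zero matrix. This illustrates the behavior the check must preserve, not the ORT/MLAS code.

```python
import numpy as np

a = np.zeros((2, 0), dtype=np.float32)   # M=2, K=0
b = np.zeros((0, 3), dtype=np.float32)   # K=0, N=3
c = a @ b                                # empty inner dimension -> all-zero output

print(c.shape)  # (2, 3)
print(c)        # all zeros
```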
2024-08-21 09:15:58 -07:00
Ted Themistokleous
0e827c27fb
[MIGraphX EP] Add support for MIGraphX Exhaustive tune flag (#46) (#21599)
### Description
Set the exhaustive tune flag through the MIGraphX API and make this a
session option in ONNX Runtime.

### Motivation and Context
Allow users to use MIGraphX exhaustive tuning with ONNX Runtime
inference.
This goes hand in hand with save/load once a model has been compiled
and tuning results have been found.

---------

Co-authored-by: Ted Themistokleous <tedthemistokleous@amd.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
2024-08-21 07:32:12 -07:00
Ted Themistokleous
26a499323f
[MIGraphX EP Support] Update migx scripts (#21806)
### Description
No code changes to the EP, only changes to the scripts which invoke the
MIGraphX EP:

- One change explicitly sets the MIGraphX EP when running gpt2 testing
- The other ensures we turn off optimizations like TensorRT and allow
MIGraphX to handle graph optimizations


### Motivation and Context
MIGraphX has moved away from using rocBLAS; without this, some cases
used in CI will fail, as optimizations will attempt to use rocBLAS
kernels instead of the MIGraphX EP directly.
2024-08-21 07:22:42 -07:00
Ted Themistokleous
ed155ad46a
[MIGraphX EP] Ensure we support all inputs for MatMulInteger and ConvInteger. (#21680)
… to int8 for now

Allow for models with biases/full inputs and only check for int8 support
in the EP.

### Description
Allows all inputs for MatMulInteger and ConvInteger to be supported
for prequantized models.


### Motivation and Context
Fixes issues when using prequantized models that contain weight biases

---------

Co-authored-by: Ted Themistokleous <tedthemistokleous@amd.com>
2024-08-21 07:19:20 -07:00
mindest
009209e016
Fix Orttraining Linux Lazy Tensor CI Pipeline (#21652)
### Description
Fix `Orttraining Linux Lazy Tensor CI Pipeline`
- Remove the unused import of `torch.onnx._internal.exporter`, whose path has
changed in newer torch (pytorch/pytorch#132429).
- Move the import of `register_custom_op_symbolic` from `torch.onnx` into a
local function; the top-level import caused a circular import when running `import
torch.onnx` (at least in the CI environment).
2024-08-21 18:10:08 +08:00
Patrice Vignola
de6ebcbb54
[DML] Add int4 QDQ (#21592) 2024-08-20 23:44:58 -07:00
Yi Zhang
12f426c63f
update size limit check of training GPU wheel (#21762)
### Motivation and Context
The training wheel size limit should be 400M
2024-08-21 09:30:05 +08:00
Adrian Lizarraga
6fbb0ae81a
[TransposeOptimizer] Fix axis for QuantizeLinear inserted after DQ (per-channel) -> Unsqueeze (#21793)
### Description
- Fix computation of axis for `QuantizeLinear` inserted after the
sequence `DQ (per-channel) -> Unsqueeze`. Example:
  - Original: `DQ (axis = 0) -> Unsqueeze (axes = [0, 1, 2]) -> Op`
- After QDQ fix-up: `DQ (axis = 0) -> Unsqueeze (axes = [0, 1, 2]) -> Q
(axis = 3) -> DQ (axis = 3) -> Op`
- Before this PR, the axis for the inserted Q/DQ ops was not correctly
set to 3 (left as 0).
- Fix normalization of negative axis values for `QuantizeLinear`
inserted after the sequence `DQ (per-channel) ->Transpose`
  - Existing code added the wrong rank value to normalize the DQ axis.

### Motivation and Context
Fix errors in handling of per-channel DQ in code that fixes QDQ
NodeUnits.
2024-08-20 16:26:02 -07:00
Adrian Lizarraga
28c252c77e
[QNN EP] Fix compile error for QNN EP on Windows x64 due to missing /bigobj flag (#21795)
### Description
Compiling onnxruntime with QNN EP on Windows x86_64 results in a
compilation error:
```shell
$ onnxruntime\test\optimizer\qdq_transformer_test.cc(1,1): error C1128: number of sections exceeded object file format limit: compile with /bigobj [...onnxruntime\build\Debug\onnxruntime_test_all.vcxproj]
```

This PR adds the `/bigobj` compilation flag for the
`qdq_transformer_test.cc` file.
2024-08-20 10:11:43 -07:00
Tianlei Wu
fbc3927231
[CUDA] cuDNN Flash Attention (#21629)
### Description
- [x] Add cuDNN flash attention using cudnn frontend, and enable it in
MultiHeadAttention operator.
- [x] Support attention mask.
- [x] Support attention bias.
- [x] Update tests and benchmark script.

The cuDNN SDPA is disabled by default. To enable it, the following is needed:
(1) Requires cuDNN 9.3 or newer version installed.
(2) Set an environment variable `ORT_ENABLE_CUDNN_FLASH_ATTENTION=1` or
set `sdpa_kernel=8` cuda provider option to enable it.
(3) Only works on devices with compute capability >= 8.0.

Note that some combinations of parameters might be rejected due to
limited support of head dimension or sequence lengths.
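A minimal Python sketch of the two enabling mechanisms mentioned above. The model path is hypothetical, and the exact handling of the provider-option value is an assumption.

```python
import os
import onnxruntime as ort

# Option A: set the environment variable named above before creating the session.
os.environ["ORT_ENABLE_CUDNN_FLASH_ATTENTION"] = "1"

# Option B: select the kernel via the sdpa_kernel CUDA provider option mentioned above
# (value type/handling assumed for illustration).
providers = [("CUDAExecutionProvider", {"sdpa_kernel": 8}), "CPUExecutionProvider"]
sess = ort.InferenceSession("model_with_mha.onnx", providers=providers)  # hypothetical model
```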

Future Works:
(1) FP8 and BF16 APIs. Currently, only the FP16 API is exposed.
(2) Add API to support ragged batching (padding removed in inputs).
(3) Support other input formats (like QKV_BS3NH).
(4) Currently, q is converted to BSNH, and k/v are converted to either BSNH
or BNSH format. We may experiment to see whether converting q to
BNSH could be better in some cases.

### Example Benchmark Results on H100

The following tests are on FP16 MultiHeadAttention operator without
attention mask and attention bias.

#### Test Setting 1
batch_size | sequence_length | past_sequence_length | num_heads | head_size
-- | -- | -- | -- | --
16 | 256 | 0 | 32 | 128

format | average_latency | tflops | kernel
-- | -- | -- | --
Q,K,V (BNSH) | 0.000075 | 229.5 | torch:flash
Q,K,V (BNSH) | 0.000119 | 144.8 | torch:efficient
Q,K,V (BNSH) | 0.000224 | 76.5 | torch:math
Q,K,V (BSNH) | 0.000075 | 227.8 | ort:cudnn
Q,K,V (BSNH) | 0.000094 | 182.8 | ort:flash
Q,K,V (BSNH) | 0.000138 | 124.7 | ort:efficient
Q,K,V (BSNH) | 0.000438 | 39.3 | ort:math
Q,KV | 0.000129 | 133.0 | ort:cudnn
Q,KV | 0.000151 | 114.1 | ort:flash
Q,KV | 0.000194 | 88.5 | ort:efficient
QKV | 0.000154 | 111.8 | ort:cudnn
QKV | 0.000175 | 98.0 | ort:flash
QKV | 0.000217 | 79.0 | ort:efficient

#### Test Setting 2

batch_size | sequence_length | past_sequence_length | num_heads | head_size
-- | -- | -- | -- | --
16 | 512 | 0 | 16 | 64

format | average_latency | tflops | kernel
-- | -- | -- | --
Q,K,V (BNSH) | 0.000069 | 249.2 | torch:flash
Q,K,V (BNSH) | 0.000141 | 121.7 | torch:efficient
Q,K,V (BNSH) | 0.000294 | 58.5 | torch:math
Q,K,V (BSNH) | 0.000077 | 221.7 | ort:cudnn
Q,K,V (BSNH)  | 0.000087 | 196.6 | ort:flash
Q,K,V (BSNH)  | 0.000163 | 105.6 | ort:efficient
Q,K,V (BSNH)  | 0.000651 | 26.4 | ort:math
Q,KV | 0.000103 | 167.1 | ort:cudnn
Q,KV | 0.000117 | 146.3 | ort:flash
Q,KV | 0.000192 | 89.6 | ort:efficient
QKV | 0.000113 | 151.5 | ort:cudnn
QKV | 0.000128 | 134.7 | ort:flash
QKV | 0.000201 | 85.3 | ort:efficient
2024-08-20 08:50:22 -07:00
Yi Zhang
9f7e19cedd
[Fix] Make python API doc generation in Microsoft-hosted Agent (#21766)
### Motivation and Context
1. The Python API doc needs to be merged from a fork, but the 1ES self-hosted
pool only works with one GitHub repo.
2. ubuntu-latest installs numpy above 2.0 by default, and the current
Python API doc generation doesn't support it,
so I pin numpy < 2.0.0.

---------
2024-08-20 23:32:38 +08:00
Satya Kumar Jandhyala
1fb2e71ddc
[JS/WebGPU] Avoid producing presentKey/presentValue outputs if pastKey/pastValue … (#21782)
Avoid producing presentKey/presentValue outputs if pastKey/pastValue
don't exist.

2024-08-19 18:02:19 -07:00
Adrian Lizarraga
a22cc078b4
[QNN EP] Add support for GatherElements (#15966)
### Description
- Adds support for the GatherElements operator to QNN EP.
- Adds GatherElements to QDQ quantizer tool.

### Motivation and Context
Enable more models to run on QNN EP.
2024-08-19 14:33:40 -07:00