Commit graph

6778 commits

Author SHA1 Message Date
Adrian Lizarraga
514b4699b4
[QNN EP] Apply workaround for Conv validation bug when bias input is implicit (#21764)
### Description
- Adds a dummy bias of all zeros when translating a Conv without an
explicit bias input. This is a workaround for a QNN validation issue
that fails when the optional bias input is not provided.
- Corrects logic for unpacking of **non-zero int4** zero-points. Bug
does not impact models because we currently only support int4
zero-points equal to 0 (symmetric quant). But this would become an issue
in the future if/when QNN supports non-zero int4 zero-points (so good to
fix now).
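
As an illustration of the int4 unpacking mentioned above, here is a minimal NumPy sketch. It is not the QNN EP code; it assumes the ONNX packing convention in which the first element of each pair sits in the low nibble, and the function name `unpack_int4` is hypothetical.

```python
import numpy as np

def unpack_int4(packed, count, signed=True):
    """Unpack `count` 4-bit values from bytes (assumed: low nibble first)."""
    packed = np.asarray(packed, dtype=np.uint8)
    low = packed & 0x0F
    high = (packed >> 4) & 0x0F
    # Interleave so element 2*i comes from the low nibble of byte i.
    nibbles = np.stack([low, high], axis=-1).reshape(-1)[:count].astype(np.int8)
    if signed:
        # Sign-extend: nibble values 8..15 represent -8..-1.
        nibbles = np.where(nibbles >= 8, nibbles - 16, nibbles).astype(np.int8)
    return nibbles

zero_points = unpack_int4(np.array([0xF1, 0x08], dtype=np.uint8), count=3)
```

A non-zero zero-point such as `0xF` only decodes to `-1` when the sign extension is applied, which is the kind of case that stays hidden while all zero-points are 0.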



### Motivation and Context
Support Conv operators without a bias input on QNN EP with the latest
QNN SDK.
2024-08-22 10:38:03 -07:00
Chen Feiyue
ff3e8b02c3
[VSINPU]Update vsinpu patches (#21402)
### Description
- update patches for accuracy modification && local result recording
2024-08-21 23:58:56 -07:00
Yueqing Zhang
3ff8ca29e5
[VitisAI] remove wrong error msg, required by Microsoft (#21715)
### Description
Remove legacy code and wrong message.


### Motivation and Context
Microsoft requires the removal of this unwanted error message. The change
is needed for the 8.15 release.

Co-authored-by: Yueqing Zhang <yueqingz@amd.com>
2024-08-21 21:10:28 -07:00
Tianlei Wu
25d7a4fa08
[CUDA] Update benchmark_mha.py to capture debug info to identify sdpa kernel (#21804)
Use debug info to identify sdpa kernel actually used, and show it in the
output of benchmark_mha.py. This updated benchmark script was used to
get the benchmark results in
https://github.com/microsoft/onnxruntime/pull/21629.
(1) Change the output format of debug info to output like SdpaKernel=*
(2) Add a step to capture stdout from onnxruntime session, and use
regular expression to parse SdpaKernel=* from the captured text.
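
The parsing step in (2) could look roughly like the sketch below. The captured text and the kernel name `EXAMPLE_KERNEL` are placeholders, and `find_sdpa_kernel` is a hypothetical helper, not a function from benchmark_mha.py.

```python
import re

# Placeholder for stdout captured from an onnxruntime session; the real
# debug output may contain other fields and a different kernel name.
captured = "some debug text SdpaKernel=EXAMPLE_KERNEL more text\n"

def find_sdpa_kernel(text):
    """Return the value following 'SdpaKernel=', or None if absent."""
    match = re.search(r"SdpaKernel=(\w+)", text)
    return match.group(1) if match else None

kernel = find_sdpa_kernel(captured)
```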

Other minor changes:
(1) Set different default repeats during benchmark: 100 for CPU; and
10000 for CUDA.
(2) Fix PrintTensorByDims used in console dumper: if it is not enabled,
do not dump tensor.
(3) Update some comments

### Motivation and Context

Sometimes a fallback is used for an sdpa_kernel. This could confuse users
unless the benchmark reports the exact kernel that was used.
2024-08-21 17:30:16 -07:00
Tianlei Wu
44a3923ba5
run sparse attention test sequentially (#21808)
### Description

For some reason, running SparseAttention tests in parallel causes random
failures in the CI pipeline, possibly due to running out of memory when too
many tests run in parallel.

This change runs those tests sequentially.
2024-08-21 17:24:58 -07:00
Jake Mathern
c0b68e77af
Fix warnings (#21809)
### Description
Minor changes to resolve some warnings in ORT

### Motivation and Context
Binskim for WindowsAI (which consumes ORT) treats warnings as errors,
and has hit these warnings.
As a security requirement, warnings like "signed/unsigned mismatch" must
be resolved.
2024-08-21 14:23:37 -07:00
Edward Chen
fb9ce18e88
Add K=0 check to MatMul<float>::Compute() specialization. (#21803)
Add K=0 check to `MatMul<float>::Compute()` specialization.
Add unit test to cover both primary template and float specialization.
2024-08-21 09:15:58 -07:00
Ted Themistokleous
0e827c27fb
[MIGraphX EP] Add support for MIGraphX Exhaustive tune flag (#46) (#21599)
### Description
Set the exhaustive tune flag through the MIGraphX API and make this a
Session option in Onnxruntime

### Motivation and Context
Allow users to use MIGraphX Exhaustive tuning with Onnxruntime
inference.
This goes hand in hand with save/load after a model has been compiled
and tuning has completed.

---------

Co-authored-by: Ted Themistokleous <tedthemistokleous@amd.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
2024-08-21 07:32:12 -07:00
Ted Themistokleous
26a499323f
[MIGraphX EP Support] Update migx scripts (#21806)
### Description
No code changes to the EP, only changes to the scripts which invoke the
MIGraphX EP:

- One change explicitly sets the MIGraphX EP when running gpt2 testing.
- The other ensures we turn off optimizations like TensorRT and allow
MIGraphX to handle graph optimizations.


### Motivation and Context
MIGraphX has moved away from using rocBLAS. Without this change, some cases
used in CI will fail because optimizations will attempt to use rocBLAS
kernels instead of the MIGraphX EP directly.
2024-08-21 07:22:42 -07:00
Ted Themistokleous
ed155ad46a
[MIGraphX EP] Ensure we support all inputs for MatMulInteger and ConvInteger. (#21680)
… to int8 for now

Allow for models with biases/full input and only check for int8 support
in EP

### Description
Allows for all inputs for MatMulInteger and ConvInteger to be supported
for prequantized models


### Motivation and Context
Fixes issues when using prequantized models that contain weight biases

---------

Co-authored-by: Ted Themistokleous <tedthemistokleous@amd.com>
2024-08-21 07:19:20 -07:00
Patrice Vignola
de6ebcbb54
[DML] Add int4 QDQ (#21592) 2024-08-20 23:44:58 -07:00
Adrian Lizarraga
6fbb0ae81a
[TransposeOptimizer] Fix axis for QuantizeLinear inserted after DQ (per-channel) -> Unsqueeze (#21793)
### Description
- Fix computation of axis for `QuantizeLinear` inserted after the
sequence `DQ (per-channel) -> Unsqueeze`. Example:
  - Original: `DQ (axis = 0) -> Unsqueeze (axes = [0, 1, 2]) -> Op`
- After QDQ fix-up: `DQ (axis = 0) -> Unsqueeze (axes = [0, 1, 2]) -> Q
(axis = 3) -> DQ (axis = 3) -> Op`
- Before this PR, the axis for the inserted Q/DQ ops was not correctly
set to 3 (left as 0).
- Fix normalization of negative axis values for `QuantizeLinear`
inserted after the sequence `DQ (per-channel) ->Transpose`
  - Existing code added the wrong rank value to normalize the DQ axis.
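
The axis bookkeeping described above can be sketched in a few lines. These helpers are hypothetical illustrations, not the TransposeOptimizer code; they only show where a per-channel axis lands after `Unsqueeze` or `Transpose`, including the rank used to normalize negative values.

```python
def axis_after_unsqueeze(axis, input_rank, unsqueeze_axes):
    """Return where input dimension `axis` lands after Unsqueeze."""
    output_rank = input_rank + len(unsqueeze_axes)
    axis %= input_rank  # normalize a negative axis against the INPUT rank
    # Unsqueeze axes refer to positions in the OUTPUT shape.
    inserted = {a % output_rank for a in unsqueeze_axes}
    surviving = [i for i in range(output_rank) if i not in inserted]
    return surviving[axis]

def axis_after_transpose(axis, perm):
    """Return where input dimension `axis` lands after Transpose(perm)."""
    axis %= len(perm)  # normalize against the tensor rank
    return list(perm).index(axis)

# DQ (axis = 0) -> Unsqueeze (axes = [0, 1, 2]) on a 1-D input (assumed rank).
new_axis = axis_after_unsqueeze(0, input_rank=1, unsqueeze_axes=[0, 1, 2])
```

For the example in the description, `new_axis` is 3, matching the axis expected on the inserted Q/DQ pair.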

### Motivation and Context
Fix errors in handling of per-channel DQ in code that fixes QDQ
NodeUnits.
2024-08-20 16:26:02 -07:00
Tianlei Wu
fbc3927231
[CUDA] cuDNN Flash Attention (#21629)
### Description
- [x] Add cuDNN flash attention using cudnn frontend, and enable it in
MultiHeadAttention operator.
- [x] Support attention mask.
- [x] Support attention bias.
- [x] Update tests and benchmark script.

The cuDNN SDPA is disabled by default. Enabling it requires the following:
(1) cuDNN 9.3 or newer is installed.
(2) The environment variable `ORT_ENABLE_CUDNN_FLASH_ATTENTION=1` is set, or
the `sdpa_kernel=8` cuda provider option is set.
(3) The device has compute capability >= 8.0.
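
A minimal sketch of the two ways to opt in from Python, using the variable and option names given above. The `create_session` helper and its `model_path` argument are hypothetical, and creating the session requires an onnxruntime build with the CUDA EP.

```python
import os

# Option 1: environment variable, set before the session is created.
os.environ["ORT_ENABLE_CUDNN_FLASH_ATTENTION"] = "1"

# Option 2: CUDA provider option, passed per session.
providers = [
    ("CUDAExecutionProvider", {"sdpa_kernel": "8"}),
    "CPUExecutionProvider",
]

def create_session(model_path):
    """Create a session with the providers above (hypothetical helper)."""
    import onnxruntime as ort  # imported lazily so the sketch loads without it
    return ort.InferenceSession(model_path, providers=providers)
```

Either option alone should be enough according to item (2); both are shown only for comparison.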

Note that some combinations of parameters might be rejected due to
limited support of head dimension or sequence lengths.

Future Work:
(1) FP8 and BF16 APIs. Currently, only the API for FP16 is exposed.
(2) Add API to support ragged batching (padding removed in inputs).
(3) Support other input formats (like QKV_BS3NH).
(4) Currently, q is converted to BSNH, and k/v are converted to either BSNH
or BNSH format. Some experiments may show whether converting q to
BNSH could be better in some cases.

### Example Benchmark Results on H100

The following tests are on FP16 MultiHeadAttention operator without
attention mask and attention bias.

#### Test Setting 1
batch_size | sequence_length | past_sequence_length | num_heads | head_size
-- | -- | -- | -- | --
16 | 256 | 0 | 32 | 128

format | average_latency | tflops | kernel
-- | -- | -- | --
Q,K,V (BNSH) | 0.000075 | 229.5 | torch:flash
Q,K,V (BNSH) | 0.000119 | 144.8 | torch:efficient
Q,K,V (BNSH) | 0.000224 | 76.5 | torch:math
Q,K,V (BSNH) | 0.000075 | 227.8 | ort:cudnn
Q,K,V (BSNH) | 0.000094 | 182.8 | ort:flash
Q,K,V (BSNH) | 0.000138 | 124.7 | ort:efficient
Q,K,V (BSNH) | 0.000438 | 39.3 | ort:math
Q,KV | 0.000129 | 133.0 | ort:cudnn
Q,KV | 0.000151 | 114.1 | ort:flash
Q,KV | 0.000194 | 88.5 | ort:efficient
QKV | 0.000154 | 111.8 | ort:cudnn
QKV | 0.000175 | 98.0 | ort:flash
QKV | 0.000217 | 79.0 | ort:efficient

#### Test Setting 2

batch_size | sequence_length | past_sequence_length | num_heads | head_size
-- | -- | -- | -- | --
16 | 512 | 0 | 16 | 64

format | average_latency | tflops | kernel
-- | -- | -- | --
Q,K,V (BNSH) | 0.000069 | 249.2 | torch:flash
Q,K,V (BNSH) | 0.000141 | 121.7 | torch:efficient
Q,K,V (BNSH) | 0.000294 | 58.5 | torch:math
Q,K,V (BSNH) | 0.000077 | 221.7 | ort:cudnn
Q,K,V (BSNH)  | 0.000087 | 196.6 | ort:flash
Q,K,V (BSNH)  | 0.000163 | 105.6 | ort:efficient
Q,K,V (BSNH)  | 0.000651 | 26.4 | ort:math
Q,KV | 0.000103 | 167.1 | ort:cudnn
Q,KV | 0.000117 | 146.3 | ort:flash
Q,KV | 0.000192 | 89.6 | ort:efficient
QKV | 0.000113 | 151.5 | ort:cudnn
QKV | 0.000128 | 134.7 | ort:flash
QKV | 0.000201 | 85.3 | ort:efficient
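
The tflops column appears consistent with counting two matrix multiplications per attention call (4 * B * N * S * S * H floating-point operations when past_sequence_length is 0) and dividing by the average latency in seconds. That formula is inferred from the numbers in these tables and is an assumption, not taken from benchmark_mha.py.

```python
def attention_tflops(batch_size, sequence_length, num_heads, head_size, latency_seconds):
    """Estimate TFLOPS for unmasked attention (assumed FLOP count)."""
    # Q*K^T and P*V each cost about 2 * B * N * S * S * H FLOPs.
    flops = 4 * batch_size * num_heads * sequence_length * sequence_length * head_size
    return flops / latency_seconds / 1e12

setting_1 = attention_tflops(16, 256, 32, 128, 0.000075)  # torch:flash row
setting_2 = attention_tflops(16, 512, 16, 64, 0.000069)   # torch:flash row
```

Under this assumption the two values come out close to the reported 229.5 and 249.2; small differences are expected because the latencies in the tables are rounded.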
2024-08-20 08:50:22 -07:00
Adrian Lizarraga
a22cc078b4
[QNN EP] Add support for GatherElements (#15966)
### Description
- Adds support for the GatherElements operator to QNN EP.
- Adds GatherElements to QDQ quantizer tool.

### Motivation and Context
Enable more models to run on QNN EP.
2024-08-19 14:33:40 -07:00
Jing Fang
64674c50de
Added a tool to quantize Gather to GatherBlockQuantized (#21697)
### Description
Added code in MatMul4BitsQuantizer to quantize Gather to
GatherBlockQuantized.

Only Gather with constant data is quantized.

Since quantized data is in int4, the quantized model is force-upgraded
to onnx opset 21.

The implementation purely relies on numpy. If optimization is needed,
C++ kernels can be added later.

Only the default RTN algorithm is supported since GatherBlockQuantized
requires zero points to have the same type as quantized data.
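
For readers unfamiliar with RTN (round-to-nearest) block quantization, here is a minimal NumPy sketch of the idea. It is not the MatMul4BitsQuantizer implementation: the function names, the default `block_size`, the unsigned asymmetric scheme, and the unpacked one-value-per-byte storage are all assumptions made for illustration.

```python
import numpy as np

def quantize_blockwise_rtn(data, block_size=16, bits=4):
    """Round-to-nearest block quantization along the last axis (sketch)."""
    rows, cols = data.shape
    assert cols % block_size == 0, "pad the last axis to a multiple of block_size"
    blocks = data.astype(np.float32).reshape(rows, cols // block_size, block_size)
    qmax = 2 ** bits - 1
    # Include 0 in each block's range so that zero is exactly representable.
    lo = np.minimum(blocks.min(axis=-1, keepdims=True), 0.0)
    hi = np.maximum(blocks.max(axis=-1, keepdims=True), 0.0)
    scale = (hi - lo) / qmax
    scale = np.where(scale == 0, 1.0, scale).astype(np.float32)
    zero_point = np.clip(np.round(-lo / scale), 0, qmax)
    q = np.clip(np.round(blocks / scale) + zero_point, 0, qmax)
    # Zero points use the same integer type as the quantized data.
    return q.astype(np.uint8), scale, zero_point.astype(np.uint8)

def dequantize_blockwise(q, scale, zero_point):
    """Invert the quantization above and restore the 2-D layout."""
    blocks = (q.astype(np.float32) - zero_point.astype(np.float32)) * scale
    return blocks.reshape(blocks.shape[0], -1)

rng = np.random.default_rng(0)
table = rng.standard_normal((4, 32)).astype(np.float32)
q, scale, zero_point = quantize_blockwise_rtn(table)
restored = dequantize_blockwise(q, scale, zero_point)
```

Each block keeps its own scale and zero point, so the reconstruction error of any element stays within one quantization step of that block.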

### Motivation and Context
Support quantizing gather to int4 in Web scenario.
2024-08-19 10:25:36 -07:00
Wanming Lin
7ae0b4ce64
[WebNN EP] Support Erf and Trilu for CPU backend (#21768) 2024-08-19 07:56:16 -07:00
jingyanwangms
c018ba43ef
[Running CI] [TensorRT EP] support TensorRT 10.3-GA (#21742)
### Description
- TensorRT 10.2.0.19 -> 10.3.0.26

### Motivation and Context
2024-08-18 13:26:41 -07:00
Tianlei Wu
d79e3c5791
Extend Attention Bias Broadcast Support (#21710)
### Description
Previously, MultiHeadAttention supported relative position bias of shape
[1, N, S, T] or [B, N, S, T], and DecoderMaskedMultiHeadAttention
supported [1, N, S, T]. This change extends the support to allow [1, N, S,
T], [B, N, S, T], [B, 1, S, T] and [1, 1, S, T] for CUDA and CPU EPs.
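
The four accepted shapes follow ordinary broadcasting over the first two dimensions. The NumPy sketch below illustrates this, reading B, N, S and T as batch size, number of heads, sequence length and total sequence length (an assumed interpretation of the letters); the sizes are arbitrary.

```python
import numpy as np

B, N, S, T = 2, 4, 3, 5
scores = np.zeros((B, N, S, T), dtype=np.float32)  # stand-in for Q*K^T scores

results = {}
for bias_shape in [(1, N, S, T), (B, N, S, T), (B, 1, S, T), (1, 1, S, T)]:
    bias = np.ones(bias_shape, dtype=np.float32)
    # Dimensions of size 1 are broadcast across batch and/or heads.
    results[bias_shape] = scores + bias
```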

- [x] Rename the input of "relative position bias" to "attention bias"
because it can also be used for other types of bias, like ALiBi
(Attention with Linear Biases) or attention mask.
- [x] Update unfused kernel to support broadcasting 2nd dimension of
attention bias.
- [x] Update efficient attention to support broadcasting 2nd dimension
of attention bias.
- [x] Update operators (MultiHeadAttention,
DecoderMaskedMultiHeadAttention, Attention, PackedAttention,
PackedMultiHeadAttention) to support broadcast attention bias on CUDA
and CPU EPs.
- [x] Update ROCm, DML and WebGPU naming to be consistent. (Note that
those EPs do not support broadcasting attention_bias for now).
- [x] Add attention bias tests for MultiHeadAttention.
- [x] Update operator documents
- [x] Update benchmark script

Other changes:
* Fix some checks in multihead-attention.ts
* Add helper functions to dump tensors given dimensions.
2024-08-16 15:40:04 -07:00
Emmanuel
a4bec3d374
Enabled Dynamo exporter (#21713)
### Description
This PR modifies the run_dynamo_export function to ensure it mirrors the
behavior of run_torchscript_merged_export rather than
run_torchscript_separate_export. Additionally, I made adjustments to the
main function to ensure that run_dynamo is correctly invoked.



### Motivation and Context
The main motivation for this change is to enable successful export of
LLaMA-2 and LLaMA-3 models using the Dynamo exporter to ONNX.
Previously, the exporter was saving two copies of the weights, which is
inefficient. The modified approach ensures that only one copy of the
weights is saved, and the model can support both scenarios. These
changes enhance the compatibility of the exporter with LLaMA models and
subsequently other models, and optimize the export process.
2024-08-16 10:45:22 -07:00
Wanming Lin
b2d603abda
[WebNN EP] Remove workaround for scalar (#21704)
Chromium now supports scalars with dims = {}, so remove the legacy
workaround for supporting scalars.
2024-08-15 22:59:51 -07:00
Dmitri Smirnov
754dba2674
Change to std::fill (#21759)
### Description
Replace `memset(0)` with `std::fill(T{})`. This would ensure that all
the types are initialized in a portable way.

### Motivation and Context
Some platforms exhibit intermittent failures with NaN results.
Follow up to: https://github.com/microsoft/onnxruntime/pull/21525

Cc: @ranjitshs
2024-08-15 16:16:54 -07:00
Guenther Schmuelling
d82f15d0e3
add Gelu opset-20 to webgpu (#21725)
https://github.com/microsoft/onnxruntime/issues/21618
2024-08-14 09:45:05 -07:00
Frank Dong
a0708a0d96
avoid redundant memory allocation for external initializers (#21682)
### Description
Avoid redundant memory allocation for external initializers. We use
mmap for external initializers later, so there is no point in allocating
memory in advance and then releasing it.



### Motivation and Context
In the current implementation, we:
1. Allocate memory (with the desired size of the current initializer) for
the initializer first:
https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/framework/session_state_utils.cc#L131
2. For an external initializer, point the initializer to the mmapped object
in memory and release the previously allocated tensor:
https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/framework/session_state_utils.cc#L89

For large models, we keep allocating and releasing memory for external
initializers, which seems unnecessary.

For the phi silica model, this change reduces transient memory
usage from 4,566MB to 2,724MB. Since this redundant memory is released
quickly when we mmap external initializers, the change has little
impact on peak memory usage.
2024-08-13 23:13:49 -07:00
Xu Xing
7172aff1cf
[js/webgpu] Fix max pool shape end with 0 (#21698)
Bug: https://github.com/microsoft/onnxruntime/issues/21386

### Description



### Motivation and Context
2024-08-13 20:59:24 -07:00
Dmitri Smirnov
c2911bbb1c
[CUDA] Special case for K==0 in CUDA MatMul (#21525)
### Description
This change addresses the case where we multiply two matrices whose
inner dimension is 0.
numpy and Eigen, which is used in our CPU EP implementation,
handle this case correctly
and output an [M, N] matrix filled with zeros.
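
The numpy behavior referred to here can be checked directly; the sizes below are arbitrary.

```python
import numpy as np

M, K, N = 3, 0, 4
a = np.ones((M, K), dtype=np.float32)  # shape (3, 0): no elements
b = np.ones((K, N), dtype=np.float32)  # shape (0, 4): no elements

# Each output element is an empty sum over K, which is 0.
c = a @ b
print(c.shape)               # (3, 4)
print(bool(np.all(c == 0)))  # True
```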

### Motivation and Context
This is required to support GenAI empty input Lora implementation.

Addresses: https://github.com/microsoft/onnxruntime/issues/21483
2024-08-13 11:27:05 -07:00
liqun Fu
3439429717
Fix neural-speed ci failure (#21694)
### Description
fix
https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1461029&view=logs&j=3565c00d-48fa-5c65-7ab9-a05e12e29ed0&t=e43fe03a-689e-5dc5-9ad5-9f116eba3e9d&l=6341



### Motivation and Context

Signed-off-by: Liqun Fu <liqfu@microsoft.com>
2024-08-13 10:48:25 -07:00
jingyanwangms
154084efaa
Security Fuzz Test Fixes (#21608)
### Description
Fix address sanitizer and memory access Bug 1, 4, 5, 7, 8 found in
security fuzz test

### Motivation and Context
2024-08-11 03:28:41 -07:00
Chi Lo
2abebb2a47
[TensorRT EP] No workspace size limit to TRT memory pool (#21643)
We saw some models fail to run due to OOM; this can be fixed by increasing
trt_max_workspace_size.
This PR removes the size limit by default (up to max device memory), which
is aligned with trtexec.
2024-08-09 17:30:51 -07:00
Caroline Zhu
eeef0c8aca
Enable exporting for inference when loading from buffer without behavior changes (#21601)
### Description
Added eval model buffer as optional field in Module so that you can
export for inference using the eval model stored as a buffer.

### Motivation and Context
- Resolves #21152 
- Previous solution (PR #21422) produced an eval model that was specific
to the EPs used to train because of unavoidable runtime optimizations
that changed the graph stored with the eval session.
2024-08-09 16:59:50 -07:00
Krishna Bindumadhavan
37be90c9c8
[Quant tool]: Improve symmetric quantization range update for Relu/Clip (#21573)
### Description
This PR improves the range calculation for input to Relu/Clip nodes for
the symmetric quantization case.

### Motivation and Context
Currently, the issue we face is that for the common scenario of conv
followed by relu in the symmetric quantization config, different scales
could be assigned for the tensors corresponding to input & output of relu.

The downside is that this may introduce noise due to multiple re-quant,
and makes it difficult to fuse conv-relu nodes for hardware accelerators
that support fused conv-relu.

Instead, it is more efficient to assign the output range of relu as the
input range of relu / output range of upstream op wherever possible.
This adjustment is currently only being done for the asymmetric
quantization case.

For the scenario where the upstream op has multiple consumers, this
assumption could be incorrect. For this case we do not adjust the
ranges.
2024-08-09 14:48:09 -07:00
Adrian Lizarraga
390f0fd8ce
[QNN Quant tool] Fix validation of per-channel overrides for models with external data (#21656)
### Description
Fixes validation of per-channel quantization overrides by not
unnecessarily loading the external weights.

### Motivation and Context
The `get_qnn_qdq_config()` explicitly loads models without external data
(i.e., `onnx.load_model(load_external_data=False)`). Afterwards,
`get_qnn_qdq_config()` calls `tensor_proto_to_array()`, which expects
that the external weights are stored in the current working directory.
If the external weights are stored in a different directory, then we get
a crash.

Loading the actual weight values is unnecessary because we only need the
weight shape. This PR removes the unnecessary call to
`tensor_proto_to_array()`.
2024-08-09 14:46:52 -07:00
Satya Kumar Jandhyala
51b2044120
[JS/WebGPU] Add Dequantizelinear operator (#21642)
### Description
Added DequantizeLinear operator for JSEP.



### Motivation and Context
2024-08-09 14:44:19 -07:00
Yifan Li
906ae77eea
[TensorRT EP] Add null_ptr check to avoid crash when running session which was failed to generate trt_engine previously (#21621)
### Description
Add a null_ptr check to avoid a crash when running a session that
previously failed to generate a trt_engine.


### Motivation and Context

Reported and verified by
https://github.com/microsoft/onnxruntime/issues/21567
2024-08-09 14:09:22 -07:00
saurabh
88788474b9
fix handling of multiple QuantizeLinear nodes (#21675)
### Description
This fix addresses the issue of handling multiple QLinear nodes as
outputs from the target node in OVEP. Previously, the stripping logic
only supported a single Q node, leading to incorrect stripping of
additional Q nodes.



### Motivation and Context
The OVEP stripping logic was limited to handling a single Q node as an
output from the target node. As a result, additional Q nodes were being
stripped, despite the stripping rules indicating they should be
retained.

With this fix, OVEP can now properly handle multiple Q nodes according
to the specified stripping rules, ensuring that the fate of each Q node
is correctly determined.

---------

Co-authored-by: sfatimar <sahar.fatima@intel.com>
2024-08-09 14:04:05 -07:00
Jing Fang
53a66f4e02
When quantize 4bit matmul, force upgrade onnx domain opset to 21 (#21693)
### Description
When quantizing MatMul to DQ + MatMul using the 4bit QDQ tool chain,
the opsets of domains were previously not changed.
Now, when quantizing MatMul to DQ + MatMul in QDQ format, the onnx domain
is force-upgraded to opset 21.

### Motivation and Context
In QDQ format, DQ with int4 and blocked quantization is used. This
requires DQ with opset >= 21.
When quantizing MatMul to DQ + MatMul, force upgrade the onnx domain to
opset 21.
2024-08-09 13:50:12 -07:00
duanshengliu
c6a73defb8
Fix wrong per-tensor quantized weight type for matmul (#21347)
### Description
Fix wrong per-tensor quantized weight type for matmul.


### Motivation and Context
Fix related bug as described in
https://github.com/microsoft/onnxruntime/issues/21346
2024-08-09 13:36:25 -07:00
Jing Fang
f30581ed2c
[CPU EP] Add block quantized Gather contrib op (#21630)
### Description
Add a gather that supports block-quantized input data.


### Motivation and Context
To support Web inference scenario with quantized vocabulary embeddings.
2024-08-09 12:15:11 -07:00
Sumit Agarwal
702b2e28e0
Fuse Pad even if Cast is present in-between (#21640)
### Description
This change enhances the existing Pad Fusion to fuse Pad even if a Cast
operator is present between Pad and Conv/MaxPool/AveragePool. It keeps
the Cast as it is.
<pre>
/*
 * Before Fusion:
 *     Pad
 *      |
 *    Cast (Optional)
 *      |
 *   Conv/MaxPool/AveragePool
 * 
 * After Fusion:
 *    Cast (Optional)
 *      |
 *   Conv/MaxPool/AveragePool
 */
</pre>


### Motivation and Context
2024-08-09 06:52:59 -07:00
Yulong Wang
f4ec85259a
[js/web] allow relative path matching (#21657)
### Description

This change allows an external data path like `a.data` to match
`./a.data`.
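
The idea of matching such paths can be illustrated with a short Python sketch. The actual change lives in the js/web code; the helper name `same_external_data_path` is hypothetical, and normalizing both sides before comparing is one possible approach, not necessarily the one used there.

```python
import posixpath

def same_external_data_path(a, b):
    """Compare two relative paths after normalizing './' and similar parts."""
    return posixpath.normpath(a) == posixpath.normpath(b)

matches = same_external_data_path("a.data", "./a.data")
```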


2024-08-09 03:13:40 -07:00
Tianlei Wu
9334d4e362
[CUDA] Fix MHA mask (#21655)
### Description
Fix a check of mask type introduced by me in a recent commit. Add tests.
2024-08-09 01:31:00 -07:00
Tianlei Wu
a46e49b439
Unblock migraphx and linux GPU training ci pipelines (#21662)
### Description
* Fix migraphx build error caused by
https://github.com/microsoft/onnxruntime/pull/21598:
Add a conditional compile on code block that depends on ROCm >= 6.2.
Note that the pipeline uses ROCm 6.0.

Unblock orttraining-linux-gpu-ci-pipeline and
orttraining-ortmodule-distributed and orttraining-amd-gpu-ci-pipeline
pipelines:
* Disable a model test in linux GPU training ci pipelines caused by
https://github.com/microsoft/onnxruntime/pull/19470:
Sometimes, cudnn frontend throws an exception that the cudnn graph does not
support a Conv node of the keras_lotus_resnet3D model on V100 GPU.
Note that the same test does not throw an exception in other GPU pipelines.
The failure might be related to cudnn 8.9 and the V100 GPU used in the
pipeline (Ampere GPUs and cuDNN 9.x do not have the issue).
The actual fix requires fallback logic, which will take time to
implement, so we temporarily disable the test in training pipelines.
* Force install torch for cuda 11.8. (The docker has torch 2.4.0 for
cuda 12.1 to build the torch extension, which is not compatible with cuda
11.8.) Note that this is a temporary workaround. A more elegant fix is to
ensure the right torch version in the docker build step, which might
require updating install_python_deps.sh and the corresponding
requirements.txt.
* Skip test_gradient_correctness_conv1d since it causes a segmentation
fault. The root cause needs more investigation (maybe due to cudnn frontend
as well).
* Skip test_aten_attention since it causes an assert failure. The root cause
needs more investigation (maybe due to torch version).
* Skip orttraining_ortmodule_distributed_tests.py since it fails with an
error that the compiler for the torch extension does not support c++17. One
possible fix is to set the following compile argument inside setup.py of
extension fused_adam: extra_compile_args['cxx'] = ['-std=c++17'].
However, due to the urgency of unblocking the pipelines, just disable
the test for now.
* Skip test_softmax_bf16_large. For some reason,
torch.cuda.is_bf16_supported() returns True on V100 with torch 2.3.1, so
the test was run in CI, but V100 does not support bf16 natively.
* Fix typo of deterministic

### Motivation and Context
2024-08-08 19:44:15 -07:00
Xiang Zhang
c93b92a43f
fix wrong check for tree ensemble regressor (#21595)
Fix missed ORT_ENFORCE check which caused heap buffer overflow because
of out of bound access.
2024-08-07 16:27:18 -07:00
Yi Zhang
621b16f478
Pin transformer and optimum version (#21650)
### Description



### Motivation and Context
To fix whisper test failure
2024-08-07 17:47:15 +08:00
duanshengliu
b95aa0563f
Improve speed in combining per-channel data (#21563)
### Description
Improve speed when combining `per-channel` data by using a single
`np.concatenate` instead of multiple `np.concatenate` calls within a for
loop.
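
The difference between the two patterns can be sketched as follows. The `per_channel` list is made-up example data, not the quantization tool's actual structures.

```python
import numpy as np

# Hypothetical per-channel results: one 1-D array per channel.
per_channel = [np.full(4, c, dtype=np.float32) for c in range(3)]

# Slow pattern: each call copies everything accumulated so far.
slow = per_channel[0]
for arr in per_channel[1:]:
    slow = np.concatenate([slow, arr])

# Faster pattern: collect the pieces first, then concatenate once.
fast = np.concatenate(per_channel)
```

Both produce the same array; the single call avoids re-copying the growing result on every iteration, which matters as the number of channels increases.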

### Motivation and Context

Fix the issue https://github.com/microsoft/onnxruntime/issues/21562

Signed-off-by: duansheng.liu <44742794+duanshengliu@users.noreply.github.com>
2024-08-06 16:23:20 -07:00
Adrian Lizarraga
0acefc7988
[QNN EP] Update QNN SDK to 2.25 (#21623)
### Description
- Update pipelines to use QNN SDK 2.25 by default
- Update ifdef condition to apply workaround for QNN LayerNorm
validation bug to QNN SDK 2.25 (as well as 2.24)



### Motivation and Context
Use the latest QNN SDK
2024-08-06 09:08:48 -07:00
liqun Fu
f6f9657fb6
Fix typos so to call correct vnni functions under vnni condition (#21625)
### Description
Fix 2 typos in mlas avx 4bit gemm implementation to call correct vnni
functions under vnni condition



### Motivation and Context
needed for 1.19.0 release

Signed-off-by: liqunfu <liqun.fu@microsoft.com>
2024-08-05 20:52:26 -07:00
Prathik Rao
134f47743e
bumps up version in main from 1.19 -> 1.20 (#21588)
Bump up version in main from 1.19.0 to 1.20.0 since the release branch
has been cut.
2024-08-05 15:46:04 -07:00
Po-Wei (Vincent)
2653226ed0
Fail tests gracefully for the minimal cuda build (#21391)
### Description
Several tests result in segfaults during the minimal cuda build.
Although test failures are expected due to the limitation of the minimal
cuda EP, failing gracefully would be much preferred.



### Motivation and Context
To reproduce:
1. Build ORT with:
```bash
./build.sh --build_shared_lib --use_full_protobuf --cuda_home /usr/local/cuda --cudnn_home /usr/lib/x86_64-linux-gnu/ --tensorrt_home /TensorRT-10.0.1.6 --parallel --skip_tests --skip_submodule_sync --allow_running_as_root --use_tensorrt --cmake_extra_defines onnxruntime_CUDA_MINIMAL=1
```
2. Run `onnxruntime_test_all`
```bash
...
[----------] 1 test from AllocationPlannerTest
[ RUN      ] AllocationPlannerTest.ReusedInputCrossDifferentStreams
Segmentation fault (core dumped)
```
2024-08-02 18:27:36 -07:00
Wanming Lin
8c641d7182
[WebNN EP] Support Dropout op (#21586)
### Description
WebNN only supports test mode, so the other inputs and attributes related
to training mode can be ignored. Use WebNN's identity op to implement the
Dropout op directly.
2024-08-02 16:25:04 -07:00
Ted Themistokleous
45b7c41ef0
[MIGraphX EP] Set External Data Path (#21598)
### Description
Add support for setting the external data path for model weight files.
Additional fixes ensure this compiles against the latest v1.19
Onnxruntime.


### Motivation and Context
Separate weights used for larger models (like stable diffusion) are the
motivation for this change set.

---------

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Artur Wojcik <artur.wojcik@amd.com>
Co-authored-by: Ted Themistokleous <tedthemistokleous@amd.com>
2024-08-02 16:19:04 -07:00