Commit graph

11529 commits

Author SHA1 Message Date
Adrian Lizarraga
514b4699b4
[QNN EP] Apply workaround for Conv validation bug when bias input is implicit (#21764)
### Description
- Adds a dummy bias of all zeros when translating a Conv without an
explicit bias input. This is a workaround for a QNN validation issue
that fails when the optional bias input is not provided.
- Corrects logic for unpacking of **non-zero int4** zero-points. Bug
does not impact models because we currently only support int4
zero-points equal to 0 (symmetric quant). But this would become an issue
in the future if/when QNN supports non-zero int4 zero-points (so good to
fix now).



### Motivation and Context
Support Conv operators without a bias input on QNN EP with the latest
QNN SDK.
2024-08-22 10:38:03 -07:00
Jian Chen
6c1a3f85a6
Do not allow clearing Android logs if the emulator is not running (#21578)
### Description
Do not allow clearing Android logs if the emulator is not running



### Motivation and Context
Previously the Clearing Android logs step stuck until the pipeline
timeout. If one of the previous steps failed.
2024-08-22 10:18:01 -07:00
Chen Feiyue
ff3e8b02c3
[VSINPU]Update vsinpu patches (#21402)
### Description
- update patches for accuracy modification && local result recording
2024-08-21 23:58:56 -07:00
Yueqing Zhang
3ff8ca29e5
[VitisAI] remove wrong error msg, required by Microsoft (#21715)
### Description
<!-- Describe your changes. -->
Remove legacy code and wrong message.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This is required by Microsoft to remove unwanted error message. This is
required for 8.15 release.

Co-authored-by: Yueqing Zhang <yueqingz@amd.com>
2024-08-21 21:10:28 -07:00
Tianlei Wu
25d7a4fa08
[CUDA] Update benchmark_mha.py to capture debug info to identify sdpa kernel (#21804)
Use debug info to identify sdpa kernel actually used, and show it in the
output of benchmark_mha.py. This updated benchmark script was used to
get the benchmark results in
https://github.com/microsoft/onnxruntime/pull/21629.
(1) Change the output format of debug info to output like SdpaKernel=*
(2) Add a step to capture stdout from onnxruntime session, and use
regular expression to parse SdpaKernel=* from the captured text.

Other minor changes:
(1) Set different default repeats during benchmark: 100 for CPU; and
10000 for CUDA.
(2) Fix PrintTensorByDims used in console dumper: if it is not enabled,
do not dump tensor.
(3) Update some comments

### Motivation and Context

Sometime, we will use fallback for a sdpa_kernel. It could confuse user
unless we can tell exact kernel is used in benchmark.
2024-08-21 17:30:16 -07:00
Tianlei Wu
44a3923ba5
run sparse attention test sequentially (#21808)
### Description

For some reason, run SparseAttention tests in parallel causes random
failure in CI pipeline. Maybe due to out of memory when too many tests
running in parallel.

This will run those tests in sequentially.
2024-08-21 17:24:58 -07:00
Jake Mathern
c0b68e77af
Fix warnings (#21809)
### Description
Minor changes to resolve some warnings in ORT

### Motivation and Context
Binskim for WindowsAI (which consumes ORT) treats warnings as errors,
and has hit these warnings.
As a security requirement, warnings like "signed/unsigned mismatch" must
be resolved.
2024-08-21 14:23:37 -07:00
Edward Chen
fb9ce18e88
Add K=0 check to MatMul<float>::Compute() specialization. (#21803)
Add K=0 check to `MatMul<float>::Compute()` specialization.
Add unit test to cover both primary template and float specialization.
2024-08-21 09:15:58 -07:00
Ted Themistokleous
0e827c27fb
[MIGraphX EP] Add support for MIGraphX Exhaustive tune flag (#46) (#21599)
### Description
<!-- Describe your changes. -->
Set the exhaustive tune flag through the MIGraphX API and make this a
Session option in Onnxruntime

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Allow users to use MIGraphX Exhaustive tuning with Onnxruntime
inferences
This goers hand in hand with save/load after a model and been compiled
and tuning has found.

---------

Co-authored-by: Ted Themistokleous <tedthemistokleous@amd.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
2024-08-21 07:32:12 -07:00
Ted Themistokleous
26a499323f
[MIGraphX EP Support] Update migx scripts (#21806)
### Description
<!-- Describe your changes. -->
No code changes to the EP only changes to the scripts whihc invoke
MIGraphX EP

- One case be explicit to set MIGraphX EP when running gpt2 testing
- The other to ensure we turn off optimizations like tensorRT and allow
MIGraphX to handle graph optimizations


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
MIGraphX has moved away from using rocBLAS and without this, some cases
used in CI shall fail as optmizations will attempt to use rocBLAS
kernels instead of MIGraphx EP directly.
2024-08-21 07:22:42 -07:00
Ted Themistokleous
ed155ad46a
[MIGraphX EP] Ensure we support all inputs for MatMulInteger and ConvInteger. (#21680)
… to int8 for now

Allow for models with biases/full input and only check for int8 support
in EP

### Description
<!-- Describe your changes. -->
Allows for all inputs for MatMulInteger and ConvInteger to be supported
for prequantized models


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fixes issues when using prequantized models that contain weight biases

---------

Co-authored-by: Ted Themistokleous <tedthemistokleous@amd.com>
2024-08-21 07:19:20 -07:00
mindest
009209e016
Fix Orttraining Linux Lazy Tensor CI Pipeline (#21652)
### Description
Fix `Orttraining Linux Lazy Tensor CI Pipeline`
- Remove unused import of `torch.onnx._internal.exporter`, whose path is
changed in newer torch (pytorch/pytorch#132429).
- Move import of `register_custom_op_symbolic` from `torch.onnx` into
local function, which causes circular import when running `import
torch.onnx` (at least in the CI environment).
2024-08-21 18:10:08 +08:00
Patrice Vignola
de6ebcbb54
[DML] Add int4 QDQ (#21592) 2024-08-20 23:44:58 -07:00
Yi Zhang
12f426c63f
update size limit check of training GPU wheel (#21762)
### Description
<!-- Describe your changes. -->



### Motivation and Context
The training wheel size limit should be 400M
2024-08-21 09:30:05 +08:00
Adrian Lizarraga
6fbb0ae81a
[TransposeOptimizer] Fix axis for QuantizeLinear inserted after DQ (per-channel) -> Unsqueeze (#21793)
### Description
- Fix computation of axis for `QuantizeLinear` inserted after the
sequence `DQ (per-channel) -> Unsqueeze`. Example:
  - Original: `DQ (axis = 0) -> Unsqueeze (axes = [0, 1, 2]) -> Op`
- After QDQ fix-up: `DQ (axis = 0) -> Unsqueeze (axes = [0, 1, 2]) -> Q
(axis = 3) -> DQ (axis = 3) -> Op`
- Before this PR, the axis for the inserted Q/DQ ops was not correctly
set to 3 (left as 0).
- Fix normalization of negative axis values for `QuantizeLinear`
inserted after the sequence `DQ (per-channel) ->Transpose`
  - Existing code added the wrong rank value to normalize the DQ axis.

### Motivation and Context
Fix errors in handling of per-channel DQ in code that fixes QDQ
NodeUnits.
2024-08-20 16:26:02 -07:00
Adrian Lizarraga
28c252c77e
[QNN EP] Fix compile error for QNN EP on Windows x64 due to missing /bigobj flag (#21795)
### Description
Compiling onnxruntime with QNN EP on Windows x86_64 results in a
compilation error:
```shell
$ onnxruntime\test\optimizer\qdq_transformer_test.cc(1,1): error C1128: num
ber of sections exceeded object file format limit: compile with /bigobj [...onnxruntime\build\Debug\onnxruntime_test_all.vcxproj]
```

This PR adds the `/bigobj` compilation flag for the
`qdq_transformer_test.cc` file.
2024-08-20 10:11:43 -07:00
Tianlei Wu
fbc3927231
[CUDA] cuDNN Flash Attention (#21629)
### Description
- [x] Add cuDNN flash attention using cudnn frontend, and enable it in
MultiHeadAttention operator.
- [x] Support attention mask.
- [x] Support attention bias.
- [x] Update tests and benchmark script.

The cuDNN SDPA is disabled by default. To enable it, need the following:
(1) Requires cuDNN 9.3 or newer version installed.
(2) Set an environment variable `ORT_ENABLE_CUDNN_FLASH_ATTENTION=1` or
set `sdpa_kernel=8` cuda provider option to enable it.
(3) Only works on devices with compute capability >= 8.0.

Note that some combinations of parameters might be rejected due to
limited support of head dimension or sequence lengths.

Future Works:
(1) FP8 and BF16 APIs.  Currently, only API for FP16 are exposed.
(2) Add API to support ragged batching (padding removed in inputs).
(3) Support other input formats (like QKV_BS3NH).
(4) Currently, q are converted to BSNH, k/v are converted to either BSNH
or BNSH format. May do some experiment to see whether converting q to
BNSH could be better in some case.

### Example Benchmark Results on H100

The following tests are on FP16 MultiHeadAttention operator without
attention mask and attention bias.

#### Test Setting 1
batch_size | sequence_length | past_sequence_length | num_heads |
head_size
-- | -- | -- | -- | --
16 | 256 | 0 | 32 | 128

format | average_latency | tflops | kernel
-- | -- | -- | --
Q,K,V (BNSH) | 0.000075 | 229.5 | torch:flash
Q,K,V (BNSH) | 0.000119 | 144.8 | torch:efficient
Q,K,V (BNSH) | 0.000224 | 76.5 | torch:math
Q,K,V (BSNH) | 0.000075 | 227.8 | ort:cudnn
Q,K,V (BSNH) | 0.000094 | 182.8 | ort:flash
Q,K,V (BSNH) | 0.000138 | 124.7 | ort:efficient
Q,K,V (BSNH) | 0.000438 | 39.3 | ort:math
Q,KV | 0.000129 | 133.0 | ort:cudnn
Q,KV | 0.000151 | 114.1 | ort:flash
Q,KV | 0.000194 | 88.5 | ort:efficient
QKV | 0.000154 | 111.8 | ort:cudnn
QKV | 0.000175 | 98.0 | ort:flash
QKV | 0.000217 | 79.0 | ort:efficient

#### Test Setting 2

batch_size | sequence_length | past_sequence_length | num_heads |
head_size
-- | -- | -- | -- | --
16 | 512 | 0 | 16 | 64

format | average_latency | tflops | kernel
-- | -- | -- | --
Q,K,V (BNSH) | 0.000069 | 249.2 | torch:flash
Q,K,V (BNSH) | 0.000141 | 121.7 | torch:efficient
Q,K,V (BNSH) | 0.000294 | 58.5 | torch:math
Q,K,V (BSNH) | 0.000077 | 221.7 | ort:cudnn
Q,K,V (BSNH)  | 0.000087 | 196.6 | ort:flash
Q,K,V (BSNH)  | 0.000163 | 105.6 | ort:efficient
Q,K,V (BSNH)  | 0.000651 | 26.4 | ort:math
Q,KV | 0.000103 | 167.1 | ort:cudnn
Q,KV | 0.000117 | 146.3 | ort:flash
Q,KV | 0.000192 | 89.6 | ort:efficient
QKV | 0.000113 | 151.5 | ort:cudnn
QKV | 0.000128 | 134.7 | ort:flash
QKV | 0.000201 | 85.3 | ort:efficient
2024-08-20 08:50:22 -07:00
Yi Zhang
9f7e19cedd
[Fix] Make python API doc generation in Microsoft-hosted Agent (#21766)
### Description
<!-- Describe your changes. -->



### Motivation and Context
1. Python API doc needs to be merged from a fork, but 1ES self-hosted
pool is only for one github repo.
2. ubuntu-latest will be install numpy above 2.0 by default, and current
python API doc generation doesn't support it.
So I pin numpy < 2.0.0

---------
2024-08-20 23:32:38 +08:00
Satya Kumar Jandhyala
1fb2e71ddc
[JS/WebGPU] Avoid producing presentKey/presentValue outputs if pastKey/pastValue … (#21782)
Avoid producing presentKey/presentValue outputs if pastKey/pastValue
don't exists.

### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-08-19 18:02:19 -07:00
Adrian Lizarraga
a22cc078b4
[QNN EP] Add support for GatherElements (#15966)
### Description
- Adds support for the GatherElements operator to QNN EP.
- Adds GatherElements to QDQ quantizer tool.

### Motivation and Context
Enable more models to run on QNN EP.
2024-08-19 14:33:40 -07:00
Tianlei Wu
7c93d5ded1
Upgrade pytorch_lightning to 2.3.3 to fix orttraining_amd_gpu_ci_pipeline (#21789)
### Description
Upgrade pytorch_lightning to fix orttraining_amd_gpu_ci_pipeline
```
#24 1.838 WARNING: Ignoring version 1.6.0 of pytorch_lightning since it has invalid metadata:
#24 1.838 Requested pytorch_lightning==1.6.0 from cee67f4849/pytorch_lightning-1.6.0-py3-none-any.whl has invalid metadata: .* suffix can only be used with `==` or `!=` operators
#24 1.838     torch (>=1.8.*)
#24 1.838            ~~~~~~^
#24 1.838 Please use pip<24.1 if you need to use this version.
#24 1.838 ERROR: Ignored the following versions that require a different python version: 1.14.0 Requires-Python >=3.10; 1.14.0rc1 Requires-Python >=3.10; 1.14.0rc2 Requires-Python >=3.10; 2.1.0 Requires-Python >=3.10; 2.1.0rc1 Requires-Python >=3.10
#24 1.838 ERROR: Could not find a version that satisfies the requirement pytorch_lightning==1.6.0 (from versions: 0.0.2, 0.2, 0.2.2, 0.2.3, 0.2.4, 0.2.4.1, 0.2.5, 0.2.5.1, 0.2.5.2, 0.2.6, 0.3, 0.3.1, 0.3.2, 0.3.3, 0.3.4, 0.3.4.1, 0.3.5, 0.3.6, 0.3.6.1, 0.3.6.3, 0.3.6.4, 0.3.6.5, 0.3.6.6, 0.3.6.7, 0.3.6.8, 0.3.6.9, 0.4.0, 0.4.1, 0.4.2, 0.4.3, 0.4.4, 0.4.5, 0.4.6, 0.4.7, 0.4.8, 0.4.9, 0.5.0, 0.5.1, 0.5.1.2, 0.5.1.3, 0.5.2, 0.5.2.1, 0.5.3, 0.5.3.1, 0.5.3.2, 0.5.3.3, 0.6.0, 0.7.1, 0.7.3, 0.7.5, 0.7.6, 0.8.1, 0.8.3, 0.8.4, 0.8.5, 0.9.0, 0.10.0, 1.0.0, 1.0.1, 1.0.2, 1.0.3, 1.0.4, 1.0.5, 1.0.6, 1.0.7, 1.0.8, 1.1.0, 1.1.1, 1.1.2, 1.1.3, 1.1.4, 1.1.5, 1.1.6, 1.1.7, 1.1.8, 1.2.0rc0, 1.2.0rc1, 1.2.0rc2, 1.2.0, 1.2.1, 1.2.2, 1.2.3, 1.2.4, 1.2.5, 1.2.6, 1.2.7, 1.2.8, 1.2.9, 1.2.10, 1.3.0rc1, 1.3.0rc2, 1.3.0rc3, 1.3.0, 1.3.1, 1.3.2, 1.3.3, 1.3.4, 1.3.5, 1.3.6, 1.3.7, 1.3.7.post0, 1.3.8, 1.4.0rc0, 1.4.0rc1, 1.4.0rc2, 1.4.0, 1.4.1, 1.4.2, 1.4.3, 1.4.4, 1.4.5, 1.4.6, 1.4.7, 1.4.8, 1.4.9, 1.5.0rc0, 1.5.0rc1, 1.5.0, 1.5.1, 1.5.2, 1.5.3, 1.5.4, 1.5.5, 1.5.6, 1.5.7, 1.5.8, 1.5.9, 1.5.10, 1.6.0rc0, 1.6.0rc1, 1.6.0, 1.6.1, 1.6.2, 1.6.3, 1.6.4, 1.6.5, 1.7.0rc0, 1.7.0rc1, 1.7.0, 1.7.1, 1.7.2, 1.7.3, 1.7.4, 1.7.5, 1.7.6, 1.7.7, 1.8.0rc0, 1.8.0rc1, 1.8.0rc2, 1.8.0, 1.8.0.post1, 1.8.1, 1.8.2, 1.8.3, 1.8.3.post0, 1.8.3.post1, 1.8.3.post2, 1.8.4, 1.8.4.post0, 1.8.5, 1.8.5.post0, 1.8.6, 1.9.0rc0, 1.9.0, 1.9.1, 1.9.2, 1.9.3, 1.9.4, 1.9.5, 2.0.0rc0, 2.0.0, 2.0.1, 2.0.1.post0, 2.0.2, 2.0.3, 2.0.4, 2.0.5, 2.0.6, 2.0.7, 2.0.8, 2.0.9, 2.0.9.post0, 2.1.0rc0, 2.1.0rc1, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.1.4, 2.2.0rc0, 2.2.0, 2.2.0.post0, 2.2.1, 2.2.2, 2.2.3, 2.2.4, 2.2.5, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0)
#24 1.838 ERROR: No matching distribution found for pytorch_lightning==1.6.0
```
2024-08-19 12:58:22 -07:00
Jing Fang
64674c50de
Added a tool to quantize Gather to GatherBlockQuantized (#21697)
### Description
Added code in MatMul4BitsQuantizer to quantize Gather to
GatherBlockQuantized.

Only Gather with constant data is quantized.

Since quantized data is in int4, the quantized model will force upgrade
to onnx opset 21.

The implementation purely relies on numpy. If optimization is needed,
C++ kernels can be added later.

Only support default RTN algorithm since GatherBlockQuantized require
zero points to have the same type as quantized data.

### Motivation and Context
Support quantizing gather to int4 in Web scenario.
2024-08-19 10:25:36 -07:00
Wanming Lin
7ae0b4ce64
[WebNN EP] Support Erf and Trilu for CPU backend (#21768) 2024-08-19 07:56:16 -07:00
xhcao
417aa00406
[js/webgpu] fix conv1d error (#21585)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-08-18 15:45:13 -07:00
mingyueliuh
d1d40fbafd
[VitisAI][Fix] ShapeInferContext GetAttrxxxs support empty value (#21471)
### Description
Bug fix for the ShapeInferContext GetAttrxxxs APIs. Node attribute maybe
is empty.



### Motivation and Context
If the attr value is empty, the expected result through the interface is
empty , but currently, it returns a meaningless {0}.

---------

Co-authored-by: mingyue <mingyue@amd.com>
Co-authored-by: Liu Minyue <mingyue@xilinx.com>
2024-08-18 13:51:25 -07:00
jingyanwangms
c018ba43ef
[Running CI] [TensorRT EP] support TensorRT 10.3-GA (#21742)
### Description
- TensorRT 10.2.0.19 -> 10.3.0.26

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-08-18 13:26:41 -07:00
Jiajia Qin
c4ade796d6
[js/webgpu] Fix attention shader recompilation issue (#21770)
### Description
<!-- Describe your changes. -->

This PR fixes the `AttentionProbsSoftmax` recompilation issue when
executing the phi3 model. With this fix, it will further improve the
phi3 performance.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-08-17 17:15:15 -07:00
Yang Gu
49fc168eed
[js/webgpu] Handle negative axis in op Split (#21771)
This is to fix issue #21703, where the axis is a negative value in the
model. According to the spec
(https://onnx.ai/onnx/operators/onnx__Split.html), negative axis means
counting dimensions from the back.
2024-08-17 16:41:23 -07:00
Tianlei Wu
d79e3c5791
Extend Attention Bias Broadcast Support (#21710)
### Description
Previously, MultiHeadAttention supports relative position bias of shape
[1, N, S, T] or [B, N, S, T], and DecoderMaskedMultiHeadAttention
supports [1, N, S, T]. This will extend the support to allow [1, N, S,
T], [B, N, S, T], [B, 1, S, T] and [1, 1, S, T] for CUDA and CPU EPs.

- [x] Rename the input of "relative position bias" to "attention bias"
because it can also be used for other types of bias, like ALiBi
(Attention with Linear Biases) or attention mask.
- [x] Update unfused kernel to support broadcasting 2nd dimension of
attention bias.
- [x] Update efficient attention to support broadcasting 2nd dimension
of attention bias.
- [x] Update operators (MultiHeadAttention,
DecoderMaskedMultiHeadAttention, Attention, PackedAttention,
PackedMultiHeadAttention) to support broadcast attention bias on CUDA
and CPU EPs.
- [x] Update ROCm, DML and WebGPU naming to be consistent. (Note that
those EPs do not support broadcasting attention_bias for now).
- [x] Add attention bias tests for MultiHeadAttention.
- [x] Update operator documents
- [x] Update benchmark script

Other changes:
* Fix some checks in multihead-attention.ts
* Add helper functions to dump tensors given dimensions.
2024-08-16 15:40:04 -07:00
Edward Chen
63e8849992
build_aar_package.py - Check that executable is present before trying to copy it. (#21730)
Check that executable is present before trying to copy it.

Accommodate builds where we skip building the test executables.
2024-08-16 11:21:09 -07:00
Emmanuel
a4bec3d374
Enabled Dynamo exporter (#21713)
### Description
This PR modifies the run_dynamo_export function to ensure it mirrors the
behavior of run_torchscript_merged_export rather than
run_torchscript_separate_export. Additionally, I made adjustments to the
main function to ensure that run_dynamo is correctly invoked.



### Motivation and Context
The main motivation for this change is to enable successful export of
LLaMA-2 and LLaMA-3 models using the Dynamo exporter to ONNX.
Previously, the exporter was saving two copies of the weights, which is
inefficient. The modified approach ensures that only one copy of the
weights is saved, and the model can support both scenarios. These
changes enhance the compatibility of the exporter with LLaMA models and
subsequently other models and optimize the export process
2024-08-16 10:45:22 -07:00
Wanming Lin
b2d603abda
[WebNN EP] Remove workaround for scalar (#21704)
Currently Chromium has supported scalar with dims = {}, remove legacy
workaround for supporting scalar.
2024-08-15 22:59:51 -07:00
Scott McKay
c97cc5c1b0
Put all external project targets under the 'External' folder in VS (#21765)
### Description
<!-- Describe your changes. -->
Handle targets in subdirectories for external projects. All targets will
now go in a per-project folder under 'External'

e.g. gmock and gtest now get handled correctly and are under
External/googletest vs. existing setup where they ended up as top-level
projects.


![image](https://github.com/user-attachments/assets/99ec259c-47cd-44f3-954d-58569c941cc2)

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Improve developer experience.
2024-08-16 15:51:50 +10:00
Yulong Wang
ef2ccc477b
[js/web] Add support for int4/uint4 tensor (#21720)
### Description
Add support for int4/uint4 tensor.
2024-08-15 21:32:10 -07:00
Yulong Wang
d4d0bea1fb
[js] update docs for new code formatter (#21743)
### Description

Update README.md for code formatter change (#21728)
2024-08-15 20:17:08 -07:00
Yang Gu
f8efc086ce
[js/webgpu] Support Chrome Canary in unit tests (#21750)
Chrome Canary is helpful to test some new features. With this PR, we can
enable Chrome Canary in unit tests with command like "npm test -- op
abs.jsonc -b=webgpu -e=chromecanary".
2024-08-15 19:27:54 -07:00
Dmitri Smirnov
754dba2674
Change to std::fill (#21759)
### Description
Replace `memset(0)` with `std::fill(T{})`. This would ensure that all
the types are initialized in a portable way.

### Motivation and Context
Some platforms exhibit intermittent failures with NaN results.
Follow up to: https://github.com/microsoft/onnxruntime/pull/21525

Cc: @ranjitshs
2024-08-15 16:16:54 -07:00
Tianlei Wu
b9f3a5d5b6
Exclude cudnn 8 DLLs from manylinux package (#21746)
### Description
It is a follow up of https://github.com/microsoft/onnxruntime/pull/21738
to exclude cudnn 8 DLLs since some python packaging pipelines (like
training package) are still using cudnn 8.9 and cuda 11.8.

### Motivation and Context

Size of python package for training pipeline increases a lot due to some
DLLs are added to package:

![image](https://github.com/user-attachments/assets/643a808e-760b-4382-ba55-57d7d722ee9a)
2024-08-15 07:48:42 -07:00
Yi Zhang
8a59b4dc4b
Move Python Training CUDA 12.2 pipeline to another pool. (#21745)
### Description
<!-- Describe your changes. -->



### Motivation and Context
[Python Training CUDA 12.2
pipeline](https://dev.azure.com/aiinfra/Lotus/_build?definitionId=1308&_a=summary)
has been always cancelled by remote provider since Aug 2nd.
But other workflows with the same pool haven't this issue.
 It looks like there're some weird things in Azure devops.
It works by using another pool. In fact, the SKU is smaller than the
old.

### Verification
https://dev.azure.com/aiinfra/Lotus/_build?definitionId=1308&_a=summary
2024-08-15 17:31:56 +08:00
Tianlei Wu
212bcc9967
Exclude cuDNN 9 and CUDA 12 DLLs from manylinux wheel (#21738)
### Description
Exclude cuDNN 9 and CUDA 12 DLLs from manylinux wheel to reduce python
package size.

### Motivation and Context

The 1.20.0 ort-nightly-gpu python wheels on linux are suddenly > 800 MB
in size. The wheels built on 1.19 release branch have a size of around
220 MB.

The size change is caused by
https://github.com/microsoft/onnxruntime/pull/19470.
2024-08-15 00:03:10 -07:00
Yulong Wang
abdc31de40
[js] change default formatter for JavaScript/TypeScript from clang-format to Prettier (#21728)
### Description

See
454996d496
for manual changes (excluded auto-generated formatting changes)

### Why

Because the toolsets for old clang-format is out-of-date. This reduces
the development efficiency.

- The NPM package `clang-format` is already in maintenance mode. not
updated since 2 years ago.
- The VSCode extension for clang-format is not maintained for a while,
and a recent Node.js security update made it not working at all in
Windows.

No one in community seems interested in fixing those.

Choose Prettier as it is the most popular TS/JS formatter.

### How to merge

It's easy to break the build:
- Be careful of any new commits on main not included in this PR.
- Be careful that after this PR is merged, other PRs that already passed
CI can merge.

So, make sure there is no new commits before merging this one, and
invalidate js PRs that already passed CI, force them to merge to latest.
2024-08-14 16:51:22 -07:00
Satya Kumar Jandhyala
6d8de1f7b8
Upgrade emsdk from 3.1.59 to 3.1.62 (#21421)
### Description
Upgrade EM SDK to 3.1.62.



### Motivation and Context
The changes are required to clear wasm64 errors.
2024-08-14 12:38:52 -07:00
Guenther Schmuelling
d82f15d0e3
add Gelu opset-20 to webgpu (#21725)
https://github.com/microsoft/onnxruntime/issues/21618
2024-08-14 09:45:05 -07:00
Frank Dong
a0708a0d96
avoid redundant memory allocation for external initializers (#21682)
### Description
avoid redundant memory allocation for external initializers, we will use
mmap for external initializers later so no point to allocate memory in
advance then release them later.



### Motivation and Context
In current implementation, we will:
1. Allocate memory (with desired size of current initializer) for
initializer first:
[https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/framework/session_state_utils.cc#L131](https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fmicrosoft%2Fonnxruntime%2Fblob%2Fmain%2Fonnxruntime%2Fcore%2Fframework%2Fsession_state_utils.cc%23L131&data=05%7C02%7Cfrdong%40microsoft.com%7C1e126797c95149aa217d08dcb781cc60%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C638587015340041125%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=6fN57MUsergrCX%2BBS7jztWBRmc8nx19EVvn0lUJ2Gtk%3D&reserved=0)
2. For external initializer, we will point initializer to mmaped object
in memory and release previously allocated tensor:
[https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/framework/session_state_utils.cc#L89](https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fmicrosoft%2Fonnxruntime%2Fblob%2Fmain%2Fonnxruntime%2Fcore%2Fframework%2Fsession_state_utils.cc%23L89&data=05%7C02%7Cfrdong%40microsoft.com%7C1e126797c95149aa217d08dcb781cc60%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C638587015340054491%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=yBtXLc%2Bhpx3IT1%2FX0664foqQ5X5O%2Fy5XNhj4Oed%2BAt4%3D&reserved=0)

For large models, we are keep allocating and release memory for external
initializers which seems unnecessary.

For phi silica model, with this change we can reduce transient memory
usage from 4,566MB to 2,724MB. Since these redundant memory is released
quickly when we mmap external initializers so this change has no much
impact on peak memory usage.
2024-08-13 23:13:49 -07:00
Xu Xing
7172aff1cf
[js/webgpu] Fix max pool shape end with 0 (#21698)
Bug: https://github.com/microsoft/onnxruntime/issues/21386

### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-08-13 20:59:24 -07:00
Prathik Rao
e32e3575d8
pin pytorch lightning version for training CI (#21731)
### Description
<!-- Describe your changes. -->

Pins pytorch-lightning package to version 2.3.3 since version >=2.4.0
requires torch > 2.1.0 which is not compatible with cu118.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

ORT 1.19 Release Preparation
2024-08-13 20:04:56 -07:00
Yi Zhang
b92908e197
[Fix] Python API doc generation (#21717)
### Description
<!-- Describe your changes. -->



### Motivation and Context
Make Python API doc generation workflow work.

### Verification Run
https://github.com/microsoft/onnxruntime/actions/runs/10364762858
2024-08-14 08:48:29 +08:00
Dmitri Smirnov
c2911bbb1c
[CUDA] Special case for K==0 in CUDA MatMul (#21525)
### Description
This change addresses a case where we multiply two matrices, and their
inner dimension is 0.
numpy and Eigen which is being used in our CPU EP implementation
correctly handle this case
and output a [M, N] matrix filled with zeros.

### Motivation and Context
This is required to support GenAI empty input Lora implementation.

Addresses: https://github.com/microsoft/onnxruntime/issues/21483
2024-08-13 11:27:05 -07:00
Scott McKay
6af5394bd7
Replace usage of jcenter in React Native build.gradle files (#21714)
### Description
<!-- Describe your changes. -->
Replace jcenter. It's deprecated and not responding.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fix CIs
2024-08-13 11:10:51 -07:00
liqun Fu
3439429717
Fix neural-speed ci failure (#21694)
### Description
fix
https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1461029&view=logs&j=3565c00d-48fa-5c65-7ab9-a05e12e29ed0&t=e43fe03a-689e-5dc5-9ad5-9f116eba3e9d&l=6341



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Signed-off-by: Liqun Fu <liqfu@microsoft.com>
2024-08-13 10:48:25 -07:00