Commit graph

10624 commits

Author SHA1 Message Date
pengwa
026e3178ae
Improve memory matrix for ORTModule (#19620)
### Memory matrix for ORTModule

Collect  parameter/gradient/buffers sizes also. 
Exposed as a function, can be used externally for debugging purpose. 


```
2024-02-27 07:18:55,283 orttraining.rank-0 [INFO] - rank-0 step 1 memory (MiB) | phase: pre_forward | allocated: 5331 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 219 | max inactive: 816 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:55,322 orttraining.rank-0 [INFO] - rank-0 step 1 memory (MiB) | phase: post_forward | allocated: 8162 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 400 | max inactive: 816 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:55,358 orttraining.rank-0 [INFO] - rank-0 step 1 memory (MiB) | phase: pre_backward | allocated: 8926 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 400 | max inactive: 816 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:55,438 orttraining.rank-0 [INFO] - rank-0 step 1 memory (MiB) | phase: post_backward | allocated: 6098 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 218 | max inactive: 831 | param: 5314 | grad: 12 | buffer: 8
  0%|▏                                                                                                                                                                                                                                              | 2/3200 [01:27<32:05:11, 36.12s/it]2024-02-27 07:18:55,498 orttraining.rank-0 [INFO] - rank-0 step 2 memory (MiB) | phase: pre_forward | allocated: 5331 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 219 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:55,537 orttraining.rank-0 [INFO] - rank-0 step 2 memory (MiB) | phase: post_forward | allocated: 8162 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 400 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:55,576 orttraining.rank-0 [INFO] - rank-0 step 2 memory (MiB) | phase: pre_backward | allocated: 8926 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 400 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:55,657 orttraining.rank-0 [INFO] - rank-0 step 2 memory (MiB) | phase: post_backward | allocated: 6098 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 218 | max inactive: 831 | param: 5314 | grad: 12 | buffer: 8
  0%|▏                                                                                                                                                                                                                                              | 3/3200 [01:27<17:30:57, 19.72s/it]2024-02-27 07:18:55,711 orttraining.rank-0 [INFO] - rank-0 step 3 memory (MiB) | phase: pre_forward | allocated: 5331 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 219 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:55,750 orttraining.rank-0 [INFO] - rank-0 step 3 memory (MiB) | phase: post_forward | allocated: 8162 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 400 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:55,786 orttraining.rank-0 [INFO] - rank-0 step 3 memory (MiB) | phase: pre_backward | allocated: 8926 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 400 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:55,867 orttraining.rank-0 [INFO] - rank-0 step 3 memory (MiB) | phase: post_backward | allocated: 6098 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 218 | max inactive: 831 | param: 5314 | grad: 12 | buffer: 8
[2024-02-27 07:18:55,886] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1
  0%|▎                                                                                                                                                                                                                                              | 4/3200 [01:28<10:39:52, 12.01s/it]2024-02-27 07:18:55,902 orttraining.rank-0 [INFO] - rank-0 step 4 memory (MiB) | phase: pre_forward | allocated: 5331 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 219 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:55,944 orttraining.rank-0 [INFO] - rank-0 step 4 memory (MiB) | phase: post_forward | allocated: 8162 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 400 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:55,979 orttraining.rank-0 [INFO] - rank-0 step 4 memory (MiB) | phase: pre_backward | allocated: 8926 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 400 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:56,060 orttraining.rank-0 [INFO] - rank-0 step 4 memory (MiB) | phase: post_backward | allocated: 6098 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 218 | max inactive: 831 | param: 5314 | grad: 12 | buffer: 8
  0%|▍                                                                                                                                                                                                                                               | 5/3200 [01:28<6:53:04,  7.76s/it]2024-02-27 07:18:56,115 orttraining.rank-0 [INFO] - rank-0 step 5 memory (MiB) | phase: pre_forward | allocated: 5331 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 219 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:56,154 orttraining.rank-0 [INFO] - rank-0 step 5 memory (MiB) | phase: post_forward | allocated: 8162 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 400 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:56,190 orttraining.rank-0 [INFO] - rank-0 step 5 memory (MiB) | phase: pre_backward | allocated: 8926 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 400 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:56,270 orttraining.rank-0 [INFO] - rank-0 step 5 memory (MiB) | phase: post_backward | allocated: 6098 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 218 | max inactive: 831 | param: 5314 | grad: 12 | buffer: 8
  0%|▍                                                                                                                                                                                                                                               | 6/3200 [01:28<4:36:19,  5.19s/it]2024-02-27 07:18:56,323 orttraining.rank-0 [INFO] - rank-0 step 6 memory (MiB) | phase: pre_forward | allocated: 5331 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 219 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:56,365 orttraining.rank-0 [INFO] - rank-0 step 6 memory (MiB) | phase: post_forward | allocated: 8162 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 400 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:56,398 orttraining.rank-0 [INFO] - rank-0 step 6 memory (MiB) | phase: pre_backward | allocated: 8926 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 400 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:56,478 orttraining.rank-0 [INFO] - rank-0 step 6 memory (MiB) | phase: post_backward | allocated: 6098 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 218 | max inactive: 831 | param: 5314 | grad: 12 | buffer: 8
  0%|▌                                                                                                                                                                                                                                               | 7/3200 [01:28<3:09:33,  3.56s/it]2024-02-27 07:18:56,533 orttraining.rank-0 [INFO] - rank-0 step 7 memory (MiB) | phase: pre_forward | allocated: 5331 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 219 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:56,572 orttraining.rank-0 [INFO] - rank-0 step 7 memory (MiB) | phase: post_forward | allocated: 8162 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 400 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:56,608 orttraining.rank-0 [INFO] - rank-0 step 7 memory (MiB) | phase: pre_backward | allocated: 8926 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 400 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:56,727 orttraining.rank-0 [INFO] - rank-0 step 7 memory (MiB) | phase: post_backward | allocated: 6098 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 218 | max inactive: 831 | param: 5314 | grad: 12 | buffer: 8
  0%|▌                                                                                                                                                                                                                                               | 8/3200 [01:28<2:13:48,  2.52s/it]2024-02-27 07:18:56,806 orttraining.rank-0 [INFO] - rank-0 step 8 memory (MiB) | phase: pre_forward | allocated: 5331 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 219 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:56,846 orttraining.rank-0 [INFO] - rank-0 step 8 memory (MiB) | phase: post_forward | allocated: 8162 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 400 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:56,882 orttraining.rank-0 [INFO] - rank-0 step 8 memory (MiB) | phase: pre_backward | allocated: 8926 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 400 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:56,962 orttraining.rank-0 [INFO] - rank-0 step 8 memory (MiB) | phase: post_backward | allocated: 6098 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 218 | max inactive: 831 | param: 5314 | grad: 12 | buffer: 8
  0%|▋                                                                                                                                                                                                                                               | 9/3200 [01:29<1:36:03,  1.81s/it]2024-02-27 07:18:57,053 orttraining.rank-0 [INFO] - rank-0 step 9 memory (MiB) | phase: pre_forward | allocated: 5331 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 219 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:57,094 orttraining.rank-0 [INFO] - rank-0 step 9 memory (MiB) | phase: post_forward | allocated: 8162 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 400 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8

```
2024-02-28 15:57:05 +08:00
Yi Zhang
f95c0773a1
Add share memory Flag in docker (#19672)
### Description



### Motivation and Context
Ref:
https://docs.nvidia.com/deeplearning/frameworks/user-guide/index.html#setincshmem

Co-authored-by: Your Name <your@email.com>
2024-02-28 10:40:40 +08:00
Maximilian Müller
c20ced4132
Use CMake's find package for CUDA libs (#19673)
### Description
Answers issue #19640 
More details are in the issue, basically I am changing all the include
directory and link directory usage to CMake's `CUDA::*` targets
2024-02-27 11:26:48 -08:00
Yulong Wang
3cb81cdde2
[js/common] move 'env.wasm.trace' to 'env.trace' (#19617)
### Description

Try to move 'env.wasm.trace' to 'env.trace' to make it less confusing,
because it also works in webgpu. Marked 'env.wasm.trace' as deprecated.
2024-02-27 11:07:15 -08:00
zesongw
2e4d1b8f1b
[WebNN EP] Add support for Op MatMul of WebNN CPU backend (#19413)
Enable MatMul support for WebNN CPU backend to support more models.
2024-02-27 10:01:12 -08:00
Scott McKay
1c468a03b9
Improve Nuget-CUDA-Packaging-Pipeline (#19668)
### Description
<!-- Describe your changes. -->
* Publish the artifacts as late as possible
* once published the artifacts are immutable, and any retry will fail if
they exist
  * if any step fails after publishing the stage cannot be retried
* use powershell to cleanup
  * DeleteFiles is taking >30 mins and causing the stage to timeout
  * powershell took < 1s

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Make pipeline more robust
2024-02-27 09:27:43 -08:00
Scott McKay
580ee20dfc
Tweak Windows build parallelization settings (#19664)
### Description
<!-- Describe your changes. -->
Use UseMultiToolTask and limit the number of cl.exe instances running. 

MultiToolTask info:
https://devblogs.microsoft.com/cppblog/improved-parallelism-in-msbuild/

Info on why limiting CL_MPCount can help:
https://github.com/Microsoft/checkedc-clang/wiki/Parallel-builds-of-clang-on-Windows

The current CIs have 4 cores (both physical and logical). Hardcoded the
GPU build in win-ci.yml to use CL_MPCount of 2 as that seems to work
fine. Can adjust if needed to base it on the actual number of cores or
to use build.py to build.

Caveat: I've run about 16 builds and haven't seen a slow build yet, but
as the root cause of the slow builds isn't really known this isn't
guaranteed to be a fix.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Try and prevent super slow GPU builds by reducing number of tasks
potentially running in parallel.
2024-02-27 08:56:16 -08:00
Yi Zhang
3b46ab6439
Re-add testing removed by mistake. (#19647) 2024-02-27 08:46:29 -08:00
Adrian Lizarraga
4838cb6b3e
[QNN Quantization] Ensure fused nodes have names (#19650)
### Description
- Updates the `qnn_preprocess_model()` method to set a name for any new
nodes added to the graph (due to fusion).
- Updates the `qnn_preprocess_model()` method to set a name for any
unnamed nodes that previously existed in the original graph.
- Adds unit tests for fusions (previously missing)
  - Checks that fused node names exist and are unique
  - Checks that fused graph is equivalent to original graph


### Motivation and Context
Nodes are not strictly required to have names. However, a
planned/upcoming feature to support mixed-precision (integer) quantized
models needs nodes to have names.
2024-02-27 02:27:35 -08:00
cloudhan
1e69b61238
Make version string detection more robust (#19615)
`/opt/rocm/.info/version-dev` is only available if the `rocm-dev`
metapackage is installed. This will bring a lot of unused packages which
are not needed by the users, they may opt for fine grained control.
Fallback to `rocm_version.h` in case `rocm-dev` is not installed.
2024-02-27 16:06:06 +08:00
duanshengliu
9e19684944
Fix the TypeError issue in quantize.py (#19459)
### Description
<!-- Describe your changes. -->
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fix related bug as described in
https://github.com/microsoft/onnxruntime/issues/19430
2024-02-26 20:56:32 -08:00
Rachel Guo
5bb58a10e7
Enable the most verbose logging level in detox E2E React Native CI (#19659)
### Description
<!-- Describe your changes. -->

The RN CI has intermittent failure error with "app seems to idle".
enable the most verbose logging level (and can add steps to dump
device.log from the detox folder/artifacts if necessary) to at least get
more information.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
2024-02-26 20:00:14 -08:00
kailums
6f566562ce
support user_compute_stream for rocm ep (#19619)
### Description
<!-- Describe your changes. -->
According to the pr #19229 supporting cuda EP use external compute
stream, we add support for rocm EP.

And when we testing this feature with torch, we found torch use stream 0
for the default stream, and `torch.cuda.current_stream()` returns `0`
for current stream, but ort treat `0` or `nullptr` as invalid, and reset
has_user_compute_stream to false. 

Will remove has_user_compute_stream option in the future.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
The motivation for this pr is that we want to use torch.cuda.graph to
capture ort running kernel, which requires torch and ort are running in
the same stream, so we use this API to set ort's working stream.
2024-02-27 11:31:03 +08:00
Scott McKay
8a71b65765
Remove skipping of Reshape from NNAPI EP (#19618)
### Description
<!-- Describe your changes. -->
A number of Qualcomm Snapdragon chipsets do not produce correct output
if we skip the Reshape, which ironically was a performance optimization
for Snapdragon chips.

Perf testing showed that Squeeze also seems to execute on CPU so there's
no benefit to using that as an alternative where possible e.g.
Global*Pool -> Reshape to 2D -> Gemm could be potentially be replaced
with Global*Pool -> Squeeze dims 2 and 3 -> Gemm if that offered better
performance.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
#19518
2024-02-27 11:35:27 +10:00
Changming Sun
18c8fab1ae
Fix a bug in build.py (#19652)
### Description
Fix a bug in build.py that accidentally disabled C# tests for most
builds when "--build_nuget" is specified.

### Motivation and Context
The bug was introduced in PR #8892 .
2024-02-26 15:58:09 -08:00
Scott McKay
8bd943be39
Retry flaky XCode iOS UI tests if we get a known error (#19639)
### Description
<!-- Describe your changes. -->
Xcode UI tests seem to be flaky:
https://github.com/orgs/community/discussions/68807
Add a couple of retries if we get a "Timed out while loading
Accessibility." error which is transient.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-02-27 09:31:32 +10:00
Sumit Agarwal
a9568935a5
[DML EP] Enable DML Graph Serialization (#19505)
### Description
This PR adds a feature to serialize all DML EP partitions into DML
currency individually for a given a model. This feature can be
dynamically turned on by using DML EP option
`ep.dml.enable_graph_serialization`.


### Motivation and Context
- Why is this change required? What problem does it solve?
Useful when user want to capture the DML EP specific partition into DML
currency to mitigate the dependency on the framework.
<!-- - If it fixes an open issue, please link to the issue here. -->
2024-02-26 11:35:13 -08:00
Yufeng Li
430a086f22
fix memory mapping on Windows (#19623)
### Description
<!-- Describe your changes. -->
Windows memory map casts mapped_offset to DWORD directly. It will be
truncated if it is larger than 2^32-1. We need to set high
dwFileOffsetHigh for this case.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

The bug was found from #19450
2024-02-25 08:50:45 -08:00
Yi Zhang
0fcc6fb760
Add Whisper model in CI (#19604)
### Description
 Add Whisper Conversion and E2E into Big Models pipeline



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: Your Name <your@email.com>
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
2024-02-25 14:04:22 +08:00
Yi Zhang
c980149c85
Add log for random exception in Linux GPU Test Stage. (#19569)
### Description
1. check GPU status in docker
2. use stages to make test stage can leverage existing building
artifacts


### Motivation and Context
To investigate the root cause of the random exception
`CUDA failure 100: no CUDA-capable device is detected`
2024-02-24 13:00:53 -08:00
Yulong Wang
0edb035808
[js/web] fix suite test list for zero sized tensor (#19638)
### Description

Fixes build break brought by #19614

Currently WebGL backend does not support zero sized tensor. This change
split test data into 2 parts, and only enable zero sized tensor tests
for WebGPU.
2024-02-24 10:09:07 -08:00
Changming Sun
9ccdc4961a
Stop using apiset in OneCore build: use onecoreuap.lib instead of onecoreuap_apiset.lib (#19632)
### Description
Stop using apiset in OneCore build: use onecoreuap.lib instead of
onecoreuap_apiset.lib in onecore build.


### Motivation and Context
1. Now all Windows Editions come with Reverse Forwarders. We should just
use the normal onecore libs.
2. Many new Windows APIs are only available in [windows umbrella
libraries](https://learn.microsoft.com/en-us/windows/win32/apiindex/windows-umbrella-libraries).
So these libraries are not specific for Windows CoreOS or Onecore.
3. Going forward we should use "IsApiSetImplemented" to guard our API
usages:

https://learn.microsoft.com/en-us/windows/win32/apiindex/detect-api-set-availability
.

After this change, our built binaries can pass apivalidator's check.

```
C:\local\apivalidator>apivalidator.exe -BinaryPath:C:\src\onnxruntime\b\Debug\Debug\onnxruntime.dll -SupportedApiXmlFiles:onecoreuap_DDIs.xml
ApiValidation:
Summary:
        "C:\src\onnxruntime\b\Debug\Debug\onnxruntime.dll" is Universal


ApiValidation: All binaries are Universal
```
So it will give an easy way to test ONNX Runtime's compatibility to
Windows versions.
2024-02-23 22:31:57 -08:00
Scott McKay
c12a20bef9
Add helper to run CIs for a branch using az pipelines. (#16843)
### Description
<!-- Describe your changes. -->
Add helper to run CIs for a branch using `az pipelines`.
This can be used to easily kick off multiple CIs for a branch prior to
creating a PR.

Update run_CIs_for_external_pr.py so the CI list can be shared. 
Request json output from `gh pr view` so the current state is more
easily parsed.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-02-24 14:06:30 +10:00
Sheil Kumar
46c4d7fe4a
Disable gemm activation for non-float data types (#19612)
### Description
Disable gemm activation for non-float data types


### Motivation and Context
When a float16 model contains a Gemm+Relu subgraph, the
gemm_activation_fusion will kick in and cause the two nodes to be
eliminated and replaced with a FusedGemm. This however is only
registered for the float data type. This causes model load failures.

Disable the fusion for non-float data types.

---------

Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
2024-02-23 18:20:22 -08:00
Guenther Schmuelling
bb43a0f133
[js/webgpu] minor fixes to make tinyllama work (#19564) 2024-02-23 15:45:30 -08:00
Yulong Wang
aec2389ad0
[js/webgpu] allows a ProgramInfo's RunData to use zero sized output (#19614)
### Description
This PR allows zero-sized output.

To make the implementation simple, it does not support partial
zero-sized tensor. Which means, either all outputs are zero-sized, or an
error will be reported.

added 2 tests:
 - op test of `Add` with input T[2,0] T[2,1], and
 - test_split_zero_size_splits
2024-02-23 12:52:47 -08:00
Markus Tavenrath
efbe2b8455
Fix cuDNN v9 build by replacing removed cuDNN v6 RNN API usage by cuDNN v8 RNN API and reenable RNN tests for CUDA EP (#19419)
Replace deprecated cuDNN RNN based API by cuDNN v8 RNN API and re-enable
RNN tests for the CUDA EP.

### Motivation and Context
The deprecated cuDNN RNN API might vanish soon and in addition for the
current CUDA EP RNN implementation all RNN tests are disabled due to
failures. With this change the deprecated API has been removed and the
new updated implemented doesn't fail the tests anymore.
2024-02-23 08:45:17 -08:00
cao lei
f430600432
Enable streams for DML EP. This change is to revert PR 19481 since the bug 19480 is fixed by PR 19515 (#19609)
### Description
<!-- Describe your changes. -->
Enable streams for DML EP. This change is to revert PR 19481 since the
bug 19480 is fixed by PR 19515


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Enable streams for DML EP. This change is to revert PR 19481 since the
bug 19480 is fixed by PR 19515
2024-02-23 06:02:05 -08:00
satyajandhyala
ae3d73c981
[JS/WebGPU] Fix Split and Where to handle corner cases. (#19613)
### Description
<!-- Describe your changes. -->
1. Fix Where operator to handle Boolean input less than 4 bytes.
2. Fix JSEP test harness to use tensor names consistently.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-02-23 00:21:15 -08:00
Markus Tavenrath
5e432a3ae6
Add support for NHWC GridSample in the CUDA EP and enable grid_sample_test for all EPs (#19562)
I've added NHWC GridSample support to the CUDA EP to reduce the number
of layout transforms. Also I've enabled the full set of GridSampleTests
for all EPs. I've also added the GridSample OpSet 16 to the registered
kernels.

### Motivation and Context
This is the first PR is a series of enhancements of the CUDA EP
improving NHWC support to avoid costly layout transforms between NWHC
and NCHW nodes which are layout sensitive. Also testing was quite
rudimentary for the CUDA EP while it was great for the CPU path. I've
regenerated grid_sample_test.cc enabling tests for other platforms as
well. Those tests resurfaced #10607 again which is fixed as well.
2024-02-22 19:47:15 -08:00
pengwa
ae92d593c0
ONNX Gelu Op in Opset 20 (#19560)
### ONNX Gelu Op in Opset 20

Refactor code to support MSDomain Gelu and ONNX Gelu-opset20 Op

1. Move CPU-GELU implmentation from
`onnxruntime/contrib_ops/cpu/activations.h/cc` to
`onnxruntime/core/providers/cpu/tensor/gelu.h/cc`, as the implementation
for approximate attribute to be 'none'.
2. Dumplicate some logic from
`onnxruntime/contrib_ops/cpu/bert/bias_gelu.cc` to
`onnxruntime/core/providers/cpu/tensor/gelu.h/cc`, as the implementation
for approximate attribute to be 'tanh'.
3. Register ONNX domain Gelu CPU kernel from opset 20 in
`onnxruntime/core/providers/cpu/cpu_execution_provider.cc`.
4. Move `onnxruntime/contrib_ops/cuda/bert/fast_gelu_impl.h/cu` to
`onnxruntime/core/providers/cuda/tensor/gelu_impl.h` and
`onnxruntime/core/providers/cuda/tensor/gelu_approximate_impl.cu`
respectively, as the implementation for approximate attribute to be
'tanh'.
5. Implement the logic for approximate attribute to be 'none' in
`onnxruntime/core/providers/cuda/tensor/gelu_impl.cu`.
6. Register ONNX domain Gelu CUDA kernel from opset 20 in
`onnxruntime/core/providers/cuda/cuda_execution_provider.cc`.
7. ROCM ep related changes. 
8. Enrich the tests for ONNX domain Gelu in
`onnxruntime/test/providers/cpu/activation/activation_op_test.cc`.
2024-02-23 11:05:16 +08:00
Segev Finer
29b1106033
[node] Switch to setImmediate to avoid starving the Node.js event loop (#19610)
### Description
<!-- Describe your changes. -->
Switch to setImmediate to avoid starving the Node.js event loop

There should really be a true async version though, running
computationally intensive things on the event loop will stop everything
else from happening while it is running, e.g. a web server from
answering requests.

This can be done by wrapping `RunAsync` behind a
[`napi::Promise`](https://github.com/nodejs/node-addon-api/blob/main/doc/promises.md)
to run on the onnxruntime thread pool or [`AsyncWorker`](
https://github.com/nodejs/node-addon-api/blob/main/doc/async_worker.md)
for the Node.js/libuv thread pool.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Without this, if you run inference in a tight loop, without anything
else in between that is async/deferred, `process.nextTick` will lead to
starving the event loop and not letting anything else run,
`setImmediate` at least lets the event loop spin between calls to `run`.

See
https://dev.to/ynmanware/setimmediate-settimeout-and-process-nexttick-3mfd

Contributed on behalf of [Swimm](https://swimm.io/)
2024-02-22 18:53:50 -08:00
Hector Li
4ab497603e
Enable user to set QNN HTP performance mode for every session run (#19521)
### Description
Currently, the QNN HTP performance mode is set during session creation, there's no way to change it afterwards. There's requirement to set it high performance mode for high priority request and set it back to low performance mode later to save the power when the incoming request is idle for example.

Now, still keeps the performance mode at the session level in QNN EP options which is used at the default one. Ort QNN EP will set it once if user set it.
And there are setting (qnn.htp_perf_mode and qnn.htp_perf_mode_post_run) in run option to change the performance mode before and after session run. There's recommended scenario that user set the mode to high performance mode before the the inference sun so that user can get the result back ASAP. And set the mode to low performance mode after the inference to save the power.
2024-02-22 17:04:59 -08:00
AtomicVar
5e5c36f6df
Fix citation author name issue (#19597)
Use `name` rather than `given-names` to set author name.

### Motivation and Context
The old CITATION.cff uses `given-names` to set author names, which won't
be rendered properly with some bibtex style of LaTeX:

<img width="680" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/22856433/c509400e-5b16-4400-8950-550b05186369">

The problem is that **the `"ONNX Runtime developers"` is regarded as a
human name**.

How to fix: by using `name` to set author name, the generated Bibtex
entry will use `{}` to enclose the `"ONNX Runtime developers"`. Then it
is displayed literally:

<img width="742" alt="image"
src="https://github.com/microsoft/onnxruntime/assets/22856433/94083c9f-0daa-4c51-92e1-c966b88d09d2">
2024-02-22 17:03:56 -08:00
dependabot[bot]
76a2a487a1
Bump ip from 1.1.8 to 1.1.9 in /js/react_native/e2e (#19583)
Bumps [ip](https://github.com/indutny/node-ip) from 1.1.8 to 1.1.9.
<details>
<summary>Commits</summary>
<ul>
<li><a
href="1ecbf2fd8c"><code>1ecbf2f</code></a>
1.1.9</li>
<li><a
href="6a3ada9b47"><code>6a3ada9</code></a>
lib: fixed CVE-2023-42282 and added unit test</li>
<li>See full diff in <a
href="https://github.com/indutny/node-ip/compare/v1.1.8...v1.1.9">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=ip&package-manager=npm_and_yarn&previous-version=1.1.8&new-version=1.1.9)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
Dependabot will merge this PR once CI passes on it, as requested by
@fs-eire.

[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the
[Security Alerts
page](https://github.com/microsoft/onnxruntime/network/alerts).

</details>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-02-22 13:58:17 -08:00
Hector Li
09622418c4
Add special handling if there is only 1 graph inside the cached QNN context binary (#19594)
Add special handling if there is only 1 graph inside the cached QNN context binary. No need to make the EPContext node name match the QNN graph name. This is for better backward compatibility in case the QNN context model is generated before the PR for QNN context binary model support multi-partition.
2024-02-22 13:15:13 -08:00
Xu Xing
fe82fccf1a
[js/webgpu] Fix Conv2DTransposeMatMul f16 compilation failure (#19596)
This is used in sam-h-decoder-f16.

### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-02-22 13:09:28 -08:00
jingyanwangms
3bdb10d5ca
Move import to when needed to avoid circular dependency error (#19579)
### Description
Move import to when needed to avoid circular dependency error


### Motivation and Context
Fixes dependency error described here:
https://github.com/microsoft/DeepSpeed/issues/5140

---------

Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>
2024-02-22 10:56:25 -08:00
cao lei
6b73ab3e3e
Introduce reused_buffer_index_per_stream in allocation planner which will be reset after computing the reuse buffer for each stream (#19515)
### Description
<!-- Describe your changes. -->
Introduce reused_buffer_index_per_stream in allocation planner which
will be reset after computing the reuse buffer for each stream. So if a
NodeArg is an input of several Ops across different streams and reuses
other NodeArg, the reused NodeArg won't be involved when computing the
second stream's reuse plan.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This is to fix https://github.com/microsoft/onnxruntime/issues/19480,
which is a crash for the scenario mentioned above.

---------

Co-authored-by: Lei Cao <leca@microsoft.com>
2024-02-22 10:19:08 -08:00
PeixuanZuo
05ed89f469
[ROCm] Add excluded libs for ROCm python package (#19586)
The rocm lib version has changed in rocm 6.0

Using libs packaged in whl might cause errors. 
For example, `libamdhip64.so.6` packaged in whl will cause compute error
when training gpt2 model.

The root cause still in investigating.
2024-02-22 13:34:55 +08:00
PeixuanZuo
8354329086
[ROCm] SkipGroupNorm triton (#19408)
Change GroupNorm triton to support SkipGroupNorm
2024-02-22 13:34:45 +08:00
Vincent Wang
3d88487c96
Minor Triton Fix (#19589)
Including removing a unnecessary assert, and add support of passing
string attribute from ONNX node attribute to python functoin kwargs
(mainly for passing debug info from graph to python for now).
2024-02-22 10:35:26 +08:00
Sheil Kumar
5197db1980
Diable __cpuid call for ARM64EC (#19592)
Diable __cpuid call for ARM64EC

Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
2024-02-21 15:45:44 -08:00
dependabot[bot]
38c3432393
Bump ip from 1.1.8 to 1.1.9 in /js/react_native (#19582)
Bumps [ip](https://github.com/indutny/node-ip) from 1.1.8 to 1.1.9.
<details>
<summary>Commits</summary>
<ul>
<li><a
href="1ecbf2fd8c"><code>1ecbf2f</code></a>
1.1.9</li>
<li><a
href="6a3ada9b47"><code>6a3ada9</code></a>
lib: fixed CVE-2023-42282 and added unit test</li>
<li>See full diff in <a
href="https://github.com/indutny/node-ip/compare/v1.1.8...v1.1.9">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=ip&package-manager=npm_and_yarn&previous-version=1.1.8&new-version=1.1.9)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
Dependabot will merge this PR once CI passes on it, as requested by
@fs-eire.

[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the
[Security Alerts
page](https://github.com/microsoft/onnxruntime/network/alerts).

</details>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-02-21 13:58:53 -08:00
Matttttt
ebd220b073
Misspelling in README.md (#19433)
Fixed a misspelling.
2024-02-21 13:38:18 -08:00
Tianlei Wu
3afb38cfb7
[CUDA] Add use_tf32 cuda provider option (for FP32 Conv) (#19426)
Follow up of https://github.com/microsoft/onnxruntime/pull/19357 to apply the use_tf32 option on fp32 cuDNN convolution.

When use_tf32 = 0, we will disable TF32 in cuDNN convolution for FP32 inputs.

https://docs.nvidia.com/deeplearning/cudnn/api/cudnn-graph-library.html#cudnnmathtype-t
**CUDNN_FMA_MATH**
- Restricted to only kernels that use FMA instructions.
- On pre-NVIDIA A100 GPU devices, CUDNN_DEFAULT_MATH and CUDNN_FMA_MATH
have the same behavior: Tensor Core kernels will not be selected.
- With NVIDIA Ampere architecture and CUDA toolkit 11,
CUDNN_DEFAULT_MATH permits TF32 Tensor Core operation and CUDNN_FMA_MATH
does not.
- The TF32 behavior for CUDNN_DEFAULT_MATH and the other Tensor Core
math types can be explicitly disabled by the environment variable
NVIDIA_TF32_OVERRIDE=0.
2024-02-21 12:46:16 -08:00
Adam Pocock
e5ce81ae84
[java] Adding ML program flag for CoreML (#19551)
### Description
Adds the new CoreML enum flags to enable ML Program support in Java.

### Motivation and Context
Adds support for #19347 to the Java API.
2024-02-21 12:24:41 -08:00
Xu Xing
57d6819212
[js/web] Fix fused-conv is not included in npm test (#19581)
BUG: https://github.com/microsoft/onnxruntime/issues/18855

### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-02-21 08:08:47 -08:00
Yulong Wang
58f4921686
[js] changes to allow Float16Array if any polyfill is available (#19305)
### Description

This change adds only necessary code to enable ort-web works with any
Float16Array polyfill. Unlike #19302, in this PR, ort-web does not
include any specific polyfill; instead, it's user's choice for how to
use a polyfill.

ORT-web uses Float16Array if it's available; otherwise, fallback to use
Uint16Array.

```js
// case 1: user does not use polyfill:
import * as ort from 'onnxruntime-web';

const myF16Data = new Uint16Array(...);  // need to use Uint16Array
const myF16tensor = new ort.Tensor('float16', myF16Data, dims);
```

```js
// case 2: user use polyfill:
import * as ort from 'onnxruntime-web';
import {
  Float16Array, isFloat16Array, isTypedArray,
  getFloat16, setFloat16,
  f16round,
} from "@petamoriken/float16";
globalThis.Float16Array = Float16Array;  // ort-web will pick the global Float16Array

const myF16Data = new Float16Array(...);  // Use the polyfilled Float16Array type
const myF16tensor = new ort.Tensor('float16', myF16Data, dims);
```
2024-02-21 00:31:06 -08:00
satyajandhyala
8092a89688
Changed command line argpasrse to process '--symmetric [True|False]'. (#19577)
### Description
<!-- Describe your changes. -->
Accept the command line option --symmetric and its optional value
correctly. If the optional value matches uncased to 'True' then set
symmetric to True else set symmetric to False. Asymmetric quantization
will generate zero_point input.
```
usage: matmul_4bits_quantizer.py [-h] --input_model INPUT_MODEL --output_model OUTPUT_MODEL [--block_size BLOCK_SIZE] [--symmetric [{True,False}]] [--accuracy_level ACCURACY_LEVEL] [-v]
                                 [--nodes_to_exclude NODES_TO_EXCLUDE [NODES_TO_EXCLUDE ...]]
```
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-02-20 21:18:54 -08:00