Commit graph

10651 commits

Author SHA1 Message Date
Dmitri Smirnov
0cdf36faeb
Expose SessionOtions.DisablePerSessionThreads (#19730)
### Description

### Motivation and Context
ML.NET needs to run mltiple sessions on a single threadpool.
2024-03-04 13:46:51 -08:00
raoanag
27b1dc91ab
[DML] MatrixMultiplyIntegerToFloat (#19608)
### Description
DML Implementation for
[com.microsoft.MatMulIntegerToFloat](https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md#com.microsoft.MatMulIntegerToFloat)

```
.\onnxruntime_test_all.exe --gtest_filter="*MatMulIntegerToFloat.*"
Note: Google Test filter = *MatMulIntegerToFloat.*
[==========] Running 22 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 22 tests from MatMulIntegerToFloat
[ RUN      ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_S8S8
[       OK ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_S8S8 (620 ms)
[ RUN      ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_S8S8
[       OK ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_S8S8 (497 ms)
[ RUN      ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_S8S8
[       OK ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_S8S8 (488 ms)
[ RUN      ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_S8S8
[       OK ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_S8S8 (503 ms)
[ RUN      ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_U8U8
[       OK ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_U8U8 (495 ms)
[ RUN      ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_U8U8
[       OK ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_U8U8 (488 ms)
[ RUN      ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_U8U8
[       OK ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_U8U8 (492 ms)
[ RUN      ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_U8X8
[       OK ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_U8X8 (502 ms)
[ RUN      ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_S8U8
[       OK ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_S8U8 (452 ms)
[ RUN      ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_S8U8
[       OK ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_S8U8 (454 ms)
[ RUN      ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_S8U8
[       OK ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_S8U8 (446 ms)
[ RUN      ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_S8U8
[       OK ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_S8U8 (508 ms)
[ RUN      ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_U8S8
[       OK ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_U8S8 (456 ms)
[ RUN      ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_U8S8
[       OK ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_U8S8 (455 ms)
[ RUN      ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_U8S8
[       OK ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_U8S8 (447 ms)
[ RUN      ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_U8S8
[       OK ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_U8S8 (465 ms)
[ RUN      ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_U8U8
[       OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_U8U8 (111 ms)
[ RUN      ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_U8S8
[       OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_U8S8 (115 ms)
[ RUN      ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_S8S8
[       OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_S8S8 (114 ms)
[ RUN      ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_S8U8
[       OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_S8U8 (110 ms)
[ RUN      ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16
[       OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16 (112 ms)
[ RUN      ] MatMulIntegerToFloat.MatMulInteger_With_ZeroPoint
[       OK ] MatMulIntegerToFloat.MatMulInteger_With_ZeroPoint (337 ms)
[----------] 22 tests from MatMulIntegerToFloat (8679 ms total)

[----------] Global test environment tear-down
[==========] 22 tests from 1 test suite ran. (8680 ms total)
[  PASSED  ] 22 tests.
memleakdbg:
----- No memory leaks detected -----
```


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
* `CalculateMatMulIntegerToFloat` to replace CPU EP run reference
* Added more FP32 testcases to isolate all input datatype combinations 
* Added fixed input to `MatMulIntegerToFloat_FP16*` test cases as for
FP16 test cases.
* onnxruntime/test/testdata/matmul_integer_to_float.py` is capable of
generating FP16 models, but we do not produce any for now
2024-03-04 11:55:35 -08:00
inisis
2e13d5f0ab
fix split shape inference error for opset >= 13 (#19756)
### Description
get split operator split section by opset

### Motivation and Context
for opset higher than 13, split section is treated as an input.
2024-03-04 09:41:36 -08:00
ironman
9acaf534a6
Benchmark - Updating llama-2 requirement files (#19716)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-03-04 07:29:58 -08:00
Yi Zhang
9460597b21
Update copying API header files (#19736)
### Description
Make Linux logic consistent as Windows


### Motivation and Context
onnxruntime_lite_custom_op.h in Windows zip package but not in Linux zip
package

acbfc29f27/tools/ci_build/github/azure-pipelines/templates/c-api-artifacts-package-and-publish-steps-windows.yml (L67)

Co-authored-by: Your Name <your@email.com>
2024-03-02 11:33:47 +08:00
Adrian Lizarraga
2d79052ec3
[QNN Quant] Add preprocessing option to transpose graph inputs/outputs to channel-last (#19731)
### Description
Adds the optional parameters `inputs_to_make_channel_last` and
`outputs_to_make_channel_last` to the `qnn_preprocess_model()` function.

```python
"""
 inputs_to_make_channel_last: List of graph input names to transpose to be "channel-last". For example,
      if "input0" originally has the shape (N, C, D1, D2, ..., Dn), the resulting model will change input0's
      shape to (N, D1, D2, ..., Dn, C) and add a transpose node after it.
      Original:
          input0 (N, C, D1, D2, ..., Dn) --> <Nodes>
      Updated:
          input0 (N, D1, D2, ..., Dn, C) --> Transpose --> input0_chanfirst (N, C, D1, D2, ..., Dn) --> <Nodes>
      This can potentially improve inference latency for QDQ models running on QNN EP because the
      additional transpose node may allow other transpose nodes inserted during ORT layout transformation
      to cancel out.
 outputs_to_make_channel_last: List of graph output names to transpose to be "channel-last". For example,
      if "output0" originally has the shape (N, C, D1, D2, ..., Dn), the resulting model will change output0's
      shape to (N, D1, D2, ..., Dn, C) and add a transpose node before it.
      Original:
          <Nodes> --> output0 (N, C, D1, D2, ..., Dn)
      Updated:
          <Nodes> --> output0_chanfirst (N, C, D1, D2, ..., Dn) --> Transpose --> output0 (N, D1, D2, ..., Dn, C)
      This can potentially improve inference latency for QDQ models running on QNN EP because the
      additional transpose node may allow other transpose nodes inserted during ORT layout transformation
      to cancel out.
"""
```

**NOTE: If you use these options with the quantization scripts, you'll
have to make sure your data_reader feeds in transposed input data. It
won't happen automatically.**

### Motivation and Context
Native QNN operators use the channel-last data layout, but ONNX uses
channel-first. To bridge the gap, ORT's layout transformer inserts
transposes around layout-sensitive nodes and updates their domain to
indicate that they now operate on channel-last data. The transpose
optimizer is able to remove most of these inserted transposes, but not
all transposes can always be removed (i.e., some could remain at the
graph's inputs and outputs).

We've found that these extra transpose nodes can significantly degrade
inference latency on QNN EP. One workaround (provided by this PR) is to
add _additional_ transpose nodes at the graph inputs or outputs. These
additional nodes can often help the ORT transpose optimizer cancel out
any remaining transpose nodes, which significantly improves latency.

Additionally, it may make more sense for some kinds of inputs to just be
in channel-last form (e.g., images), avoiding the need to pre-transpose
of the input data before inference.

Example at the input:
```
Original:
    input0 (N, C, D1, D2, ..., Dn) --> <Nodes>
Updated:
    input0 (N, D1, D2, ..., Dn, C) --> Transpose --> input0_chanfirst (N, C, D1, D2, ..., Dn) --> <Nodes>
```

Example at the output:
```
Original:
   <Nodes> --> output0 (N, C, D1, D2, ..., Dn)
Updated:
   <Nodes> --> output0_chanfirst (N, C, D1, D2, ..., Dn) --> Transpose --> output0 (N, D1, D2, ..., Dn, C)
```
2024-03-01 18:39:51 -08:00
zesongw
de3158e78d
[WebNN EP] Add contraints for MatMul (#19713)
### Description
Add constraints to MatMul:
- The input must be at least 2D.
- CPU backend: The input rank must be the same.
- CPU backend: The input shape except for the last two axis must be the
same.



### Motivation and Context
Prevent regression for some models.
2024-03-01 16:55:50 -08:00
Changming Sun
a0521f899e
Enable CPUINFO for all Windows build (#19655)
### Description
It was disabled in PR #9065. And the reason was:
" api-ms-win-core-kernel32-legacy-*.dll wasn't available in Windows 8
and was added in Windows 10, so cpuinfo breaks our Windows 8 support.
I'm disabling it again."

We no longer support Windows 8.  Therefore we can add CPUINFO back.

### Motivation and Context
To make the code simpler. If in any case the library doesn't work as
expected, we can submit a PR to their code base and fix it.
2024-03-01 16:23:20 -08:00
Yulong Wang
f06164ef8b
[js/web] transfer input buffer back to caller thread (#19677)
### Description

When using proxy worker, input buffers should be transferred back to the
caller thread after `run()` call is done.

Fixes #19488
2024-03-01 14:50:06 -08:00
Yufeng Li
22176a5fa8
disable gemm f16 on CPU (#19744)
### Description
<!-- Describe your changes. -->
Temporarily disable fp16 gemm on CPU because it usually needs a
following Cast which offsets the gain. Need more fp16 operators
implementation and performance tuning.

Also fix a fusion error of LayerNormalization.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-03-01 13:44:29 -08:00
Edward Chen
5672cdebdf
Update google benchmark to 1.8.3. (#19734)
Update google benchmark to 1.8.3.
Update deps_update_and_upload.py script to make it easier to use.
2024-03-01 11:01:58 -08:00
Changming Sun
ed550b5fe5
Change webgpu CI pipeline to use a preinstalled chrome (#19729)
### Description
Change webgpu CI pipeline to use a preinstalled chrome. Hopefully it can
increase the stability. Now the chrome got from puppeteer often failed
to start.
2024-02-29 20:36:29 -08:00
pengwa
acbfc29f27
Follow up fix for Gelu impl (#19693)
### Follow up fix for Gelu impl

There are two minor comments in
https://github.com/microsoft/onnxruntime/pull/19560.

Fix them in this pull request. 


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-03-01 10:57:14 +08:00
Scott McKay
2a857d9a86
Add ML Program support for more operators (#19527)
### Description
<!-- Describe your changes. -->

Add support for:
- Clip/Relu/Relu6
- Add/Mul/Div/Sub/Pow
- GlobalAveragePool/GlobalMaxPool/AveragePool/MaxPool
- Reshape
- Gemm/MatMul

Fix some build issues/warnings from changes.

Fix a couple of potential issues with the Resize op as well (noticed due
to change to reject inputs with empty data at a higher level).

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Enable mobilenetv2 with ML Program
2024-03-01 10:23:29 +10:00
Dmitri Smirnov
5ee62a6bcc
CUDA Resize-18 implementation (#19595)
### Description
Implement Resize-18 on CUDA.

### Motivation and Context
Performance
2024-02-29 14:46:42 -08:00
Adam Louly
d5606cd7ee
Introducing customizable input names for loss in generate_artifacts. (#19705)
# loss function extra inputs.
Currently, the loss functions in onnxblock expect exactly two inputs in
their build method.
Occasionally, models may pass additional inputs, causing the build
function to fail.
To solve this issue, we can let users pass a list of loss input names to
be used in the loss function.
2024-02-29 13:40:56 -08:00
Yi-Hong Lyu
ec0e4d3b65
Parallel Transpose_BSNH_to_BNSH (#19406)
Achieved a speedup of 1.098 in MultiHeadAttention and an end-to-end
speedup of 1.021 in the OCR model through parallelization of the
Transpose_BSNH_to_BNSH operation.
2024-02-29 10:31:57 -08:00
Vincent Wang
937cdd651e
[ORTMODULE] Support Register Custom Triton Kernel (#19690)
Add support for registering custom Triton kernel function.
2024-02-29 23:03:57 +08:00
PeixuanZuo
c311d1faf5
[ROCm] Update dockerfile (#19661)
Update dockerfile to ROCm6.0
2024-02-29 17:51:29 +08:00
Adrian Lizarraga
c1bf7fcd2f
[QNN Quant] Ensure 16bit tensor quant overrides set MS domain (#19684)
### Description
Ensures that DQ and Q ops use the msft domain if tensor quantization
overrides specify 16-bit integer types.

### Motivation and Context
ONNX does not yet support 16bit integer types for QuantizeLinear and
DequantizeLinear ops (coming soon). For now, DQ/Q ops must use the MSFT
domain.

We have to also check if tensor quantization overrides force the use of
16-bit quantization types. If so, we must correctly set the domain for
Q/DQ ops.
2024-02-29 01:19:25 -08:00
Vincent Wang
d2e6dd25ea
Merge GatherToSplitFusion and #19218 to a General Fusion (#19600)
#19218 tried to fuse Gather/Slice to Split, but the logic has problem.
Scalar value or 1-dim value of indices in Gather node will produce
different result, scalar value will produce a result tensor by removing
the axis dim, will 1-dim indices value will keep that dim, even when the
dim value is 1. For example,

Node
    |-> Gather(indices=[0], axis=axis)
    |-> Gather(indices=[1], axis=axis)
    |-> Slice(index=2, axis=axis)
is same as
Node
   |-> Split(axis=axis)

But
Node
    |-> Gather(indices=0, axis=axis)
    |-> Gather(indices=1, axis=axis)
    |-> Slice(index=2, axis=axis)
is same as
Node
    |-> Split(axis=axis)
        ||-> Squeeze(axis=axis)
        ||-> Squeeze(axis=axis)
        ||->

Previous PR doesn't take such case related to Squeeze/Unsqueeze into
account.

This PR merges #19218 and GatherToSplitFusion to a general fusion, which
relaxes the limit the number of Gather and Slice node number, check all
Gather and Slice consumers, if the indices of Gather and start/end of
Slice can cover the specific dim of the input tensor, then we can fuse
them to a Split, and adding Squeeze if necessary according to the dim
count of the indices tensor in Gather.

@rui-ren, please check if the fix can still be applied to your model.
2024-02-29 13:45:58 +08:00
Sophie Schoenmeyer
7455dd1f32
Update labeler.yml to change permissions (#19709)
### Description
Updated github/issue-labeler permissions to give write access for
issues. Tried to submit the same PR last week, but the checks kept
failing, so I couldn't merge.



### Motivation and Context
Enables issue labeling again, which has been broken since GitHub Actions
permissions were changed a couple weeks ago.
2024-02-28 21:10:25 -08:00
Changming Sun
250779474d
Change "onnxruntime-Linux-CPU-For-Android-CI" machine pool to "onnxruntime-Ubuntu2204-AMD-CPU" (#19698)
### Description
The original one reports "out of disk space", which needs to be
investigated.
2024-02-28 19:36:26 -08:00
Yulong Wang
e30618d055
[js/webgpu] use Headless for webgpu test by default (#19702)
### Description
use Chromium Headless for webgpu test by default. Still use normal
Chromium with window when debug=true or perfMode=true.

Use the
[`--headless=new`](https://developer.chrome.com/docs/chromium/new-headless)
mode.



### Motivation and Context
try to use a more stable way to launch npm tests to avoid a "chrome not
found" issue in pipeline, which may potentially caused by windowed
application.
2024-02-28 16:05:08 -08:00
Changming Sun
a93c31e3c9
Update dml-vs-2022.yml (#19687)
### Description
Fix a build error in "Zip-Nuget-Java-Nodejs Packaging Pipeline" which
deletes files too early.
2024-02-28 12:03:17 -08:00
Adrian Lizarraga
913bdc7306
[QNN Quant] Handle external data for QNN preprocessing/quant (#19670)
### Description
- Adds parameters to `qnn_preprocess_model()` to allow saving the new
model with external data.
- Updates `get_qnn_qdq_config()` to:
  - Load model without external data (it is not needed)
- Return a quantization configuration with `use_external_data_format`
set to `True` if the model has external data or if the model is >= 2GB.

### Motivation and Context
Update QNN quantization to better handle large models that use external
data.
2024-02-28 08:30:12 -08:00
Changming Sun
7a147fc6f7
Remove a bash task from webgpu CI pipeline (#19682)
### Description
It is a "Bash" task that requires running bash on Windows. Most Windows
operating systems do not have Bash installed. Given this task is only
debugging purposes, we can remove it for now.


### Motivation and Context
I am making this change because I am regenerating the VM image in a
different manner, and the new image does not contain bash. Once this PR
is in, I can switch the images.
2024-02-28 18:20:53 +08:00
pengwa
026e3178ae
Improve memory matrix for ORTModule (#19620)
### Memory matrix for ORTModule

Collect  parameter/gradient/buffers sizes also. 
Exposed as a function, can be used externally for debugging purpose. 


```
2024-02-27 07:18:55,283 orttraining.rank-0 [INFO] - rank-0 step 1 memory (MiB) | phase: pre_forward | allocated: 5331 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 219 | max inactive: 816 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:55,322 orttraining.rank-0 [INFO] - rank-0 step 1 memory (MiB) | phase: post_forward | allocated: 8162 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 400 | max inactive: 816 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:55,358 orttraining.rank-0 [INFO] - rank-0 step 1 memory (MiB) | phase: pre_backward | allocated: 8926 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 400 | max inactive: 816 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:55,438 orttraining.rank-0 [INFO] - rank-0 step 1 memory (MiB) | phase: post_backward | allocated: 6098 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 218 | max inactive: 831 | param: 5314 | grad: 12 | buffer: 8
  0%|▏                                                                                                                                                                                                                                              | 2/3200 [01:27<32:05:11, 36.12s/it]2024-02-27 07:18:55,498 orttraining.rank-0 [INFO] - rank-0 step 2 memory (MiB) | phase: pre_forward | allocated: 5331 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 219 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:55,537 orttraining.rank-0 [INFO] - rank-0 step 2 memory (MiB) | phase: post_forward | allocated: 8162 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 400 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:55,576 orttraining.rank-0 [INFO] - rank-0 step 2 memory (MiB) | phase: pre_backward | allocated: 8926 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 400 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:55,657 orttraining.rank-0 [INFO] - rank-0 step 2 memory (MiB) | phase: post_backward | allocated: 6098 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 218 | max inactive: 831 | param: 5314 | grad: 12 | buffer: 8
  0%|▏                                                                                                                                                                                                                                              | 3/3200 [01:27<17:30:57, 19.72s/it]2024-02-27 07:18:55,711 orttraining.rank-0 [INFO] - rank-0 step 3 memory (MiB) | phase: pre_forward | allocated: 5331 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 219 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:55,750 orttraining.rank-0 [INFO] - rank-0 step 3 memory (MiB) | phase: post_forward | allocated: 8162 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 400 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:55,786 orttraining.rank-0 [INFO] - rank-0 step 3 memory (MiB) | phase: pre_backward | allocated: 8926 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 400 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:55,867 orttraining.rank-0 [INFO] - rank-0 step 3 memory (MiB) | phase: post_backward | allocated: 6098 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 218 | max inactive: 831 | param: 5314 | grad: 12 | buffer: 8
[2024-02-27 07:18:55,886] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1
  0%|▎                                                                                                                                                                                                                                              | 4/3200 [01:28<10:39:52, 12.01s/it]2024-02-27 07:18:55,902 orttraining.rank-0 [INFO] - rank-0 step 4 memory (MiB) | phase: pre_forward | allocated: 5331 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 219 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:55,944 orttraining.rank-0 [INFO] - rank-0 step 4 memory (MiB) | phase: post_forward | allocated: 8162 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 400 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:55,979 orttraining.rank-0 [INFO] - rank-0 step 4 memory (MiB) | phase: pre_backward | allocated: 8926 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 400 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:56,060 orttraining.rank-0 [INFO] - rank-0 step 4 memory (MiB) | phase: post_backward | allocated: 6098 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 218 | max inactive: 831 | param: 5314 | grad: 12 | buffer: 8
  0%|▍                                                                                                                                                                                                                                               | 5/3200 [01:28<6:53:04,  7.76s/it]2024-02-27 07:18:56,115 orttraining.rank-0 [INFO] - rank-0 step 5 memory (MiB) | phase: pre_forward | allocated: 5331 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 219 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:56,154 orttraining.rank-0 [INFO] - rank-0 step 5 memory (MiB) | phase: post_forward | allocated: 8162 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 400 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:56,190 orttraining.rank-0 [INFO] - rank-0 step 5 memory (MiB) | phase: pre_backward | allocated: 8926 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 400 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:56,270 orttraining.rank-0 [INFO] - rank-0 step 5 memory (MiB) | phase: post_backward | allocated: 6098 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 218 | max inactive: 831 | param: 5314 | grad: 12 | buffer: 8
  0%|▍                                                                                                                                                                                                                                               | 6/3200 [01:28<4:36:19,  5.19s/it]2024-02-27 07:18:56,323 orttraining.rank-0 [INFO] - rank-0 step 6 memory (MiB) | phase: pre_forward | allocated: 5331 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 219 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:56,365 orttraining.rank-0 [INFO] - rank-0 step 6 memory (MiB) | phase: post_forward | allocated: 8162 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 400 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:56,398 orttraining.rank-0 [INFO] - rank-0 step 6 memory (MiB) | phase: pre_backward | allocated: 8926 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 400 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:56,478 orttraining.rank-0 [INFO] - rank-0 step 6 memory (MiB) | phase: post_backward | allocated: 6098 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 218 | max inactive: 831 | param: 5314 | grad: 12 | buffer: 8
  0%|▌                                                                                                                                                                                                                                               | 7/3200 [01:28<3:09:33,  3.56s/it]2024-02-27 07:18:56,533 orttraining.rank-0 [INFO] - rank-0 step 7 memory (MiB) | phase: pre_forward | allocated: 5331 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 219 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:56,572 orttraining.rank-0 [INFO] - rank-0 step 7 memory (MiB) | phase: post_forward | allocated: 8162 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 400 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:56,608 orttraining.rank-0 [INFO] - rank-0 step 7 memory (MiB) | phase: pre_backward | allocated: 8926 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 400 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:56,727 orttraining.rank-0 [INFO] - rank-0 step 7 memory (MiB) | phase: post_backward | allocated: 6098 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 218 | max inactive: 831 | param: 5314 | grad: 12 | buffer: 8
  0%|▌                                                                                                                                                                                                                                               | 8/3200 [01:28<2:13:48,  2.52s/it]2024-02-27 07:18:56,806 orttraining.rank-0 [INFO] - rank-0 step 8 memory (MiB) | phase: pre_forward | allocated: 5331 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 219 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:56,846 orttraining.rank-0 [INFO] - rank-0 step 8 memory (MiB) | phase: post_forward | allocated: 8162 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 400 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:56,882 orttraining.rank-0 [INFO] - rank-0 step 8 memory (MiB) | phase: pre_backward | allocated: 8926 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 400 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:56,962 orttraining.rank-0 [INFO] - rank-0 step 8 memory (MiB) | phase: post_backward | allocated: 6098 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 218 | max inactive: 831 | param: 5314 | grad: 12 | buffer: 8
  0%|▋                                                                                                                                                                                                                                               | 9/3200 [01:29<1:36:03,  1.81s/it]2024-02-27 07:18:57,053 orttraining.rank-0 [INFO] - rank-0 step 9 memory (MiB) | phase: pre_forward | allocated: 5331 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 219 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8
2024-02-27 07:18:57,094 orttraining.rank-0 [INFO] - rank-0 step 9 memory (MiB) | phase: post_forward | allocated: 8162 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 400 | max inactive: 831 | param: 5314 | grad: 0 | buffer: 8

```
2024-02-28 15:57:05 +08:00
Yi Zhang
f95c0773a1
Add share memory Flag in docker (#19672)
### Description



### Motivation and Context
Ref:
https://docs.nvidia.com/deeplearning/frameworks/user-guide/index.html#setincshmem

Co-authored-by: Your Name <your@email.com>
2024-02-28 10:40:40 +08:00
Maximilian Müller
c20ced4132
Use CMake's find package for CUDA libs (#19673)
### Description
Answers issue #19640 
More details are in the issue, basically I am changing all the include
directory and link directory usage to CMake's `CUDA::*` targets
2024-02-27 11:26:48 -08:00
Yulong Wang
3cb81cdde2
[js/common] move 'env.wasm.trace' to 'env.trace' (#19617)
### Description

Try to move 'env.wasm.trace' to 'env.trace' to make it less confusing,
because it also works in webgpu. Marked 'env.wasm.trace' as deprecated.
2024-02-27 11:07:15 -08:00
zesongw
2e4d1b8f1b
[WebNN EP] Add support for Op MatMul of WebNN CPU backend (#19413)
Enable MatMul support for WebNN CPU backend to support more models.
2024-02-27 10:01:12 -08:00
Scott McKay
1c468a03b9
Improve Nuget-CUDA-Packaging-Pipeline (#19668)
### Description
<!-- Describe your changes. -->
* Publish the artifacts as late as possible
* once published the artifacts are immutable, and any retry will fail if
they exist
  * if any step fails after publishing the stage cannot be retried
* use powershell to cleanup
  * DeleteFiles is taking >30 mins and causing the stage to timeout
  * powershell took < 1s

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Make pipeline more robust
2024-02-27 09:27:43 -08:00
Scott McKay
580ee20dfc
Tweak Windows build parallelization settings (#19664)
### Description
<!-- Describe your changes. -->
Use UseMultiToolTask and limit the number of cl.exe instances running. 

MultiToolTask info:
https://devblogs.microsoft.com/cppblog/improved-parallelism-in-msbuild/

Info on why limiting CL_MPCount can help:
https://github.com/Microsoft/checkedc-clang/wiki/Parallel-builds-of-clang-on-Windows

The current CIs have 4 cores (both physical and logical). Hardcoded the
GPU build in win-ci.yml to use CL_MPCount of 2 as that seems to work
fine. Can adjust if needed to base it on the actual number of cores or
to use build.py to build.

Caveat: I've run about 16 builds and haven't seen a slow build yet, but
as the root cause of the slow builds isn't really known this isn't
guaranteed to be a fix.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Try and prevent super slow GPU builds by reducing number of tasks
potentially running in parallel.
2024-02-27 08:56:16 -08:00
Yi Zhang
3b46ab6439
Re-add testing removed by mistake. (#19647) 2024-02-27 08:46:29 -08:00
Adrian Lizarraga
4838cb6b3e
[QNN Quantization] Ensure fused nodes have names (#19650)
### Description
- Updates the `qnn_preprocess_model()` method to set a name for any new
nodes added to the graph (due to fusion).
- Updates the `qnn_preprocess_model()` method to set a name for any
unnamed nodes that previously existed in the original graph.
- Adds unit tests for fusions (previously missing)
  - Checks that fused node names exist and are unique
  - Checks that fused graph is equivalent to original graph


### Motivation and Context
Nodes are not strictly required to have names. However, a
planned/upcoming feature to support mixed-precision (integer) quantized
models needs nodes to have names.
2024-02-27 02:27:35 -08:00
cloudhan
1e69b61238
Make version string detection more robust (#19615)
`/opt/rocm/.info/version-dev` is only available if the `rocm-dev`
metapackage is installed. This will bring a lot of unused packages which
are not needed by the users, they may opt for fine grained control.
Fallback to `rocm_version.h` in case `rocm-dev` is not installed.
2024-02-27 16:06:06 +08:00
duanshengliu
9e19684944
Fix the TypeError issue in quantize.py (#19459)
### Description
<!-- Describe your changes. -->
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fix related bug as described in
https://github.com/microsoft/onnxruntime/issues/19430
2024-02-26 20:56:32 -08:00
Rachel Guo
5bb58a10e7
Enable the most verbose logging level in detox E2E React Native CI (#19659)
### Description
<!-- Describe your changes. -->

The RN CI has intermittent failure error with "app seems to idle".
enable the most verbose logging level (and can add steps to dump
device.log from the detox folder/artifacts if necessary) to at least get
more information.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
2024-02-26 20:00:14 -08:00
kailums
6f566562ce
support user_compute_stream for rocm ep (#19619)
### Description
<!-- Describe your changes. -->
According to the pr #19229 supporting cuda EP use external compute
stream, we add support for rocm EP.

And when we testing this feature with torch, we found torch use stream 0
for the default stream, and `torch.cuda.current_stream()` returns `0`
for current stream, but ort treat `0` or `nullptr` as invalid, and reset
has_user_compute_stream to false. 

Will remove has_user_compute_stream option in the future.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
The motivation for this pr is that we want to use torch.cuda.graph to
capture ort running kernel, which requires torch and ort are running in
the same stream, so we use this API to set ort's working stream.
2024-02-27 11:31:03 +08:00
Scott McKay
8a71b65765
Remove skipping of Reshape from NNAPI EP (#19618)
### Description
<!-- Describe your changes. -->
A number of Qualcomm Snapdragon chipsets do not produce correct output
if we skip the Reshape, which ironically was a performance optimization
for Snapdragon chips.

Perf testing showed that Squeeze also seems to execute on CPU so there's
no benefit to using that as an alternative where possible e.g.
Global*Pool -> Reshape to 2D -> Gemm could be potentially be replaced
with Global*Pool -> Squeeze dims 2 and 3 -> Gemm if that offered better
performance.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
#19518
2024-02-27 11:35:27 +10:00
Changming Sun
18c8fab1ae
Fix a bug in build.py (#19652)
### Description
Fix a bug in build.py that accidentally disabled C# tests for most
builds when "--build_nuget" is specified.

### Motivation and Context
The bug was introduced in PR #8892 .
2024-02-26 15:58:09 -08:00
Scott McKay
8bd943be39
Retry flaky XCode iOS UI tests if we get a known error (#19639)
### Description
<!-- Describe your changes. -->
Xcode UI tests seem to be flaky:
https://github.com/orgs/community/discussions/68807
Add a couple of retries if we get a "Timed out while loading
Accessibility." error which is transient.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-02-27 09:31:32 +10:00
Sumit Agarwal
a9568935a5
[DML EP] Enable DML Graph Serialization (#19505)
### Description
This PR adds a feature to serialize all DML EP partitions into DML
currency individually for a given a model. This feature can be
dynamically turned on by using DML EP option
`ep.dml.enable_graph_serialization`.


### Motivation and Context
- Why is this change required? What problem does it solve?
Useful when user want to capture the DML EP specific partition into DML
currency to mitigate the dependency on the framework.
<!-- - If it fixes an open issue, please link to the issue here. -->
2024-02-26 11:35:13 -08:00
Yufeng Li
430a086f22
fix memory mapping on Windows (#19623)
### Description
<!-- Describe your changes. -->
Windows memory map casts mapped_offset to DWORD directly. It will be
truncated if it is larger than 2^32-1. We need to set high
dwFileOffsetHigh for this case.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

The bug was found from #19450
2024-02-25 08:50:45 -08:00
Yi Zhang
0fcc6fb760
Add Whisper model in CI (#19604)
### Description
 Add Whisper Conversion and E2E into Big Models pipeline



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: Your Name <your@email.com>
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
2024-02-25 14:04:22 +08:00
Yi Zhang
c980149c85
Add log for random exception in Linux GPU Test Stage. (#19569)
### Description
1. check GPU status in docker
2. use stages to make test stage can leverage existing building
artifacts


### Motivation and Context
To investigate the root cause of the random exception
`CUDA failure 100: no CUDA-capable device is detected`
2024-02-24 13:00:53 -08:00
Yulong Wang
0edb035808
[js/web] fix suite test list for zero sized tensor (#19638)
### Description

Fixes build break brought by #19614

Currently WebGL backend does not support zero sized tensor. This change
split test data into 2 parts, and only enable zero sized tensor tests
for WebGPU.
2024-02-24 10:09:07 -08:00
Changming Sun
9ccdc4961a
Stop using apiset in OneCore build: use onecoreuap.lib instead of onecoreuap_apiset.lib (#19632)
### Description
Stop using apiset in OneCore build: use onecoreuap.lib instead of
onecoreuap_apiset.lib in onecore build.


### Motivation and Context
1. Now all Windows Editions come with Reverse Forwarders. We should just
use the normal onecore libs.
2. Many new Windows APIs are only available in [windows umbrella
libraries](https://learn.microsoft.com/en-us/windows/win32/apiindex/windows-umbrella-libraries).
So these libraries are not specific for Windows CoreOS or Onecore.
3. Going forward we should use "IsApiSetImplemented" to guard our API
usages:

https://learn.microsoft.com/en-us/windows/win32/apiindex/detect-api-set-availability
.

After this change, our built binaries can pass apivalidator's check.

```
C:\local\apivalidator>apivalidator.exe -BinaryPath:C:\src\onnxruntime\b\Debug\Debug\onnxruntime.dll -SupportedApiXmlFiles:onecoreuap_DDIs.xml
ApiValidation:
Summary:
        "C:\src\onnxruntime\b\Debug\Debug\onnxruntime.dll" is Universal


ApiValidation: All binaries are Universal
```
So it will give an easy way to test ONNX Runtime's compatibility to
Windows versions.
2024-02-23 22:31:57 -08:00
Scott McKay
c12a20bef9
Add helper to run CIs for a branch using az pipelines. (#16843)
### Description
<!-- Describe your changes. -->
Add helper to run CIs for a branch using `az pipelines`.
This can be used to easily kick off multiple CIs for a branch prior to
creating a PR.

Update run_CIs_for_external_pr.py so the CI list can be shared. 
Request json output from `gh pr view` so the current state is more
easily parsed.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-02-24 14:06:30 +10:00