onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-10 17:37:14 +00:00

Author	SHA1	Message	Date
Dmitri Smirnov	0cdf36faeb	Expose SessionOptions.DisablePerSessionThreads (#19730 ) ### Description ### Motivation and Context ML.NET needs to run multiple sessions on a single threadpool.	2024-03-04 13:46:51 -08:00
raoanag	27b1dc91ab	[DML] MatrixMultiplyIntegerToFloat (#19608 ) ### Description DML Implementation for [com.microsoft.MatMulIntegerToFloat](https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md#com.microsoft.MatMulIntegerToFloat) ``` .\onnxruntime_test_all.exe --gtest_filter="MatMulIntegerToFloat." Note: Google Test filter = MatMulIntegerToFloat. [==========] Running 22 tests from 1 test suite. [----------] Global test environment set-up. [----------] 22 tests from MatMulIntegerToFloat [ RUN ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_S8S8 [ OK ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_S8S8 (620 ms) [ RUN ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_S8S8 [ OK ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_S8S8 (497 ms) [ RUN ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_S8S8 [ OK ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_S8S8 (488 ms) [ RUN ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_S8S8 [ OK ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_S8S8 (503 ms) [ RUN ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_U8U8 [ OK ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_U8U8 (495 ms) [ RUN ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_U8U8 [ OK ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_U8U8 (488 ms) [ RUN ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_U8U8 [ OK ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_U8U8 (492 ms) [ RUN ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_U8X8 [ OK ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_U8X8 (502 ms) [ RUN ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_S8U8 [ OK ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_S8U8 (452 ms) [ RUN ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_S8U8 [ OK ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_S8U8 (454 ms) [ RUN ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_S8U8 [ OK ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_S8U8 (446 ms) [ RUN ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_S8U8 [ OK ] 
MatMulIntegerToFloat.HasZeroPoint_HasBias_test_S8U8 (508 ms) [ RUN ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_U8S8 [ OK ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_U8S8 (456 ms) [ RUN ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_U8S8 [ OK ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_U8S8 (455 ms) [ RUN ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_U8S8 [ OK ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_U8S8 (447 ms) [ RUN ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_U8S8 [ OK ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_U8S8 (465 ms) [ RUN ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_U8U8 [ OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_U8U8 (111 ms) [ RUN ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_U8S8 [ OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_U8S8 (115 ms) [ RUN ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_S8S8 [ OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_S8S8 (114 ms) [ RUN ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_S8U8 [ OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_S8U8 (110 ms) [ RUN ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16 [ OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16 (112 ms) [ RUN ] MatMulIntegerToFloat.MatMulInteger_With_ZeroPoint [ OK ] MatMulIntegerToFloat.MatMulInteger_With_ZeroPoint (337 ms) [----------] 22 tests from MatMulIntegerToFloat (8679 ms total) [----------] Global test environment tear-down [==========] 22 tests from 1 test suite ran. (8680 ms total) [ PASSED ] 22 tests. memleakdbg: ----- No memory leaks detected ----- ``` ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> * `CalculateMatMulIntegerToFloat` to replace CPU EP run reference * Added more FP32 testcases to isolate all input datatype combinations * Added fixed input to `MatMulIntegerToFloat_FP16` test cases as for FP16 test cases. 
`onnxruntime/test/testdata/matmul_integer_to_float.py` is capable of generating FP16 models, but we do not produce any for now	2024-03-04 11:55:35 -08:00
inisis	2e13d5f0ab	fix split shape inference error for opset >= 13 (#19756 ) ### Description Get the Split operator's split sections according to the opset. ### Motivation and Context For opset 13 and higher, the split sections are passed as an input rather than as an attribute.	2024-03-04 09:41:36 -08:00
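The opset dependence described above can be sketched in a few lines. This is an illustrative sketch only, not the code from the PR: `resolve_split_sizes` and its parameters are hypothetical names, and the even-division fallback is a simplifying assumption that ignores the uneven splitting allowed by later opsets.

```python
from typing import List, Optional, Sequence


def resolve_split_sizes(
    opset: int,
    attr_split: Optional[Sequence[int]],
    input_split: Optional[Sequence[int]],
    dim_size: int,
    num_outputs: int,
) -> List[int]:
    """Pick the split sizes from the place the opset puts them (hypothetical helper)."""
    # Before opset 13 `split` is a node attribute; from opset 13 on it is an optional input.
    sizes = attr_split if opset < 13 else input_split
    if sizes is None:
        # No explicit sizes given: divide the axis evenly among the outputs.
        # Assumption: this sketch rejects uneven division instead of handling it.
        if dim_size % num_outputs != 0:
            raise ValueError("axis size is not divisible by the number of outputs")
        sizes = [dim_size // num_outputs] * num_outputs
    if sum(sizes) != dim_size:
        raise ValueError("split sizes must sum to the axis size")
    return list(sizes)
```

Reading the sizes from the wrong place (the attribute on an opset 13 model, for example) would silently fall back to an even split, which is the kind of shape inference error the commit title refers to.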
ironman	9acaf534a6	Benchmark - Updating llama-2 requirement files (#19716 ) ### Description ### Motivation and Context	2024-03-04 07:29:58 -08:00
Yi Zhang	9460597b21	Update copying API header files (#19736 ) ### Description Make the Linux logic consistent with Windows. ### Motivation and Context onnxruntime_lite_custom_op.h is in the Windows zip package but not in the Linux zip package `acbfc29f27/tools/ci_build/github/azure-pipelines/templates/c-api-artifacts-package-and-publish-steps-windows.yml (L67)` Co-authored-by: Your Name <your@email.com>	2024-03-02 11:33:47 +08:00
Adrian Lizarraga	2d79052ec3	[QNN Quant] Add preprocessing option to transpose graph inputs/outputs to channel-last (#19731 ) ### Description Adds the optional parameters `inputs_to_make_channel_last` and `outputs_to_make_channel_last` to the `qnn_preprocess_model()` function. ```python """ inputs_to_make_channel_last: List of graph input names to transpose to be "channel-last". For example, if "input0" originally has the shape (N, C, D1, D2, ..., Dn), the resulting model will change input0's shape to (N, D1, D2, ..., Dn, C) and add a transpose node after it. Original: input0 (N, C, D1, D2, ..., Dn) --> <Nodes> Updated: input0 (N, D1, D2, ..., Dn, C) --> Transpose --> input0_chanfirst (N, C, D1, D2, ..., Dn) --> <Nodes> This can potentially improve inference latency for QDQ models running on QNN EP because the additional transpose node may allow other transpose nodes inserted during ORT layout transformation to cancel out. outputs_to_make_channel_last: List of graph output names to transpose to be "channel-last". For example, if "output0" originally has the shape (N, C, D1, D2, ..., Dn), the resulting model will change output0's shape to (N, D1, D2, ..., Dn, C) and add a transpose node before it. Original: <Nodes> --> output0 (N, C, D1, D2, ..., Dn) Updated: <Nodes> --> output0_chanfirst (N, C, D1, D2, ..., Dn) --> Transpose --> output0 (N, D1, D2, ..., Dn, C) This can potentially improve inference latency for QDQ models running on QNN EP because the additional transpose node may allow other transpose nodes inserted during ORT layout transformation to cancel out. """ ``` NOTE: If you use these options with the quantization scripts, you'll have to make sure your data_reader feeds in transposed input data. It won't happen automatically. ### Motivation and Context Native QNN operators use the channel-last data layout, but ONNX uses channel-first. 
To bridge the gap, ORT's layout transformer inserts transposes around layout-sensitive nodes and updates their domain to indicate that they now operate on channel-last data. The transpose optimizer is able to remove most of these inserted transposes, but not all transposes can always be removed (i.e., some could remain at the graph's inputs and outputs). We've found that these extra transpose nodes can significantly degrade inference latency on QNN EP. One workaround (provided by this PR) is to add _additional_ transpose nodes at the graph inputs or outputs. These additional nodes can often help the ORT transpose optimizer cancel out any remaining transpose nodes, which significantly improves latency. Additionally, it may make more sense for some kinds of inputs to just be in channel-last form (e.g., images), avoiding the need to pre-transpose the input data before inference. Example at the input: ``` Original: input0 (N, C, D1, D2, ..., Dn) --> <Nodes> Updated: input0 (N, D1, D2, ..., Dn, C) --> Transpose --> input0_chanfirst (N, C, D1, D2, ..., Dn) --> <Nodes> ``` Example at the output: ``` Original: <Nodes> --> output0 (N, C, D1, D2, ..., Dn) Updated: <Nodes> --> output0_chanfirst (N, C, D1, D2, ..., Dn) --> Transpose --> output0 (N, D1, D2, ..., Dn, C) ```	2024-03-01 18:39:51 -08:00
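Because the data reader has to feed already-transposed inputs when these options are used, it may help to see the two permutations written out. The sketch below uses NumPy only; `channel_last_perm` and `channel_first_perm` are hypothetical helper names, not part of the quantization scripts.

```python
import numpy as np


def channel_last_perm(rank: int) -> list:
    # (N, C, D1, ..., Dn) -> (N, D1, ..., Dn, C)
    return [0] + list(range(2, rank)) + [1]


def channel_first_perm(rank: int) -> list:
    # (N, D1, ..., Dn, C) -> (N, C, D1, ..., Dn); this is what the added Transpose node applies
    return [0, rank - 1] + list(range(1, rank - 1))


# A channel-first batch: N=2, C=3, D1=4, D2=5.
x = np.arange(2 * 3 * 4 * 5).reshape(2, 3, 4, 5)

# What a data reader would feed to a model whose input was made channel-last.
x_last = np.transpose(x, channel_last_perm(x.ndim))

# The Transpose inserted after the input restores the original layout.
x_back = np.transpose(x_last, channel_first_perm(x.ndim))
```

For a rank-4 tensor the two permutations are `[0, 2, 3, 1]` and `[0, 3, 1, 2]`, and applying one after the other returns the original array.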
zesongw	de3158e78d	[WebNN EP] Add constraints for MatMul (#19713 ) ### Description Add constraints to MatMul: - The input must be at least 2D. - CPU backend: The input ranks must be the same. - CPU backend: The input shapes, except for the last two axes, must be the same. ### Motivation and Context Prevent regression for some models.	2024-03-01 16:55:50 -08:00
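The three constraints listed above amount to a small shape check. The following is a sketch in Python for illustration; the function name `webnn_matmul_supported` and its signature are hypothetical and do not mirror the actual WebNN EP code.

```python
from typing import Sequence


def webnn_matmul_supported(
    shape_a: Sequence[int], shape_b: Sequence[int], cpu_backend: bool
) -> bool:
    """Return True if a MatMul with these input shapes passes the listed constraints."""
    # Every backend: both inputs must be at least 2D.
    if len(shape_a) < 2 or len(shape_b) < 2:
        return False
    if cpu_backend:
        # CPU backend: the ranks must match.
        if len(shape_a) != len(shape_b):
            return False
        # CPU backend: all dims except the last two must match (no batch broadcasting).
        if tuple(shape_a[:-2]) != tuple(shape_b[:-2]):
            return False
    return True
```

For example, shapes `(2, 2, 3)` and `(1, 3, 4)` would be rejected on the CPU backend because the leading batch dims differ, while the same pair passes when `cpu_backend` is false.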
Changming Sun	a0521f899e	Enable CPUINFO for all Windows builds (#19655 ) ### Description It was disabled in PR #9065. The reason given was: " api-ms-win-core-kernel32-legacy-*.dll wasn't available in Windows 8 and was added in Windows 10, so cpuinfo breaks our Windows 8 support. I'm disabling it again." We no longer support Windows 8, so we can add CPUINFO back. ### Motivation and Context To make the code simpler. If the library doesn't work as expected, we can submit a PR to its code base to fix it.	2024-03-01 16:23:20 -08:00
Yulong Wang	f06164ef8b	[js/web] transfer input buffer back to caller thread (#19677 ) ### Description When using the proxy worker, input buffers should be transferred back to the caller thread after the `run()` call is done. Fixes #19488	2024-03-01 14:50:06 -08:00
Yufeng Li	22176a5fa8	disable gemm f16 on CPU (#19744 ) ### Description Temporarily disable fp16 gemm on CPU because it usually needs a following Cast, which offsets the gain. More fp16 operator implementations and performance tuning are needed. Also fix a fusion error of LayerNormalization. ### Motivation and Context	2024-03-01 13:44:29 -08:00
Edward Chen	5672cdebdf	Update google benchmark to 1.8.3. (#19734 ) Update google benchmark to 1.8.3. Update deps_update_and_upload.py script to make it easier to use.	2024-03-01 11:01:58 -08:00
Changming Sun	ed550b5fe5	Change webgpu CI pipeline to use a preinstalled chrome (#19729 ) ### Description Change the webgpu CI pipeline to use a preinstalled Chrome. Hopefully this improves stability; the Chrome obtained from puppeteer often fails to start.	2024-02-29 20:36:29 -08:00
pengwa	acbfc29f27	Follow up fix for Gelu impl (#19693 ) ### Follow up fix for Gelu impl There are two minor comments in https://github.com/microsoft/onnxruntime/pull/19560. Fix them in this pull request. ### Motivation and Context	2024-03-01 10:57:14 +08:00
Scott McKay	2a857d9a86	Add ML Program support for more operators (#19527 ) ### Description Add support for: - Clip/Relu/Relu6 - Add/Mul/Div/Sub/Pow - GlobalAveragePool/GlobalMaxPool/AveragePool/MaxPool - Reshape - Gemm/MatMul Fix some build issues/warnings from changes. Fix a couple of potential issues with the Resize op as well (noticed due to change to reject inputs with empty data at a higher level). ### Motivation and Context Enable mobilenetv2 with ML Program	2024-03-01 10:23:29 +10:00
Dmitri Smirnov	5ee62a6bcc	CUDA Resize-18 implementation (#19595 ) ### Description Implement Resize-18 on CUDA. ### Motivation and Context Performance	2024-02-29 14:46:42 -08:00
Adam Louly	d5606cd7ee	Introducing customizable input names for loss in generate_artifacts. (#19705 ) # loss function extra inputs. Currently, the loss functions in onnxblock expect exactly two inputs in their build method. Occasionally, models may pass additional inputs, causing the build function to fail. To solve this issue, we can let users pass a list of loss input names to be used in the loss function.	2024-02-29 13:40:56 -08:00
Yi-Hong Lyu	ec0e4d3b65	Parallel Transpose_BSNH_to_BNSH (#19406 ) Achieved a speedup of 1.098 in MultiHeadAttention and an end-to-end speedup of 1.021 in the OCR model through parallelization of the Transpose_BSNH_to_BNSH operation.	2024-02-29 10:31:57 -08:00
Vincent Wang	937cdd651e	[ORTMODULE] Support Register Custom Triton Kernel (#19690 ) Add support for registering custom Triton kernel function.	2024-02-29 23:03:57 +08:00
PeixuanZuo	c311d1faf5	[ROCm] Update dockerfile (#19661 ) Update dockerfile to ROCm6.0	2024-02-29 17:51:29 +08:00
Adrian Lizarraga	c1bf7fcd2f	[QNN Quant] Ensure 16bit tensor quant overrides set MS domain (#19684 ) ### Description Ensures that DQ and Q ops use the msft domain if tensor quantization overrides specify 16-bit integer types. ### Motivation and Context ONNX does not yet support 16bit integer types for QuantizeLinear and DequantizeLinear ops (coming soon). For now, DQ/Q ops must use the MSFT domain. We have to also check if tensor quantization overrides force the use of 16-bit quantization types. If so, we must correctly set the domain for Q/DQ ops.	2024-02-29 01:19:25 -08:00
Vincent Wang	d2e6dd25ea	Merge GatherToSplitFusion and #19218 to a General Fusion (#19600 ) #19218 tried to fuse Gather/Slice to Split, but the logic has a problem. Scalar and 1-dim indices values in a Gather node produce different results: a scalar value produces a result tensor with the axis dim removed, while a 1-dim indices value keeps that dim, even when the dim value is 1. For example, Node \|-> Gather(indices=[0], axis=axis) \|-> Gather(indices=[1], axis=axis) \|-> Slice(index=2, axis=axis) is the same as Node \|-> Split(axis=axis) But Node \|-> Gather(indices=0, axis=axis) \|-> Gather(indices=1, axis=axis) \|-> Slice(index=2, axis=axis) is the same as Node \|-> Split(axis=axis) \|\|-> Squeeze(axis=axis) \|\|-> Squeeze(axis=axis) \|\|-> The previous PR doesn't take such Squeeze/Unsqueeze cases into account. This PR merges #19218 and GatherToSplitFusion into a general fusion, which relaxes the limit on the number of Gather and Slice nodes and checks all Gather and Slice consumers: if the indices of the Gather nodes and the start/end of the Slice nodes cover the specific dim of the input tensor, they can be fused into a Split, adding Squeeze nodes where necessary according to the dim count of each Gather's indices tensor. @rui-ren, please check if the fix can still be applied to your model.	2024-02-29 13:45:58 +08:00
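The scalar versus 1-dim distinction that motivates the extra Squeeze nodes can be reproduced with NumPy. In this sketch `np.take` stands in for Gather and `np.split` for Split; that mapping is an assumption made for illustration, and nothing here is taken from the fusion code itself.

```python
import numpy as np

x = np.arange(24).reshape(2, 3, 4)
axis = 1

# Split along `axis` into size-1 pieces; each piece keeps the axis (shape (2, 1, 4)).
parts = np.split(x, x.shape[axis], axis=axis)

# 1-dim indices keep the gathered axis, so the result equals a Split output as-is.
kept = np.take(x, [0], axis=axis)

# Scalar indices drop the gathered axis, so a Squeeze is needed after the Split.
dropped = np.take(x, 0, axis=axis)

assert np.array_equal(kept, parts[0])
assert np.array_equal(dropped, np.squeeze(parts[0], axis=axis))
```

Here `kept` has shape `(2, 1, 4)` and `dropped` has shape `(2, 4)`, which is why a fusion that ignores the rank of the indices tensor produces outputs of the wrong shape.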
Sophie Schoenmeyer	7455dd1f32	Update labeler.yml to change permissions (#19709 ) ### Description Updated github/issue-labeler permissions to give write access for issues. Tried to submit the same PR last week, but the checks kept failing, so I couldn't merge. ### Motivation and Context Enables issue labeling again, which has been broken since GitHub Actions permissions were changed a couple weeks ago.	2024-02-28 21:10:25 -08:00
Changming Sun	250779474d	Change "onnxruntime-Linux-CPU-For-Android-CI" machine pool to "onnxruntime-Ubuntu2204-AMD-CPU" (#19698 ) ### Description The original one reports "out of disk space", which needs to be investigated.	2024-02-28 19:36:26 -08:00
Yulong Wang	e30618d055	[js/webgpu] use Headless for webgpu test by default (#19702 ) ### Description Use Chromium Headless for webgpu tests by default. Still use normal windowed Chromium when debug=true or perfMode=true. Use the [`--headless=new`](https://developer.chrome.com/docs/chromium/new-headless) mode. ### Motivation and Context Use a more stable way to launch npm tests to avoid a "chrome not found" issue in the pipeline, which may be caused by the windowed application.	2024-02-28 16:05:08 -08:00
Changming Sun	a93c31e3c9	Update dml-vs-2022.yml (#19687 ) ### Description Fix a build error in "Zip-Nuget-Java-Nodejs Packaging Pipeline" which deletes files too early.	2024-02-28 12:03:17 -08:00
Adrian Lizarraga	913bdc7306	[QNN Quant] Handle external data for QNN preprocessing/quant (#19670 ) ### Description - Adds parameters to `qnn_preprocess_model()` to allow saving the new model with external data. - Updates `get_qnn_qdq_config()` to: - Load model without external data (it is not needed) - Return a quantization configuration with `use_external_data_format` set to `True` if the model has external data or if the model is >= 2GB. ### Motivation and Context Update QNN quantization to better handle large models that use external data.	2024-02-28 08:30:12 -08:00
Changming Sun	7a147fc6f7	Remove a bash task from webgpu CI pipeline (#19682 ) ### Description It is a "Bash" task that requires running bash on Windows. Most Windows operating systems do not have Bash installed. Given this task is only for debugging purposes, we can remove it for now. ### Motivation and Context I am making this change because I am regenerating the VM image in a different manner, and the new image does not contain bash. Once this PR is in, I can switch the images.	2024-02-28 18:20:53 +08:00
pengwa	026e3178ae	Improve memory matrix for ORTModule (#19620 ) ### Memory matrix for ORTModule Collect parameter/gradient/buffers sizes also. Exposed as a function, can be used externally for debugging purpose. ``` 2024-02-27 07:18:55,283 orttraining.rank-0 [INFO] - rank-0 step 1 memory (MiB) \| phase: pre_forward \| allocated: 5331 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 219 \| max inactive: 816 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:55,322 orttraining.rank-0 [INFO] - rank-0 step 1 memory (MiB) \| phase: post_forward \| allocated: 8162 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 400 \| max inactive: 816 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:55,358 orttraining.rank-0 [INFO] - rank-0 step 1 memory (MiB) \| phase: pre_backward \| allocated: 8926 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 400 \| max inactive: 816 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:55,438 orttraining.rank-0 [INFO] - rank-0 step 1 memory (MiB) \| phase: post_backward \| allocated: 6098 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 218 \| max inactive: 831 \| param: 5314 \| grad: 12 \| buffer: 8 0%\|▏ \| 2/3200 [01:27<32:05:11, 36.12s/it]2024-02-27 07:18:55,498 orttraining.rank-0 [INFO] - rank-0 step 2 memory (MiB) \| phase: pre_forward \| allocated: 5331 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 219 \| max inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:55,537 orttraining.rank-0 [INFO] - rank-0 step 2 memory (MiB) \| phase: post_forward \| allocated: 8162 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 400 \| max inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:55,576 orttraining.rank-0 [INFO] - rank-0 step 2 memory (MiB) \| phase: pre_backward \| allocated: 8926 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 400 
\| max inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:55,657 orttraining.rank-0 [INFO] - rank-0 step 2 memory (MiB) \| phase: post_backward \| allocated: 6098 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 218 \| max inactive: 831 \| param: 5314 \| grad: 12 \| buffer: 8 0%\|▏ \| 3/3200 [01:27<17:30:57, 19.72s/it]2024-02-27 07:18:55,711 orttraining.rank-0 [INFO] - rank-0 step 3 memory (MiB) \| phase: pre_forward \| allocated: 5331 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 219 \| max inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:55,750 orttraining.rank-0 [INFO] - rank-0 step 3 memory (MiB) \| phase: post_forward \| allocated: 8162 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 400 \| max inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:55,786 orttraining.rank-0 [INFO] - rank-0 step 3 memory (MiB) \| phase: pre_backward \| allocated: 8926 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 400 \| max inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:55,867 orttraining.rank-0 [INFO] - rank-0 step 3 memory (MiB) \| phase: post_backward \| allocated: 6098 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 218 \| max inactive: 831 \| param: 5314 \| grad: 12 \| buffer: 8 [2024-02-27 07:18:55,886] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. 
Reducing hysteresis to 1 0%\|▎ \| 4/3200 [01:28<10:39:52, 12.01s/it]2024-02-27 07:18:55,902 orttraining.rank-0 [INFO] - rank-0 step 4 memory (MiB) \| phase: pre_forward \| allocated: 5331 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 219 \| max inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:55,944 orttraining.rank-0 [INFO] - rank-0 step 4 memory (MiB) \| phase: post_forward \| allocated: 8162 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 400 \| max inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:55,979 orttraining.rank-0 [INFO] - rank-0 step 4 memory (MiB) \| phase: pre_backward \| allocated: 8926 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 400 \| max inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:56,060 orttraining.rank-0 [INFO] - rank-0 step 4 memory (MiB) \| phase: post_backward \| allocated: 6098 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 218 \| max inactive: 831 \| param: 5314 \| grad: 12 \| buffer: 8 0%\|▍ \| 5/3200 [01:28<6:53:04, 7.76s/it]2024-02-27 07:18:56,115 orttraining.rank-0 [INFO] - rank-0 step 5 memory (MiB) \| phase: pre_forward \| allocated: 5331 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 219 \| max inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:56,154 orttraining.rank-0 [INFO] - rank-0 step 5 memory (MiB) \| phase: post_forward \| allocated: 8162 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 400 \| max inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:56,190 orttraining.rank-0 [INFO] - rank-0 step 5 memory (MiB) \| phase: pre_backward \| allocated: 8926 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 400 \| max inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:56,270 orttraining.rank-0 [INFO] - rank-0 step 5 memory (MiB) \| phase: 
post_backward \| allocated: 6098 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 218 \| max inactive: 831 \| param: 5314 \| grad: 12 \| buffer: 8 0%\|▍ \| 6/3200 [01:28<4:36:19, 5.19s/it]2024-02-27 07:18:56,323 orttraining.rank-0 [INFO] - rank-0 step 6 memory (MiB) \| phase: pre_forward \| allocated: 5331 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 219 \| max inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:56,365 orttraining.rank-0 [INFO] - rank-0 step 6 memory (MiB) \| phase: post_forward \| allocated: 8162 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 400 \| max inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:56,398 orttraining.rank-0 [INFO] - rank-0 step 6 memory (MiB) \| phase: pre_backward \| allocated: 8926 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 400 \| max inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:56,478 orttraining.rank-0 [INFO] - rank-0 step 6 memory (MiB) \| phase: post_backward \| allocated: 6098 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 218 \| max inactive: 831 \| param: 5314 \| grad: 12 \| buffer: 8 0%\|▌ \| 7/3200 [01:28<3:09:33, 3.56s/it]2024-02-27 07:18:56,533 orttraining.rank-0 [INFO] - rank-0 step 7 memory (MiB) \| phase: pre_forward \| allocated: 5331 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 219 \| max inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:56,572 orttraining.rank-0 [INFO] - rank-0 step 7 memory (MiB) \| phase: post_forward \| allocated: 8162 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 400 \| max inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:56,608 orttraining.rank-0 [INFO] - rank-0 step 7 memory (MiB) \| phase: pre_backward \| allocated: 8926 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 400 \| max 
inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:56,727 orttraining.rank-0 [INFO] - rank-0 step 7 memory (MiB) \| phase: post_backward \| allocated: 6098 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 218 \| max inactive: 831 \| param: 5314 \| grad: 12 \| buffer: 8 0%\|▌ \| 8/3200 [01:28<2:13:48, 2.52s/it]2024-02-27 07:18:56,806 orttraining.rank-0 [INFO] - rank-0 step 8 memory (MiB) \| phase: pre_forward \| allocated: 5331 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 219 \| max inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:56,846 orttraining.rank-0 [INFO] - rank-0 step 8 memory (MiB) \| phase: post_forward \| allocated: 8162 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 400 \| max inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:56,882 orttraining.rank-0 [INFO] - rank-0 step 8 memory (MiB) \| phase: pre_backward \| allocated: 8926 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 400 \| max inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:56,962 orttraining.rank-0 [INFO] - rank-0 step 8 memory (MiB) \| phase: post_backward \| allocated: 6098 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 218 \| max inactive: 831 \| param: 5314 \| grad: 12 \| buffer: 8 0%\|▋ \| 9/3200 [01:29<1:36:03, 1.81s/it]2024-02-27 07:18:57,053 orttraining.rank-0 [INFO] - rank-0 step 9 memory (MiB) \| phase: pre_forward \| allocated: 5331 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 219 \| max inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 2024-02-27 07:18:57,094 orttraining.rank-0 [INFO] - rank-0 step 9 memory (MiB) \| phase: post_forward \| allocated: 8162 \| max allocated: 9039 \| cached: 9382 \| max cached: 9382 \| inactive: 400 \| max inactive: 831 \| param: 5314 \| grad: 0 \| buffer: 8 ```	2024-02-28 15:57:05 +08:00
Yi Zhang	f95c0773a1	Add shared memory flag in docker (#19672 ) ### Description ### Motivation and Context Ref: https://docs.nvidia.com/deeplearning/frameworks/user-guide/index.html#setincshmem Co-authored-by: Your Name <your@email.com>	2024-02-28 10:40:40 +08:00
Maximilian Müller	c20ced4132	Use CMake's find package for CUDA libs (#19673 ) ### Description Answers issue #19640. More details are in the issue; basically, I am changing all the include directory and link directory usage to CMake's `CUDA::*` targets.	2024-02-27 11:26:48 -08:00
Yulong Wang	3cb81cdde2	[js/common] move 'env.wasm.trace' to 'env.trace' (#19617 ) ### Description Try to move 'env.wasm.trace' to 'env.trace' to make it less confusing, because it also works in webgpu. Marked 'env.wasm.trace' as deprecated.	2024-02-27 11:07:15 -08:00
zesongw	2e4d1b8f1b	[WebNN EP] Add support for Op MatMul of WebNN CPU backend (#19413 ) Enable MatMul support for WebNN CPU backend to support more models.	2024-02-27 10:01:12 -08:00
Scott McKay	1c468a03b9	Improve Nuget-CUDA-Packaging-Pipeline (#19668 ) ### Description * Publish the artifacts as late as possible * once published the artifacts are immutable, and any retry will fail if they exist * if any step fails after publishing the stage cannot be retried * use powershell to clean up * DeleteFiles is taking >30 mins and causing the stage to time out * powershell took < 1s ### Motivation and Context Make pipeline more robust	2024-02-27 09:27:43 -08:00
Scott McKay	580ee20dfc	Tweak Windows build parallelization settings (#19664 ) ### Description Use UseMultiToolTask and limit the number of cl.exe instances running. MultiToolTask info: https://devblogs.microsoft.com/cppblog/improved-parallelism-in-msbuild/ Info on why limiting CL_MPCount can help: https://github.com/Microsoft/checkedc-clang/wiki/Parallel-builds-of-clang-on-Windows The current CIs have 4 cores (both physical and logical). Hardcoded the GPU build in win-ci.yml to use CL_MPCount of 2 as that seems to work fine. Can adjust if needed to base it on the actual number of cores or to use build.py to build. Caveat: I've run about 16 builds and haven't seen a slow build yet, but as the root cause of the slow builds isn't really known this isn't guaranteed to be a fix. ### Motivation and Context Try and prevent super slow GPU builds by reducing number of tasks potentially running in parallel.	2024-02-27 08:56:16 -08:00
Yi Zhang	3b46ab6439	Re-add testing removed by mistake. (#19647 )	2024-02-27 08:46:29 -08:00
Adrian Lizarraga	4838cb6b3e	[QNN Quantization] Ensure fused nodes have names (#19650 ) ### Description - Updates the `qnn_preprocess_model()` method to set a name for any new nodes added to the graph (due to fusion). - Updates the `qnn_preprocess_model()` method to set a name for any unnamed nodes that previously existed in the original graph. - Adds unit tests for fusions (previously missing) - Checks that fused node names exist and are unique - Checks that fused graph is equivalent to original graph ### Motivation and Context Nodes are not strictly required to have names. However, a planned/upcoming feature to support mixed-precision (integer) quantized models needs nodes to have names.	2024-02-27 02:27:35 -08:00
cloudhan	1e69b61238	Make version string detection more robust (#19615 ) `/opt/rocm/.info/version-dev` is only available if the `rocm-dev` metapackage is installed. That metapackage pulls in many packages that users do not need, so they may opt for finer-grained control over what is installed. Fall back to `rocm_version.h` in case `rocm-dev` is not installed.	2024-02-27 16:06:06 +08:00
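A fallback of this kind could look roughly like the sketch below. It is not the code from the commit: `detect_rocm_version` is a hypothetical helper that works on file contents passed in as strings, the `6.0.0-91` style of the `version-dev` file is an assumed format, and the `ROCM_VERSION_MAJOR`/`MINOR`/`PATCH` macro names are assumed to be what `rocm_version.h` defines.

```python
import re
from typing import Optional


def detect_rocm_version(
    version_dev_text: Optional[str], header_text: Optional[str]
) -> Optional[str]:
    """Return 'major.minor.patch', preferring version-dev and falling back to the header."""
    # Preferred source: contents of /opt/rocm/.info/version-dev, if the file was readable.
    if version_dev_text and version_dev_text.strip():
        # Assumption: a build suffix after '-' is dropped, e.g. "6.0.0-91" -> "6.0.0".
        return version_dev_text.strip().split("-")[0]
    # Fallback: parse the version macros out of rocm_version.h (macro names are assumed).
    if header_text:
        fields = dict(
            re.findall(r"#define\s+ROCM_VERSION_(MAJOR|MINOR|PATCH)\s+(\d+)", header_text)
        )
        if {"MAJOR", "MINOR", "PATCH"} <= fields.keys():
            return "{MAJOR}.{MINOR}.{PATCH}".format(**fields)
    # Neither source was usable.
    return None
```

Passing the contents in as strings keeps the sketch independent of where a particular installation places the two files.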
duanshengliu	9e19684944	Fix the TypeError issue in quantize.py (#19459 ) ### Description Fix related bug as described in https://github.com/microsoft/onnxruntime/issues/19430	2024-02-26 20:56:32 -08:00
Rachel Guo	5bb58a10e7	Enable the most verbose logging level in detox E2E React Native CI (#19659 ) ### Description The RN CI fails intermittently with the error "app seems to idle". Enable the most verbose logging level (steps to dump device.log from the detox folder/artifacts can be added if necessary) to at least get more information. ### Motivation and Context --------- Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>	2024-02-26 20:00:14 -08:00
kailums	6f566562ce	support user_compute_stream for rocm ep (#19619 ) ### Description Following PR #19229, which added support for the CUDA EP to use an external compute stream, we add the same support for the ROCm EP. While testing this feature with torch, we found that torch uses stream 0 as the default stream, so `torch.cuda.current_stream()` returns `0` for the current stream, but ORT treats `0` or `nullptr` as invalid and resets has_user_compute_stream to false. We will remove the has_user_compute_stream option in the future. ### Motivation and Context The motivation for this PR is that we want to use torch.cuda.graph to capture ORT's running kernels, which requires torch and ORT to run in the same stream, so we use this API to set ORT's working stream.	2024-02-27 11:31:03 +08:00
Scott McKay	8a71b65765	Remove skipping of Reshape from NNAPI EP (#19618 ) ### Description A number of Qualcomm Snapdragon chipsets do not produce correct output if we skip the Reshape, which ironically was a performance optimization for Snapdragon chips. Perf testing showed that Squeeze also seems to execute on CPU, so there's no benefit to using it as an alternative where possible, e.g. GlobalPool -> Reshape to 2D -> Gemm could potentially be replaced with GlobalPool -> Squeeze dims 2 and 3 -> Gemm if that offered better performance. ### Motivation and Context #19518	2024-02-27 11:35:27 +10:00
Changming Sun	18c8fab1ae	Fix a bug in build.py (#19652 ) ### Description Fix a bug in build.py that accidentally disabled C# tests for most builds when "--build_nuget" is specified. ### Motivation and Context The bug was introduced in PR #8892 .	2024-02-26 15:58:09 -08:00
Scott McKay	8bd943be39	Retry flaky XCode iOS UI tests if we get a known error (#19639 ) ### Description Xcode UI tests seem to be flaky: https://github.com/orgs/community/discussions/68807 Add a couple of retries if we get a "Timed out while loading Accessibility." error which is transient. ### Motivation and Context	2024-02-27 09:31:32 +10:00
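The retry policy in the commit above (retry only when the failure output contains a known transient message) can be illustrated with a minimal sketch. The function name, the `run_once` callable and the default of two retries are assumptions for illustration, not the CI script from the PR.

```python
from typing import Callable, Tuple

# Error text quoted in the commit message; treated here as the only retryable failure.
KNOWN_TRANSIENT_ERROR = "Timed out while loading Accessibility."


def run_with_retries(
    run_once: Callable[[], Tuple[int, str]],
    max_retries: int = 2,
    transient_marker: str = KNOWN_TRANSIENT_ERROR,
) -> Tuple[int, int]:
    """Call run_once() until it succeeds, fails for an unknown reason,
    or the retry budget is spent.

    run_once returns (exit_code, combined_output).
    Returns (final_exit_code, number_of_attempts).
    """
    attempts = 0
    while True:
        exit_code, output = run_once()
        attempts += 1
        succeeded = exit_code == 0
        retryable = transient_marker in output
        if succeeded or not retryable or attempts > max_retries:
            return exit_code, attempts
```

In practice `run_once` would wrap the test command, for example with `subprocess.run(..., capture_output=True, text=True)`, and return its exit code and output. Failures that do not match the marker are reported immediately, so real test failures are not hidden by the retries.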
Sumit Agarwal	a9568935a5	[DML EP] Enable DML Graph Serialization (#19505 ) ### Description This PR adds a feature to serialize all DML EP partitions into DML currency individually for a given model. This feature can be dynamically turned on by using DML EP option `ep.dml.enable_graph_serialization`. ### Motivation and Context - Why is this change required? What problem does it solve? Useful when a user wants to capture the DML EP specific partition into DML currency to mitigate the dependency on the framework.	2024-02-26 11:35:13 -08:00
Yufeng Li	430a086f22	fix memory mapping on Windows (#19623 ) ### Description Windows memory map casts mapped_offset to DWORD directly, so it is truncated if it is larger than 2^32-1. We need to set dwFileOffsetHigh for this case. ### Motivation and Context The bug was found from #19450	2024-02-25 08:50:45 -08:00
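The arithmetic behind this fix can be shown with a short sketch. The Win32 `MapViewOfFile` call takes the file offset as two 32-bit halves, `dwFileOffsetHigh` and `dwFileOffsetLow`, so a 64-bit offset has to be split rather than cast. The helper name below is hypothetical and the code only illustrates the split, not the actual C++ change.

```python
from typing import Tuple

DWORD_MASK = 0xFFFFFFFF  # a Windows DWORD is an unsigned 32-bit integer


def split_file_offset(offset: int) -> Tuple[int, int]:
    """Split a 64-bit file offset into (high, low) 32-bit halves."""
    if not 0 <= offset < 1 << 64:
        raise ValueError("offset must fit in an unsigned 64-bit integer")
    high = (offset >> 32) & DWORD_MASK  # would be passed as dwFileOffsetHigh
    low = offset & DWORD_MASK           # would be passed as dwFileOffsetLow
    return high, low


# Casting straight to a DWORD keeps only the low half:
offset = 5 * 2**32 + 7
truncated = offset & DWORD_MASK          # 7, the upper bits are lost
high, low = split_file_offset(offset)    # (5, 7), nothing is lost
```

For offsets up to 2^32-1 the high half is 0, which is why the truncation only shows up with mappings beyond the first 4 GiB of a file.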
Yi Zhang	0fcc6fb760	Add Whisper model in CI (#19604 ) ### Description Add Whisper Conversion and E2E into Big Models pipeline ### Motivation and Context --------- Co-authored-by: Your Name <your@email.com> Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>	2024-02-25 14:04:22 +08:00
Yi Zhang	c980149c85	Add log for random exception in Linux GPU Test Stage. (#19569 ) ### Description 1. Check GPU status in docker. 2. Use stages so that the test stage can leverage existing build artifacts. ### Motivation and Context To investigate the root cause of the random exception `CUDA failure 100: no CUDA-capable device is detected`	2024-02-24 13:00:53 -08:00
Yulong Wang	0edb035808	[js/web] fix suite test list for zero sized tensor (#19638 ) ### Description Fixes the build break introduced by #19614. Currently the WebGL backend does not support zero sized tensors. This change splits the test data into 2 parts and only enables zero sized tensor tests for WebGPU.	2024-02-24 10:09:07 -08:00
Changming Sun	9ccdc4961a	Stop using apiset in OneCore build: use onecoreuap.lib instead of onecoreuap_apiset.lib (#19632 ) ### Description Stop using apiset in OneCore build: use onecoreuap.lib instead of onecoreuap_apiset.lib in onecore build. ### Motivation and Context 1. Now all Windows Editions come with Reverse Forwarders. We should just use the normal onecore libs. 2. Many new Windows APIs are only available in [windows umbrella libraries](https://learn.microsoft.com/en-us/windows/win32/apiindex/windows-umbrella-libraries). So these libraries are not specific to Windows CoreOS or Onecore. 3. Going forward we should use "IsApiSetImplemented" to guard our API usages: https://learn.microsoft.com/en-us/windows/win32/apiindex/detect-api-set-availability . After this change, our built binaries can pass apivalidator's check. ``` C:\local\apivalidator>apivalidator.exe -BinaryPath:C:\src\onnxruntime\b\Debug\Debug\onnxruntime.dll -SupportedApiXmlFiles:onecoreuap_DDIs.xml ApiValidation: Summary: "C:\src\onnxruntime\b\Debug\Debug\onnxruntime.dll" is Universal ApiValidation: All binaries are Universal ``` So it gives an easy way to test ONNX Runtime's compatibility with Windows versions.	2024-02-23 22:31:57 -08:00
Scott McKay	c12a20bef9	Add helper to run CIs for a branch using `az pipelines`. (#16843 ) ### Description Add helper to run CIs for a branch using `az pipelines`. This can be used to easily kick off multiple CIs for a branch prior to creating a PR. Update run_CIs_for_external_pr.py so the CI list can be shared. Request json output from `gh pr view` so the current state is more easily parsed. ### Motivation and Context	2024-02-24 14:06:30 +10:00

10651 commits