onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-04 04:07:22 +00:00

Author	SHA1	Message	Date
Yulong Wang	f06164ef8b	[js/web] transfer input buffer back to caller thread (#19677 ) ### Description When using proxy worker, input buffers should be transferred back to the caller thread after `run()` call is done. Fixes #19488	2024-03-01 14:50:06 -08:00
Yufeng Li	22176a5fa8	disable gemm f16 on CPU (#19744 ) ### Description Temporarily disable fp16 gemm on CPU because it usually needs a following Cast, which offsets the gain. More fp16 operator implementations and performance tuning are needed. Also fix a fusion error of LayerNormalization.	2024-03-01 13:44:29 -08:00
Edward Chen	5672cdebdf	Update google benchmark to 1.8.3. (#19734 ) Update google benchmark to 1.8.3. Update deps_update_and_upload.py script to make it easier to use.	2024-03-01 11:01:58 -08:00
Changming Sun	ed550b5fe5	Change webgpu CI pipeline to use a preinstalled chrome (#19729 ) ### Description Change webgpu CI pipeline to use a preinstalled Chrome, in the hope of improving stability. The Chrome downloaded by puppeteer often fails to start.	2024-02-29 20:36:29 -08:00
pengwa	acbfc29f27	Follow up fix for Gelu impl (#19693 ) ### Follow up fix for Gelu impl There are two minor comments in https://github.com/microsoft/onnxruntime/pull/19560. Fix them in this pull request.	2024-03-01 10:57:14 +08:00
Scott McKay	2a857d9a86	Add ML Program support for more operators (#19527 ) ### Description Add support for: - Clip/Relu/Relu6 - Add/Mul/Div/Sub/Pow - GlobalAveragePool/GlobalMaxPool/AveragePool/MaxPool - Reshape - Gemm/MatMul Fix some build issues/warnings from changes. Fix a couple of potential issues with the Resize op as well (noticed due to change to reject inputs with empty data at a higher level). ### Motivation and Context Enable mobilenetv2 with ML Program	2024-03-01 10:23:29 +10:00
Dmitri Smirnov	5ee62a6bcc	CUDA Resize-18 implementation (#19595 ) ### Description Implement Resize-18 on CUDA. ### Motivation and Context Performance	2024-02-29 14:46:42 -08:00
Adam Louly	d5606cd7ee	Introducing customizable input names for loss in generate_artifacts. (#19705 ) # loss function extra inputs. Currently, the loss functions in onnxblock expect exactly two inputs in their build method. Occasionally, models may pass additional inputs, causing the build function to fail. To solve this issue, we can let users pass a list of loss input names to be used in the loss function.	2024-02-29 13:40:56 -08:00
Yi-Hong Lyu	ec0e4d3b65	Parallel Transpose_BSNH_to_BNSH (#19406 ) Achieved a speedup of 1.098 in MultiHeadAttention and an end-to-end speedup of 1.021 in the OCR model through parallelization of the Transpose_BSNH_to_BNSH operation.	2024-02-29 10:31:57 -08:00
Vincent Wang	937cdd651e	[ORTMODULE] Support Register Custom Triton Kernel (#19690 ) Add support for registering custom Triton kernel function.	2024-02-29 23:03:57 +08:00
PeixuanZuo	c311d1faf5	[ROCm] Update dockerfile (#19661 ) Update dockerfile to ROCm6.0	2024-02-29 17:51:29 +08:00
Adrian Lizarraga	c1bf7fcd2f	[QNN Quant] Ensure 16bit tensor quant overrides set MS domain (#19684 ) ### Description Ensures that DQ and Q ops use the msft domain if tensor quantization overrides specify 16-bit integer types. ### Motivation and Context ONNX does not yet support 16bit integer types for QuantizeLinear and DequantizeLinear ops (coming soon). For now, DQ/Q ops must use the MSFT domain. We have to also check if tensor quantization overrides force the use of 16-bit quantization types. If so, we must correctly set the domain for Q/DQ ops.	2024-02-29 01:19:25 -08:00
Vincent Wang	d2e6dd25ea	Merge GatherToSplitFusion and #19218 to a General Fusion (#19600 ) #19218 tried to fuse Gather/Slice to Split, but the logic has a problem. A scalar value and a 1-dim value of indices in a Gather node produce different results: a scalar value produces a result tensor with the axis dim removed, while a 1-dim indices value keeps that dim, even when the dim value is 1. For example, Node |-> Gather(indices=[0], axis=axis) |-> Gather(indices=[1], axis=axis) |-> Slice(index=2, axis=axis) is the same as Node |-> Split(axis=axis). But Node |-> Gather(indices=0, axis=axis) |-> Gather(indices=1, axis=axis) |-> Slice(index=2, axis=axis) is the same as Node |-> Split(axis=axis) ||-> Squeeze(axis=axis) ||-> Squeeze(axis=axis) ||-> The previous PR doesn't take such cases related to Squeeze/Unsqueeze into account. This PR merges #19218 and GatherToSplitFusion into a general fusion, which relaxes the limit on the number of Gather and Slice nodes. It checks all Gather and Slice consumers; if the indices of Gather and the start/end of Slice cover the specific dim of the input tensor, they can be fused into a Split, with Squeeze added where necessary according to the dim count of the indices tensor in Gather. @rui-ren, please check if the fix can still be applied to your model.	2024-02-29 13:45:58 +08:00
Sophie Schoenmeyer	7455dd1f32	Update labeler.yml to change permissions (#19709 ) ### Description Updated github/issue-labeler permissions to give write access for issues. Tried to submit the same PR last week, but the checks kept failing, so I couldn't merge. ### Motivation and Context Enables issue labeling again, which has been broken since GitHub Actions permissions were changed a couple weeks ago.	2024-02-28 21:10:25 -08:00
Changming Sun	250779474d	Change "onnxruntime-Linux-CPU-For-Android-CI" machine pool to "onnxruntime-Ubuntu2204-AMD-CPU" (#19698 ) ### Description The original one reports "out of disk space", which needs to be investigated.	2024-02-28 19:36:26 -08:00
Yulong Wang	e30618d055	[js/webgpu] use Headless for webgpu test by default (#19702 ) ### Description Use Chromium Headless for webgpu tests by default. Still use normal Chromium with a window when debug=true or perfMode=true. Use the [`--headless=new`](https://developer.chrome.com/docs/chromium/new-headless) mode. ### Motivation and Context Try a more stable way to launch npm tests to avoid a "chrome not found" issue in the pipeline, which may be caused by the windowed application.	2024-02-28 16:05:08 -08:00
Changming Sun	a93c31e3c9	Update dml-vs-2022.yml (#19687 ) ### Description Fix a build error in "Zip-Nuget-Java-Nodejs Packaging Pipeline" which deletes files too early.	2024-02-28 12:03:17 -08:00
Adrian Lizarraga	913bdc7306	[QNN Quant] Handle external data for QNN preprocessing/quant (#19670 ) ### Description - Adds parameters to `qnn_preprocess_model()` to allow saving the new model with external data. - Updates `get_qnn_qdq_config()` to: - Load model without external data (it is not needed) - Return a quantization configuration with `use_external_data_format` set to `True` if the model has external data or if the model is >= 2GB. ### Motivation and Context Update QNN quantization to better handle large models that use external data.	2024-02-28 08:30:12 -08:00
Changming Sun	7a147fc6f7	Remove a bash task from webgpu CI pipeline (#19682 ) ### Description It is a "Bash" task that requires running bash on Windows. Most Windows operating systems do not have Bash installed. Given this task is only for debugging purposes, we can remove it for now. ### Motivation and Context I am making this change because I am regenerating the VM image in a different manner, and the new image does not contain bash. Once this PR is in, I can switch the images.	2024-02-28 18:20:53 +08:00
pengwa	026e3178ae	Improve memory matrix for ORTModule (#19620 ) ### Memory matrix for ORTModule Also collect parameter/gradient/buffer sizes. Exposed as a function that can be used externally for debugging purposes. ``` 2024-02-27 07:18:55,283 orttraining.rank-0 [INFO] - rank-0 step 1 memory (MiB) | phase: pre_forward | allocated: 5331 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 219 | max inactive: 816 | param: 5314 | grad: 0 | buffer: 8 2024-02-27 07:18:55,322 orttraining.rank-0 [INFO] - rank-0 step 1 memory (MiB) | phase: post_forward | allocated: 8162 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 400 | max inactive: 816 | param: 5314 | grad: 0 | buffer: 8 2024-02-27 07:18:55,358 orttraining.rank-0 [INFO] - rank-0 step 1 memory (MiB) | phase: pre_backward | allocated: 8926 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 400 | max inactive: 816 | param: 5314 | grad: 0 | buffer: 8 2024-02-27 07:18:55,438 orttraining.rank-0 [INFO] - rank-0 step 1 memory (MiB) | phase: post_backward | allocated: 6098 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 218 | max inactive: 831 | param: 5314 | grad: 12 | buffer: 8 ... ```	2024-02-28 15:57:05 +08:00
Yi Zhang	f95c0773a1	Add share memory Flag in docker (#19672 ) ### Description ### Motivation and Context Ref: https://docs.nvidia.com/deeplearning/frameworks/user-guide/index.html#setincshmem Co-authored-by: Your Name <your@email.com>	2024-02-28 10:40:40 +08:00
Maximilian Müller	c20ced4132	Use CMake's find package for CUDA libs (#19673 ) ### Description Answers issue #19640 More details are in the issue, basically I am changing all the include directory and link directory usage to CMake's `CUDA::*` targets	2024-02-27 11:26:48 -08:00
Yulong Wang	3cb81cdde2	[js/common] move 'env.wasm.trace' to 'env.trace' (#19617 ) ### Description Try to move 'env.wasm.trace' to 'env.trace' to make it less confusing, because it also works in webgpu. Marked 'env.wasm.trace' as deprecated.	2024-02-27 11:07:15 -08:00
zesongw	2e4d1b8f1b	[WebNN EP] Add support for Op MatMul of WebNN CPU backend (#19413 ) Enable MatMul support for WebNN CPU backend to support more models.	2024-02-27 10:01:12 -08:00
Scott McKay	1c468a03b9	Improve Nuget-CUDA-Packaging-Pipeline (#19668 ) ### Description * Publish the artifacts as late as possible * once published the artifacts are immutable, and any retry will fail if they exist * if any step fails after publishing the stage cannot be retried * use powershell to cleanup * DeleteFiles is taking >30 mins and causing the stage to timeout * powershell took < 1s ### Motivation and Context Make pipeline more robust	2024-02-27 09:27:43 -08:00
Scott McKay	580ee20dfc	Tweak Windows build parallelization settings (#19664 ) ### Description Use UseMultiToolTask and limit the number of cl.exe instances running. MultiToolTask info: https://devblogs.microsoft.com/cppblog/improved-parallelism-in-msbuild/ Info on why limiting CL_MPCount can help: https://github.com/Microsoft/checkedc-clang/wiki/Parallel-builds-of-clang-on-Windows The current CIs have 4 cores (both physical and logical). Hardcoded the GPU build in win-ci.yml to use CL_MPCount of 2 as that seems to work fine. Can adjust if needed to base it on the actual number of cores or to use build.py to build. Caveat: I've run about 16 builds and haven't seen a slow build yet, but as the root cause of the slow builds isn't really known this isn't guaranteed to be a fix. ### Motivation and Context Try and prevent super slow GPU builds by reducing number of tasks potentially running in parallel.	2024-02-27 08:56:16 -08:00
Yi Zhang	3b46ab6439	Re-add testing removed by mistake. (#19647 )	2024-02-27 08:46:29 -08:00
Adrian Lizarraga	4838cb6b3e	[QNN Quantization] Ensure fused nodes have names (#19650 ) ### Description - Updates the `qnn_preprocess_model()` method to set a name for any new nodes added to the graph (due to fusion). - Updates the `qnn_preprocess_model()` method to set a name for any unnamed nodes that previously existed in the original graph. - Adds unit tests for fusions (previously missing) - Checks that fused node names exist and are unique - Checks that fused graph is equivalent to original graph ### Motivation and Context Nodes are not strictly required to have names. However, a planned/upcoming feature to support mixed-precision (integer) quantized models needs nodes to have names.	2024-02-27 02:27:35 -08:00
cloudhan	1e69b61238	Make version string detection more robust (#19615 ) `/opt/rocm/.info/version-dev` is only available if the `rocm-dev` metapackage is installed. Installing it pulls in many packages that users do not need, so they may opt for fine-grained control instead. Fall back to `rocm_version.h` in case `rocm-dev` is not installed.	2024-02-27 16:06:06 +08:00
duanshengliu	9e19684944	Fix the TypeError issue in quantize.py (#19459 ) ### Description Fix the bug described in https://github.com/microsoft/onnxruntime/issues/19430	2024-02-26 20:56:32 -08:00
Rachel Guo	5bb58a10e7	Enable the most verbose logging level in detox E2E React Native CI (#19659 ) ### Description The RN CI fails intermittently with the error "app seems to idle". Enable the most verbose logging level to at least get more information (steps to dump device.log from the detox folder/artifacts can be added if necessary). --------- Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>	2024-02-26 20:00:14 -08:00
kailums	6f566562ce	support user_compute_stream for rocm ep (#19619 ) ### Description Following PR #19229, which added support for the CUDA EP to use an external compute stream, we add the same support for the ROCm EP. When testing this feature with torch, we found that torch uses stream 0 as the default stream, and `torch.cuda.current_stream()` returns `0` for the current stream, but ORT treats `0` or `nullptr` as invalid and resets has_user_compute_stream to false. The has_user_compute_stream option will be removed in the future. ### Motivation and Context The motivation for this PR is that we want to use torch.cuda.graph to capture ORT's running kernels, which requires torch and ORT to run in the same stream, so we use this API to set ORT's working stream.	2024-02-27 11:31:03 +08:00
Scott McKay	8a71b65765	Remove skipping of Reshape from NNAPI EP (#19618 ) ### Description A number of Qualcomm Snapdragon chipsets do not produce correct output if we skip the Reshape, which ironically was a performance optimization for Snapdragon chips. Perf testing showed that Squeeze also seems to execute on CPU, so there's no benefit to using that as an alternative where possible, e.g. GlobalPool -> Reshape to 2D -> Gemm could potentially be replaced with GlobalPool -> Squeeze dims 2 and 3 -> Gemm if that offered better performance. ### Motivation and Context #19518	2024-02-27 11:35:27 +10:00
Changming Sun	18c8fab1ae	Fix a bug in build.py (#19652 ) ### Description Fix a bug in build.py that accidentally disabled C# tests for most builds when "--build_nuget" is specified. ### Motivation and Context The bug was introduced in PR #8892 .	2024-02-26 15:58:09 -08:00
Scott McKay	8bd943be39	Retry flaky XCode iOS UI tests if we get a known error (#19639 ) ### Description Xcode UI tests seem to be flaky: https://github.com/orgs/community/discussions/68807 Add a couple of retries if we get a "Timed out while loading Accessibility." error, which is transient.	2024-02-27 09:31:32 +10:00
Sumit Agarwal	a9568935a5	[DML EP] Enable DML Graph Serialization (#19505 ) ### Description This PR adds a feature to serialize all DML EP partitions into DML currency individually for a given model. This feature can be dynamically turned on by using the DML EP option `ep.dml.enable_graph_serialization`. ### Motivation and Context Useful when a user wants to capture a DML EP specific partition into DML currency to mitigate the dependency on the framework.	2024-02-26 11:35:13 -08:00
Yufeng Li	430a086f22	fix memory mapping on Windows (#19623 ) ### Description Windows memory map casts mapped_offset to DWORD directly. It will be truncated if it is larger than 2^32-1. We need to set dwFileOffsetHigh for this case. ### Motivation and Context The bug was found from #19450	2024-02-25 08:50:45 -08:00
Yi Zhang	0fcc6fb760	Add Whisper model in CI (#19604 ) ### Description Add Whisper Conversion and E2E into Big Models pipeline --------- Co-authored-by: Your Name <your@email.com> Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>	2024-02-25 14:04:22 +08:00
Yi Zhang	c980149c85	Add log for random exception in Linux GPU Test Stage. (#19569 ) ### Description 1. Check GPU status in docker 2. Use stages so that the test stage can leverage existing build artifacts ### Motivation and Context To investigate the root cause of the random exception `CUDA failure 100: no CUDA-capable device is detected`	2024-02-24 13:00:53 -08:00
Yulong Wang	0edb035808	[js/web] fix suite test list for zero sized tensor (#19638 ) ### Description Fixes build break brought by #19614 Currently WebGL backend does not support zero sized tensor. This change split test data into 2 parts, and only enable zero sized tensor tests for WebGPU.	2024-02-24 10:09:07 -08:00
Changming Sun	9ccdc4961a	Stop using apiset in OneCore build: use onecoreuap.lib instead of onecoreuap_apiset.lib (#19632 ) ### Description Stop using apiset in OneCore build: use onecoreuap.lib instead of onecoreuap_apiset.lib in onecore build. ### Motivation and Context 1. Now all Windows Editions come with Reverse Forwarders. We should just use the normal onecore libs. 2. Many new Windows APIs are only available in [windows umbrella libraries](https://learn.microsoft.com/en-us/windows/win32/apiindex/windows-umbrella-libraries). So these libraries are not specific for Windows CoreOS or Onecore. 3. Going forward we should use "IsApiSetImplemented" to guard our API usages: https://learn.microsoft.com/en-us/windows/win32/apiindex/detect-api-set-availability . After this change, our built binaries can pass apivalidator's check. ``` C:\local\apivalidator>apivalidator.exe -BinaryPath:C:\src\onnxruntime\b\Debug\Debug\onnxruntime.dll -SupportedApiXmlFiles:onecoreuap_DDIs.xml ApiValidation: Summary: "C:\src\onnxruntime\b\Debug\Debug\onnxruntime.dll" is Universal ApiValidation: All binaries are Universal ``` So it will give an easy way to test ONNX Runtime's compatibility to Windows versions.	2024-02-23 22:31:57 -08:00
Scott McKay	c12a20bef9	Add helper to run CIs for a branch using `az pipelines`. (#16843 ) ### Description Add helper to run CIs for a branch using `az pipelines`. This can be used to easily kick off multiple CIs for a branch prior to creating a PR. Update run_CIs_for_external_pr.py so the CI list can be shared. Request json output from `gh pr view` so the current state is more easily parsed.	2024-02-24 14:06:30 +10:00
Sheil Kumar	46c4d7fe4a	Disable gemm activation for non-float data types (#19612 ) ### Description Disable gemm activation for non-float data types ### Motivation and Context When a float16 model contains a Gemm+Relu subgraph, the gemm_activation_fusion will kick in and cause the two nodes to be eliminated and replaced with a FusedGemm. This however is only registered for the float data type. This causes model load failures. Disable the fusion for non-float data types. --------- Co-authored-by: Sheil Kumar <sheilk@microsoft.com>	2024-02-23 18:20:22 -08:00
Guenther Schmuelling	bb43a0f133	[js/webgpu] minor fixes to make tinyllama work (#19564 )	2024-02-23 15:45:30 -08:00
Yulong Wang	aec2389ad0	[js/webgpu] allows a ProgramInfo's RunData to use zero sized output (#19614 ) ### Description This PR allows zero-sized output. To make the implementation simple, it does not support partial zero-sized tensor. Which means, either all outputs are zero-sized, or an error will be reported. added 2 tests: - op test of `Add` with input T[2,0] T[2,1], and - test_split_zero_size_splits	2024-02-23 12:52:47 -08:00
Markus Tavenrath	efbe2b8455	Fix cuDNN v9 build by replacing removed cuDNN v6 RNN API usage by cuDNN v8 RNN API and reenable RNN tests for CUDA EP (#19419 ) Replace the deprecated cuDNN RNN API with the cuDNN v8 RNN API and re-enable RNN tests for the CUDA EP. ### Motivation and Context The deprecated cuDNN RNN API might vanish soon, and all RNN tests are currently disabled for the CUDA EP RNN implementation due to failures. With this change the deprecated API usage has been removed and the updated implementation no longer fails the tests.	2024-02-23 08:45:17 -08:00
cao lei	f430600432	Enable streams for DML EP. This change is to revert PR 19481 since the bug 19480 is fixed by PR 19515 (#19609 ) ### Description Enable streams for DML EP. This change is to revert PR 19481 since the bug 19480 is fixed by PR 19515 ### Motivation and Context Enable streams for DML EP. This change is to revert PR 19481 since the bug 19480 is fixed by PR 19515	2024-02-23 06:02:05 -08:00
satyajandhyala	ae3d73c981	[JS/WebGPU] Fix Split and Where to handle corner cases. (#19613 ) ### Description 1. Fix Where operator to handle Boolean input less than 4 bytes. 2. Fix JSEP test harness to use tensor names consistently.	2024-02-23 00:21:15 -08:00
Markus Tavenrath	5e432a3ae6	Add support for NHWC GridSample in the CUDA EP and enable grid_sample_test for all EPs (#19562 ) I've added NHWC GridSample support to the CUDA EP to reduce the number of layout transforms. Also I've enabled the full set of GridSampleTests for all EPs. I've also added the GridSample OpSet 16 to the registered kernels. ### Motivation and Context This is the first PR in a series of enhancements of the CUDA EP improving NHWC support to avoid costly layout transforms between NHWC and NCHW nodes which are layout sensitive. Also testing was quite rudimentary for the CUDA EP while it was great for the CPU path. I've regenerated grid_sample_test.cc enabling tests for other platforms as well. Those tests resurfaced #10607 again which is fixed as well.	2024-02-22 19:47:15 -08:00
pengwa	ae92d593c0	ONNX Gelu Op in Opset 20 (#19560 ) ### ONNX Gelu Op in Opset 20 Refactor code to support MSDomain Gelu and ONNX Gelu-opset20 Op 1. Move CPU-GELU implementation from `onnxruntime/contrib_ops/cpu/activations.h/cc` to `onnxruntime/core/providers/cpu/tensor/gelu.h/cc`, as the implementation for approximate attribute to be 'none'. 2. Duplicate some logic from `onnxruntime/contrib_ops/cpu/bert/bias_gelu.cc` to `onnxruntime/core/providers/cpu/tensor/gelu.h/cc`, as the implementation for approximate attribute to be 'tanh'. 3. Register ONNX domain Gelu CPU kernel from opset 20 in `onnxruntime/core/providers/cpu/cpu_execution_provider.cc`. 4. Move `onnxruntime/contrib_ops/cuda/bert/fast_gelu_impl.h/cu` to `onnxruntime/core/providers/cuda/tensor/gelu_impl.h` and `onnxruntime/core/providers/cuda/tensor/gelu_approximate_impl.cu` respectively, as the implementation for approximate attribute to be 'tanh'. 5. Implement the logic for approximate attribute to be 'none' in `onnxruntime/core/providers/cuda/tensor/gelu_impl.cu`. 6. Register ONNX domain Gelu CUDA kernel from opset 20 in `onnxruntime/core/providers/cuda/cuda_execution_provider.cc`. 7. ROCM ep related changes. 8. Enrich the tests for ONNX domain Gelu in `onnxruntime/test/providers/cpu/activation/activation_op_test.cc`.	2024-02-23 11:05:16 +08:00
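
The Gelu commit above (ae92d593c0) distinguishes the approximate attribute values 'none' and 'tanh'. As a rough scalar illustration of the two formulas from the ONNX Gelu operator definition, and not the kernel code in the files listed, a minimal sketch might look like this. The function names `gelu_exact` and `gelu_tanh` are hypothetical and chosen only for this example.

```python
import math


def gelu_exact(x: float) -> float:
    """Gelu with approximate='none': x times the standard normal CDF of x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Gelu with approximate='tanh': the tanh-based approximation."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


if __name__ == "__main__":
    # Print both variants side by side for a few sample inputs.
    for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(f"x={x:+.1f}  none={gelu_exact(x):+.6f}  tanh={gelu_tanh(x):+.6f}")
```

The two variants agree closely near zero and differ slightly elsewhere, which is why a runtime keeps a separate code path per attribute value rather than substituting one for the other.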

10643 commits
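
The general-fusion commit in the log (d2e6dd25ea) rests on the difference between scalar and 1-dim Gather indices. A small sketch of that behavior on plain nested lists, restricted to axis 0, may make it concrete. The helpers `gather_axis0`, `split_axis0`, and `squeeze_axis0` are hypothetical stand-ins written for this illustration and are not ONNX Runtime APIs.

```python
def gather_axis0(data, indices):
    """Mimic Gather along axis 0 for a scalar or a 1-dim list of indices."""
    if isinstance(indices, int):
        # Scalar index: the gathered axis is removed from the result.
        return data[indices]
    # 1-dim indices: the axis is kept, with length len(indices).
    return [data[i] for i in indices]


def split_axis0(data):
    """Split into pieces of length 1 along axis 0; each piece keeps the axis."""
    return [[row] for row in data]


def squeeze_axis0(piece):
    """Drop a leading axis of length 1."""
    assert len(piece) == 1
    return piece[0]


data = [[1, 2], [3, 4], [5, 6]]  # shape (3, 2)
pieces = split_axis0(data)       # three pieces of shape (1, 2)

# 1-dim indices match a Split output directly.
print(gather_axis0(data, [0]) == pieces[0])
# A scalar index matches a Split output only after a Squeeze.
print(gather_axis0(data, 0) == squeeze_axis0(pieces[0]))
```

In this toy model, fusing scalar-index Gather nodes into a Split without adding a Squeeze would change the output rank, which is the case the commit message says the earlier PR missed.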