onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-06-26 03:00:54 +00:00

Author	SHA1	Message	Date
sfatimar	4d1963c2a2	OpenVINO EP Rel 1.18 Changes (#20337 ) ### Description These changes include: - Support for OpenVINO 2024.1 - Import of precompiled blobs with EPContext blob - Separate device and precision as inputs - Deprecate the CPU_FP32 and GPU_FP32 terminology and introduce CPU and GPU - AUTO GPU, CPU will only create a GPU blob and not a CPU blob. ### Motivation and Context - OpenVINO 2024.1 will be out soon - Importing a precompiled blob can greatly reduce FEIL/FIL time. - Separating device and precision will make the input cleaner --------- Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com> Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>	2024-04-19 00:31:38 -07:00
Patrice Vignola	12569626cb	Update DML to 1.14.1 (#20380 )	2024-04-18 22:43:41 -07:00
Patrice Vignola	745b426c60	[DML] Update DML to 1.14 (#20304 ) I am prefiring this change to pre-run the non-dml checks, and also to give folks the time to review it before DML gets released. When DML 1.14 officially releases, we'll only need to run the DML pipeline to automatically pick up the nuget package. This should save us some valuable time. Note that DML 1.14 is the release needed for ORT 1.17.4, and DML 1.15 will come soon after.	2024-04-18 16:22:57 -07:00
Adam Louly	ee74fb6908	Introducing ORTPipelineModule - DeepSpeed Parallel Pipeline Support. (#20287 ) ### Description Introducing a new class ORTPipelineModule to handle wrapping layers in DeepSpeed pipeline parallel. ### Motivation and Context To support pipeline parallelism on ORTModule. This PR will include an initial support of deepspeed Pipeline parallelism. - [x] Support Pipeline parallel where layers are nn Modules in Sequential. - [ ] Support LayerSpec and TiedLayerSpec - [ ] Enable partitioning to accept List - [ ] Full-GPU Graph Consolidation - [ ] Subgraph Merging for Inference	2024-04-18 11:30:15 -07:00
Sumit Agarwal	f664f91298	[DML EP] Expose NPU macro via build command (#20306 ) ### Description This fixes the following: - Expose `ENABLE_NPU_ADAPTER_ENUMERATION` macro via build command, so that a user can enable NPU support for DML EP seamlessly. - Add keyword `_dmlEp_` as part of the node name, which is useful for debugging.	2024-04-18 11:23:13 -07:00
Patrice Vignola	76434907fb	[DML EP] Add graph capture (#20257 ) This adds a new "Graph Capture" option to the DML ep, similar to the cuda graph functionality. Here's how graph capture works: - A user can enable graph capture in the session options by setting `ep.dml.enable_graph_capture` to `true` - When they want to capture a run, they set `gpu_graph_id` in their `RunOptions` to a number bigger than 0 (0 is reserved for internal use according to the cuda graph documentation). - Then, when they start the inference, the graph will be captured and stored in the DML EP for future use - When they execute the run for a second time with the same id, the `ReplayGraph` function in the DML EP will be called instead of executing the kernels, resulting in very low overhead and avoiding kernel recompilation. This feature can give up-to-par or even better performance than specifying the static dimensions at session creation time, but is also much more flexible.	2024-04-18 10:15:00 -07:00
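The capture-then-replay flow described in the commit above can be illustrated with a toy model. This is a minimal sketch in plain Python, not DML EP code: the class `GraphCaptureSession`, its `dispatches` counter, and the use of Python callables as "kernels" are all hypothetical stand-ins. Only the option names `enable_graph_capture` and `gpu_graph_id`, and the rule that id 0 is reserved, come from the commit message.

```python
from typing import Callable, Dict, List

Kernel = Callable[[float], float]


class GraphCaptureSession:
    """Toy stand-in for an execution provider that can capture and replay a run."""

    def __init__(self, enable_graph_capture: bool) -> None:
        # Mirrors the `ep.dml.enable_graph_capture` session option (hypothetical model).
        self.enable_graph_capture = enable_graph_capture
        self._graphs: Dict[int, Kernel] = {}
        # Number of individual dispatches issued; a replay counts as one.
        self.dispatches = 0

    def run(self, kernels: List[Kernel], x: float, gpu_graph_id: int = 0) -> float:
        # Id 0 is reserved, so only ids greater than 0 take part in capture.
        capturing = self.enable_graph_capture and gpu_graph_id > 0

        if capturing and gpu_graph_id in self._graphs:
            # Second and later runs with the same id: replay the stored graph.
            self.dispatches += 1
            return self._graphs[gpu_graph_id](x)

        # Normal path: dispatch every kernel one by one.
        for kernel in kernels:
            self.dispatches += 1
            x = kernel(x)

        if capturing:
            recorded = list(kernels)

            def replay(value: float, _recorded: List[Kernel] = recorded) -> float:
                for kernel in _recorded:
                    value = kernel(value)
                return value

            # Store the recorded sequence for future runs with this id.
            self._graphs[gpu_graph_id] = replay

        return x
```

In this model the first run with `gpu_graph_id=1` issues one dispatch per kernel and stores the sequence; later runs with the same id issue a single replay. With the real Python bindings, the two options would presumably be set through `SessionOptions.add_session_config_entry` and `RunOptions.add_run_config_entry`, but that wiring is an assumption and is not shown in the commit message.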
Adrian Lizarraga	0a1902525f	Add patch for ONNX 1.16.0 shape inference bug (#20316 ) ### Description - Adds a patch that fixes a shape inference bug that caused a segfault: https://github.com/onnx/onnx/pull/6080 - Fix documentation describing why QLinearMatMul tests are currently being skipped. ### Motivation and Context The [PR for integrating with ONNX 1.16.0](https://github.com/microsoft/onnxruntime/pull/19745) disabled various python quantization tests due to a shape inference bug. This PR applies the ONNX fix as a patch. We still can't enable the tests because some of our CIs pip install onnx-1.16.0, which doesn't include the fix.	2024-04-17 10:23:22 -07:00
Yi-Hong Lyu	6b6a62fb40	Add vectorized AVX512F kernel for ReduceMaximumF32Kernel (#20268 ) ### Description This commit introduces a new vectorized AVX512F kernel, MlasReduceMaximumF32KernelAvx512F, which efficiently computes the maximum value of the supplied buffer. Additionally, microbenchmarks have been added for MlasComputeSoftmax (inplace), MlasReduceMaximumF32KernelAvx, MlasComputeSumExpF32KernelAvx512F, and MlasComputeSoftmaxOutputF32KernelAvx. ### Motivation and Context The goal of this commit is to enhance the performance of ReduceMaximumF32Kernel on CPUs with AVX512F instruction support. Benchmark results are tabulated below.	2024-04-16 13:52:43 -07:00

| name | AVX iterations | AVX real_time | AVX cpu_time | AVX512 iterations | AVX512 real_time | AVX512 cpu_time | time_unit |
| -- | -- | -- | -- | -- | -- | -- | -- |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:3/real_time | 271277304 | 2.58095 | 2.58091 | 263338132 | 2.65661 | 2.65661 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:3/real_time | 271220477 | 2.58095 | 2.58095 | 263509929 | 2.65652 | 2.65649 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:3/real_time | 271240587 | 2.58064 | 2.58064 | 263479542 | 2.65671 | 2.65665 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:3/real_time | 271227745 | 2.58083 | 2.58079 | 263402506 | 2.65657 | 2.65657 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:3/real_time | 271255069 | 2.58073 | 2.58071 | 263463858 | 2.65682 | 2.65682 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:3/real_time | 271257174 | 2.58058 | 2.58052 | 263460120 | 2.65682 | 2.65682 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:4/real_time | 174395051 | 4.01401 | 4.01401 | 197330481 | 3.5465 | 3.54636 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:4/real_time | 174645502 | 3.99691 | 3.99691 | 197474831 | 3.54298 | 3.54278 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:4/real_time | 174523308 | 4.01391 | 4.01386 | 197389981 | 3.54518 | 3.54506 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:4/real_time | 174779200 | 3.99874 | 3.99874 | 197519075 | 3.54227 | 3.54209 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:4/real_time | 174642874 | 4.00645 | 4.00641 | 197642101 | 3.54195 | 3.54188 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:4/real_time | 174546754 | 4.0061 | 4.00608 | 197621033 | 3.54296 | 3.54281 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:5/real_time | 162752651 | 4.30119 | 4.30114 | 215552503 | 3.24767 | 3.24752 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:5/real_time | 162717463 | 4.30123 | 4.30116 | 215541082 | 3.24711 | 3.24695 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:5/real_time | 162718819 | 4.3016 | 4.30153 | 215589239 | 3.24725 | 3.24708 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:5/real_time | 162719596 | 4.30151 | 4.30145 | 215563846 | 3.24956 | 3.24949 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:5/real_time | 162753333 | 4.30125 | 4.30125 | 215537315 | 3.24924 | 3.24908 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:5/real_time | 162752258 | 4.3014 | 4.30141 | 215526482 | 3.24744 | 3.24735 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:7/real_time | 143579660 | 4.87526 | 4.87516 | 100000000 | 5.25767 | 5.25752 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:7/real_time | 143585097 | 4.87476 | 4.87467 | 100000000 | 5.41583 | 5.41567 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:7/real_time | 143571011 | 4.87506 | 4.87503 | 182359467 | 3.83773 | 3.83764 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:7/real_time | 143587142 | 4.87487 | 4.8748 | 182397261 | 3.83807 | 3.8379 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:7/real_time | 143578465 | 4.87525 | 4.87521 | 182428602 | 3.83777 | 3.83768 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:7/real_time | 143588555 | 4.87491 | 4.87488 | 125280452 | 5.59791 | 5.59766 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:9/real_time | 284851058 | 2.43476 | 2.43476 | 156879863 | 4.42895 | 4.42884 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:9/real_time | 270700898 | 2.59031 | 2.59024 | 157953114 | 4.42995 | 4.42968 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:9/real_time | 282871172 | 2.45385 | 2.45385 | 157801156 | 4.42817 | 4.42804 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:9/real_time | 285307738 | 2.47009 | 2.47005 | 158058507 | 4.4279 | 4.42786 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:9/real_time | 285709536 | 2.45481 | 2.45476 | 158070961 | 4.42809 | 4.42799 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:9/real_time | 285449733 | 2.47495 | 2.47491 | 158069718 | 4.45026 | 4.45017 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:11/real_time | 189213618 | 3.79684 | 3.79676 | 139459497 | 5.01882 | 5.01871 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:11/real_time | 185600468 | 3.76394 | 3.76376 | 139444892 | 5.01922 | 5.01905 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:11/real_time | 184968668 | 3.80636 | 3.80636 | 139470834 | 5.01948 | 5.01936 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:11/real_time | 183867226 | 3.80432 | 3.80427 | 139481986 | 5.01975 | 5.01944 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:11/real_time | 184301650 | 3.81634 | 3.81634 | 139452846 | 5.01983 | 5.01972 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:11/real_time | 186215795 | 3.82659 | 3.82654 | 139497736 | 5.02119 | 5.02113 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:13/real_time | 135622415 | 5.16256 | 5.16252 | 124661337 | 5.61227 | 5.61194 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:13/real_time | 135618907 | 5.15967 | 5.1596 | 124805224 | 5.6088 | 5.60854 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:13/real_time | 135612192 | 5.15506 | 5.15501 | 124803221 | 5.60901 | 5.60869 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:13/real_time | 135906082 | 5.15818 | 5.15818 | 124776601 | 5.60898 | 5.60886 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:13/real_time | 135369523 | 5.15709 | 5.15682 | 124790370 | 5.60927 | 5.60902 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:13/real_time | 135596827 | 5.1603 | 5.1603 | 124792145 | 5.61637 | 5.61614 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:15/real_time | 110947137 | 5.96511 | 5.96495 | 112861522 | 6.20035 | 6.20014 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:15/real_time | 118004792 | 6.22645 | 6.22628 | 112909900 | 6.20073 | 6.20073 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:15/real_time | 112630319 | 6.25564 | 6.25552 | 112874563 | 6.19932 | 6.19924 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:15/real_time | 117403034 | 6.17263 | 6.17258 | 112927318 | 6.19866 | 6.19842 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:15/real_time | 108921863 | 6.48624 | 6.48612 | 112927746 | 6.20057 | 6.20026 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:15/real_time | 110358148 | 6.66805 | 6.66789 | 112907312 | 6.19938 | 6.19908 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:16/real_time | 203419574 | 3.4415 | 3.44137 | 237134525 | 2.95649 | 2.95638 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:16/real_time | 203414035 | 3.4411 | 3.44099 | 237129564 | 2.95178 | 2.95171 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:16/real_time | 203404068 | 3.44157 | 3.44151 | 236981704 | 2.9518 | 2.95167 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:16/real_time | 203391471 | 3.44146 | 3.44137 | 237108807 | 2.95203 | 2.95196 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:16/real_time | 203393801 | 3.44131 | 3.44127 | 237126460 | 2.95278 | 2.95272 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:16/real_time | 203407476 | 3.44181 | 3.44162 | 237154444 | 2.95293 | 2.9528 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:500/real_time | 37551439 | 18.6407 | 18.6407 | 39222534 | 17.858 | 17.8571 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:500/real_time | 37544097 | 18.6404 | 18.6401 | 39174151 | 17.8539 | 17.8536 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:500/real_time | 37549837 | 18.6391 | 18.6391 | 39233956 | 17.8507 | 17.8505 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:500/real_time | 45996345 | 15.2157 | 15.2153 | 39285929 | 17.848 | 17.8474 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:500/real_time | 46012429 | 15.2184 | 15.2179 | 65664865 | 10.7366 | 10.7364 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:500/real_time | 45912375 | 15.2349 | 15.2346 | 65205908 | 10.8498 | 10.8492 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:2000/real_time | 9493955 | 73.7232 | 73.7203 | 10188090 | 68.7931 | 68.7908 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:2000/real_time | 9495562 | 73.7173 | 73.7173 | 10180895 | 68.7533 | 68.7511 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:2000/real_time | 9487371 | 73.7852 | 73.7831 | 10164473 | 68.7279 | 68.725 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:2000/real_time | 10816047 | 64.7322 | 64.7287 | 10168481 | 68.8109 | 68.8096 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:2000/real_time | 10808802 | 64.7232 | 64.721 | 19478320 | 36.1471 | 36.1461 | ns |
| REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:2000/real_time | 10818192 | 64.7304 | 64.728 | 19419672 | 35.9635 | 35.9635 | ns |
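A lane-wise maximum reduction of the kind a 512-bit SIMD kernel performs can be sketched as follows. This is a plain-Python illustration, not the MLAS implementation: the function name `reduce_max_lanes` and the constant `LANES` are hypothetical, and the assumption is only that one 512-bit register holds 16 float32 values. The sketch keeps one running maximum per lane, folds the lanes together at the end, and handles the leftover tail with a scalar loop.

```python
from typing import List, Sequence

LANES = 16  # assumption: 512 bits / 32-bit floats = 16 values per register


def reduce_max_lanes(values: Sequence[float], lanes: int = LANES) -> float:
    """Return max(values) using a lane-wise accumulate plus a scalar tail."""
    if not values:
        raise ValueError("values must not be empty")

    n = len(values)
    full = n - n % lanes  # number of elements covered by whole blocks

    if full == 0:
        # Buffer shorter than one register: scalar path only.
        result = values[0]
        for v in values[1:]:
            if v > result:
                result = v
        return result

    # Load the first block as the per-lane accumulator.
    acc: List[float] = list(values[:lanes])

    # Element-wise max of the accumulator with each following block.
    for start in range(lanes, full, lanes):
        block = values[start:start + lanes]
        acc = [a if a > b else b for a, b in zip(acc, block)]

    # Horizontal reduction across lanes.
    result = acc[0]
    for a in acc[1:]:
        if a > result:
            result = a

    # Scalar tail for the elements that did not fill a whole block.
    for v in values[full:]:
        if v > result:
            result = v
    return result
```

The benchmark numbers in the commit are consistent with this structure paying off mainly for long, well-aligned buffers: for D:2000 at 64-byte alignment the reported real_time is 64.7232 ns with AVX versus 36.1471 ns with AVX512, roughly 1.8x faster, while for small sizes such as D:9 the AVX512 column is slower.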
George Wu	08d208b969	[QNN EP] refactor QNN deps/copy logic. start copying deps to target python loc… (#20317 ) copy QNN deps when building python bindings as well. tweak the wildcard to only copy QNN related files. latest sdk from Qualcomm (>= 2.21) also include SNPE dll's which we don't want to include.	2024-04-15 22:33:12 -07:00
liqun Fu	cd7112f800	Integration with ONNX 1.16.0 (#19745 ) ### Description update with ONNX 1.16.0 branch according to https://github.com/microsoft/onnxruntime/blob/main/docs/How_To_Update_ONNX_Dev_Notes.md ONNX 1.16.0 release notes: https://github.com/onnx/onnx/releases/tag/v1.16.0 #### Updated ops for CPU EP: - DequantizeLinear(21) - Added int16 and uint16 support + various optimizer tests - Missing int4 and uint4 support - Missing block dequantization support - QuantizeLinear(21) - Added int16 and uint16 support + various optimizer tests - Missing int4 and uint4 support - Missing block quantization support - Cast(21) - Missing int4 and uint4 support - CastLike(21) - Missing int4 and uint4 support - ConstantOfShape(21) - Missing int4 and uint4 support - Identity(21) - Missing int4 and uint4 support - If(21) - Missing int4 and uint4 support - Loop(21) - Missing int4 and uint4 support - Reshape(21) - Missing int4 and uint4 support - Scan(21) - Missing int4 and uint4 support - Shape(21) - Missing int4 and uint4 support - Size(21) - Missing int4 and uint4 support - Flatten(21) - Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4 support - Pad(21) - Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4 support - Squeeze(21) - Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4 support - Transpose(21) - Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4 support - Unsqueeze(21) - Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4 support #### Unimplemented opset 21 features/ops - int4 and uint4 data type - QLinearMatMul(21) - GroupNormalization(21) - ai.onnx.ml.TreeEnsemble(5) ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. 
--> ### Disabled tests #### ORT Training orttraining/orttraining/test/python/orttraining_test_ort_apis_py_bindings.py - test_ort_custom_ops: Potential shape inference bug for custom ops #### Python quantization unit tests test/onnx/python/quantization (shape inference bug) - test_op_conv_transpose.py: test_quantize_conv_transpose_u8u8_fp16 - test_op_conv_transpose.py: test_quantize_conv_transpose_s8s8_fp16 - test_op_gemm.py: test_quantize_qop_gemm_s8s8 - test_op_gemm.py: test_quantize_qop_gemm_e4m3fn_same - test_op_gemm.py: test_quantize_qop_gemm_e4m3fn_p3 - test_op_matmul.py: test_quantize_matmul_u8u8_f16 - test_op_matmul.py: test_quantize_matmul_s8s8_f16 - test_op_matmul.py: test_quantize_matmul_s8s8_f16_entropy - test_op_matmul.py: test_quantize_matmul_s8s8_f16_percentile - test_op_matmul.py: test_quantize_matmul_s8s8_f16_distribution - test_op_relu.py: test_quantize_qop_relu_s8s8 #### ONNX tests - test_maxpool_2d_ceil_output_size_reduce_by_one: ONNX 1.16.0 fixed a maxpool output size bug and added this test. Enable this test when [ORT PR](https://github.com/microsoft/onnxruntime/pull/18377) is merged. Refer to original [ONNX PR](https://github.com/onnx/onnx/pull/5741). 
- test_ai_onnx_ml_tree_ensemble_set_membership_cpu: new unimplemented op ai.onnx.ml.TreeEnsemble - test_ai_onnx_ml_tree_ensemble_single_tree_cpu: same - test_ai_onnx_ml_tree_ensemble_set_membership_cuda: same - test_ai_onnx_ml_tree_ensemble_single_tree_cuda: same - test_cast_INT4_to_FLOAT_cpu: ORT Cast(21) impl doesn't support int4 yet - test_cast_INT4_to_INT8_cpu: same - test_cast_UINT4_to_FLOAT_cpu: same - test_cast_UINT4_to_UINT8_cpu: same - test_cast_INT4_to_FLOAT_cuda - test_cast_INT4_to_INT8_cuda - test_cast_UINT4_to_FLOAT_cuda - test_cast_UINT4_to_UINT8_cuda - test_constantofshape_float_ones_cuda: ConstantOfShape(21) not implemented for cuda - test_constantofshape_int_shape_zero_cuda: same - test_constantofshape_int_zeros_cuda: same - test_flatten_axis0_cuda: Flatten(21) not implemented for cuda - test_flatten_axis1_cuda: same - test_flatten_axis2_cuda: same - test_flatten_axis3_cuda: same - test_flatten_default_axis_cuda: same - test_flatten_negative_axis1_cuda: same - test_flatten_negative_axis2_cuda: same - test_flatten_negative_axis3_cuda: same - test_flatten_negative_axis4_cuda: same - test_qlinearmatmul_2D_int8_float16_cpu: QLinearMatMul(21) for onnx not implemented in ORT yet - test_qlinearmatmul_2D_int8_float32_cpu: same - test_qlinearmatmul_2D_uint8_float16_cpu: same - test_qlinearmatmul_2D_uint8_float32_cpu: same - test_qlinearmatmul_3D_int8_float16_cpu: same - test_qlinearmatmul_3D_int8_float32_cpu: same - test_qlinearmatmul_3D_uint8_float16_cpu: same - test_qlinearmatmul_3D_uint8_float32_cpu: same - test_qlinearmatmul_2D_int8_float16_cuda: same - test_qlinearmatmul_2D_int8_float32_cuda: same - test_qlinearmatmul_2D_uint8_float16_cuda: same - test_qlinearmatmul_2D_uint8_float32_cuda: same - test_qlinearmatmul_3D_int8_float16_cuda: same - test_qlinearmatmul_3D_int8_float32_cuda: same - test_qlinearmatmul_3D_uint8_float16_cuda: same - test_qlinearmatmul_3D_uint8_float32_cuda: same - test_size_cuda: Size(21) not implemented for cuda - 
test_size_example_cuda: same - test_dequantizelinear_blocked: Missing implementation for block dequant for DequantizeLinear(21) - test_quantizelinear_blocked_asymmetric: Missing implementation for block quant for QuantizeLinear(21) - test_quantizelinear_blocked_symmetric: Missing implementation for block quant for QuantizeLinear(21) --------- Signed-off-by: liqunfu <liqun.fu@microsoft.com> Signed-off-by: Ganesan Ramalingam <grama@microsoft.com> Co-authored-by: Ganesan Ramalingam <grama@microsoft.com> Co-authored-by: George Wu <jywu@microsoft.com> Co-authored-by: adrianlizarraga <adlizarraga@microsoft.com>	2024-04-12 09:46:49 -07:00
Andrew Fantino	7303a90f49	Fix build errors from date/date.h C++20 compatibility (#20139 ) ### Description For C++ standards >= 20, use `std::chrono::operator<<` in place of `date::operator<<` to fix ambiguous operator compile error. ### Motivation and Context The external dependency HowardHinnant/date has a conflict with std::chrono for >=C++20. Solves #20137	2024-04-02 22:10:25 -07:00
Yi Zhang	dae77e6014	Support building Windows CUDA with Ninja (#20176 ) ### How to run it locally 1. conda install ninja 2. "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Auxiliary\Build\vcvarsall.bat" x64 3. python.exe {ort_repo}\tools\ci_build\build.py --config RelWithDebInfo --build_dir {ort_repo}\build_cuda --skip_submodule_sync --build_csharp --update --parallel --cmake_generator "Ninja" --build_shared_lib --enable_onnx_tests --enable_pybind --build_java --build_nodejs --use_cuda "--cuda_home=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8" --enable_cuda_profiling --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=60 4. cd build_cuda\RelWithDebInfo 5. cmake --build . -j16 ### Motivation and Context In packaging pipelines, we often hit a random issue where building with CUDA on Windows takes too much time, although it has been reduced considerably by moving the build to the CPU machine. We're planning to build with Ninja instead of msbuild in packaging pipelines so that nvcc can run in parallel. Supporting it locally is the first step.	2024-04-03 11:19:31 +08:00
Jeff Bloomfield	2f31560430	Enable generic feature level devices in DML EP (#20114 ) ### Description Enable NPUs supporting DXCORE_ADAPTER_ATTRIBUTE_D3D12_GENERIC_ML and D3D_FEATURE_LEVEL_1_0_GENERIC with DML EP. This also begins ingesting DX headers through the DirectX-Headers repo. Note that this includes an update to cgamanifest.json for onnx-tensorrt, which is triggered during regeneration due to prior changes to deps.txt.	2024-03-29 14:37:30 -07:00
Ye Wang	17919717b5	add QMoE (#20108 ) ### Description 1. Introduce the latest cutlass extension from TRTLLM, which gives us the opportunity to upgrade cutlass (to 3.4) from the MoE side. 2. Fix Windows build issue 3. Add Int4 MoE op and ut	2024-03-29 10:24:19 -07:00
Dmitri Smirnov	b95fd4e644	Enable CUDA EP unit testing on Windows (#20039 ) ### Description Address build issues and source code discrepancies. Fix cuda_test_provider gtest argument stack corruption. ### Motivation and Context The `OpTester` class that is widely used for kernel testing is not suitable for testing internal classes for EPs that are built as shared objects. Currently, CUDA EP tests run only on Linux. We want to enable testing and development on Windows, and create a usable pattern for testing other EPs' internals. Alternatives considered: Abstracting EP unit tests into a separate test executable such as `onnxruntime_test_all`. This alternative was rejected as it would create a lot more changes in the established patterns, and potentially interfere with CUDA functionality with more complex source code maintenance.	2024-03-27 13:32:36 -07:00
Dmitri Smirnov	3076b56947	Make MS Debug engine SymInitialize() called as needed. (#20036 ) ### Description Initialize Symbol engine as needed with no duplicate calls. ### Motivation and Context Currently the abseil library may call SymInitialize more than once when shared libraries are involved. However, it can be called only once per process. Our debug_alloc may also call it when enabled. This change enables initialization to proceed only when needed, with no duplicate effort.	2024-03-22 16:17:47 -07:00
sfatimar	eab35c20fc	Ort openvino npu 1.17 master (#19966 ) ### Description - Add NPU to the list of supported devices. - Add changes to support OV 2024.0. - Nuget packages: remove packaging of OpenVINO DLL. - Bug fixes with the Python API. - Reverted Dockerfiles that are not being maintained. ### Motivation and Context The NPU device has been introduced by Intel in the latest client systems. The OpenVINO 2024.0 release is out. --------- Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com> Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com> Co-authored-by: Ubuntu <ubuntu@ubuntu-118727.iind.intel.com> Co-authored-by: hmamidix <hemax.sowjanya.mamidi@intel.com> Co-authored-by: vthaniel <vishnudas.thaniel.s@intel.com> Co-authored-by: saurabhkale17 <saurabh1.kale@intel.com>	2024-03-21 18:44:00 -07:00
Changming Sun	dafbef3a21	CMake: support reading dependency zip files from a local mirror (#20005 ) ### Description To test this feature, run ```bat python cmake\deps_update_and_upload.py --root-path mirror ``` Then run build.py as usual. The zip files will be cached locally, which avoids downloading them again and again.	2024-03-21 17:58:59 -07:00
Yufeng Li	15219e2e71	turn on neural_speed by default (#19627 ) ### Description The crash caused by neural_speed turns out to be a corner case. Turn it on by default.	2024-03-20 12:49:58 -07:00
Rachel Guo	6b305f95e0	Support xcframework for mac catalyst builds. (#19534 ) ### Motivation and Context MAUI on macOS uses mac-catalyst which requires a different native binary. --------- Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net> Co-authored-by: Scott McKay <skottmckay@gmail.com>	2024-03-20 10:55:19 -07:00
mindest	3dfe4a5e6d	[ROCm] Remove MPI dependency and collectives to use NCCL (#19830 ) ### Description * Remove MPI dependency to use NCCL AllReduce, etc. * Exclude unsupported collectives in hipify	2024-03-19 17:35:18 -07:00
Ted Themistokleous	6bb64683f8	Use version instead of version-dev for ROCm (#19967 )	2024-03-19 10:40:40 +08:00
Adam Louly	32558134a9	[On-Device-Training] Upgrade Flatbuffers to Support 2GB+ Checkpoints. (#19770 ) ### Description Modifications to support 2GB+ checkpoint & Upgrading Flatbuffers ### Motivation and Context This PR includes changes that will make ort handle 2GB+ checkpoints. To do that we need to upgrade flatbuffers to 23.5.9 - https://github.com/google/flatbuffers/pull/7945 - Modified the commitHash and the hash for the new version - Removed the patch for rust generator's unused variable warning as it is no longer producing this - [Check it out here](`d121e09d89/src/idl_gen_rust.cpp`) - Updated the VerifyField calls with alignment values that were introduced in the new version. --------- Co-authored-by: Sumit Agarwal <sumitagarwal@microsoft.com>	2024-03-14 16:36:24 -07:00
Changming Sun	1fb6cbddee	Add a build patch for Windows ARM64EC (#19898 ) ### Description Add a patch for Windows ARM64EC ### Motivation and Context Will need more changes in onnxruntime/core/common/cpuid_arch_definition.h and onnxruntime/core/common/cpuid_info.cc	2024-03-14 08:50:42 -07:00
Jeff Daily	9443366009	[ROCm] fix build failure when nccl is enabled (#19900 ) Building onnxruntime ROCm EP with --enable_nccl --use_mpi fails due to inclusion of MOE source files but MOE is not supported. The error observed is `error: contrib_ops/rocm/moe/ft_moe/moe_kernel.h: No such file or directory` The fix is to exclude collective/sharded_moe.* files when nccl is requested.	2024-03-13 21:16:54 -07:00
Adrian Lizarraga	9c3242ab70	[QNN EP] Copy security catalog file for HtpV73Skel.so from QNN SDK (#19903 ) ### Description Copies the `QNN_HOME/lib/hexagon-v73/unsigned/libqnnhtpv73.cat` file from QNN SDK to the unittest build directory. This is necessary in order to be able to load the `libQnnHtpV73Skel.so` file on Windows for modern versions of QNN SDK. ### Motivation and Context A [digitally-signed catalog file](https://learn.microsoft.com/en-us/windows-hardware/drivers/install/catalog-files) (.cat) can be used as a digital signature for an arbitrary collection of files.	2024-03-13 20:52:59 -07:00
Jake Mathern	18ad8587a6	[CP] Fix for xfgcheck and Fix WAI ARM64 build (#19634 ) (#19644 ) ### Description Fix WAI build by only conditionally copying linker flags ### Motivation and Context I broke the WAI build that contains ORT on ARM64	2024-03-13 17:54:06 -07:00
Edward Chen	860eb762c2	[Apple framework] Fix minimal build with training enabled. (#19858 ) Fix some linker errors that come up when integrating the onnxruntime-training-c pod into another Xcode project. The problematic configuration is a minimal build with training APIs enabled. - training_op_defs.o had some unresolved references to ONNX functions. It should not be included at all in a minimal build. - tree_ensemble_helper.o also had unresolved references to ONNX ParseData. The containing function is unused in a minimal build. Added a test to cover this configuration.	2024-03-12 11:33:30 -07:00
Scott McKay	978c40d853	Make partitioning utils QDQ aware so it does not break up QDQ node units (#19723 ) ### Description If the EP handles QDQ node units, we need to make sure we do not split those into different partitions. Update the partitioning utils to be QDQ aware. If there are node units, we process the logical nodes they represent instead of individual nodes. This ensures we process all nodes in a QDQ node unit at the same time so that they are always in the same partition. ### Motivation and Context Fix one of the issues in #19590 --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2024-03-12 10:55:49 +10:00
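The idea of partitioning by node unit rather than by individual node can be sketched with a toy model. This is a minimal illustration under strong assumptions, not the ORT partitioning utilities: the graph is a linear chain given in topological order, and the names `partition_by_node_unit`, `supported`, and `node_unit_of` are hypothetical. A unit is treated as one logical node, and it joins a partition only if every node in it is supported.

```python
from typing import Dict, List, Set


def partition_by_node_unit(
    nodes: List[str],
    supported: Set[str],
    node_unit_of: Dict[str, str],
) -> List[List[str]]:
    """Group a linear chain of nodes into partitions without splitting node units."""
    # Collect nodes per unit, keeping first-seen order. A node without an
    # entry in node_unit_of forms its own single-node unit.
    units: Dict[str, List[str]] = {}
    for node in nodes:
        units.setdefault(node_unit_of.get(node, node), []).append(node)

    partitions: List[List[str]] = []
    current: List[str] = []
    for members in units.values():
        if all(member in supported for member in members):
            # The whole unit is supported: add it as one logical node.
            current.extend(members)
        elif current:
            # An unsupported unit ends the current partition.
            partitions.append(current)
            current = []
    if current:
        partitions.append(current)
    return partitions
```

For a chain `DQ1, Conv, Q1, Relu, DQ2, Add, Q2` where `Relu` and `Add` are unsupported, a per-node pass would leave `DQ2` and `Q2` stranded in partitions of their own, while the unit-aware pass above returns only the complete `DQ1, Conv, Q1` group.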
Changming Sun	efad5bbc5a	Replace some old file system calls with C++17 std::filesystem APIs. (#19196 ) ### Description 1. Replace some old file system calls to use C++17 std::filesystem APIs. 2. Remove the tensorflow_C_PACKAGE_PATH cmake option, which was only used in onnxruntime_perf_test and whose code is no longer maintained. 3. Exclude onnx_test_runner and onnxruntime_perf_test from the iOS build because the C++17 filesystem library is not available there	2024-03-09 09:17:36 -08:00
Scott McKay	db59cec82f	Don't reduce warning level for CUDA build on Windows (#19663 ) ### Description Address warnings so all the ORT projects build with /W4 on Windows. Mainly - unused parameters - variables shadowing other ones ### Motivation and Context #19588 started on this.	2024-03-06 15:03:55 +10:00
Chi Lo	d9730c7f43	[TensorRT EP] Fix bug for DDS output handling for empty tensor (#19575 ) When the DDS output is an empty tensor (i.e. any of its dimensions is 0), TRT EP performs neither cudaMemcpyAsync() nor cuda::Impl_Cast(), to prevent accidentally overwriting other locations that might belong to other tensors. This PR also refactors the code to allocate only a single byte for empty tensors. #TODO: add unit tests to cover the DDS code paths, or do more testing with concurrent, sequential, threaded faster-rcnn using onnx_test_runner and verifying outputs --------- Co-authored-by: Chi Lo <lochi@microsoft.com>	2024-03-05 14:39:36 -08:00
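The empty-tensor rule in the commit above reduces to a small shape check. The following is a sketch of that rule only, not the TensorRT EP code: the helper names `is_empty_tensor`, `bytes_to_allocate`, and `should_copy` are hypothetical, and the one-byte allocation reflects one reading of the commit message.

```python
from math import prod
from typing import Sequence


def is_empty_tensor(shape: Sequence[int]) -> bool:
    """A tensor is empty when any of its dimensions is 0."""
    return any(dim == 0 for dim in shape)


def bytes_to_allocate(shape: Sequence[int], element_size: int) -> int:
    """Allocate a single placeholder byte for empty tensors, else the full size."""
    if is_empty_tensor(shape):
        return 1
    # prod(()) is 1, so a scalar (rank-0) tensor gets one element.
    return prod(shape) * element_size


def should_copy(shape: Sequence[int]) -> bool:
    """Skip the copy or cast step entirely when there is nothing to write."""
    return not is_empty_tensor(shape)
```

Skipping the copy for a shape such as `(2, 0, 3)` means no write is issued at all, which is how the fix avoids touching memory that may belong to another tensor.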
Chen Fu	06e684c9f2	Adding cuda kernel (optimized for sm80) for block-wise 4b quantized float 16 GEMM. (#18619 ) ### Description Adding CUDA kernel for block-wise 4b quantized float 16 GEMM; this is specially optimized for Nvidia Ampere GPUs. ### Motivation and Context Trying to improve quantized LLM inference performance on Nvidia Ampere GPUs ### Note: This is implemented by extending CUTLASS, so it has a hard dependency on CUTLASS. However, in the current build system, loading of the CUTLASS dependency is guarded with: (onnxruntime_USE_FLASH_ATTENTION OR onnxruntime_USE_MEMORY_EFFICIENT_ATTENTION) If both of these options are turned off, then compilation will fail. Why is the CUTLASS dependency guarded at all? It's a header-only library that does not introduce any binary if not instantiated. What's the downside of removing all the guards and just including CUTLASS unconditionally?	2024-03-05 09:37:45 -08:00
Changming Sun	a0521f899e	Enable CPUINFO for all Windows build (#19655 ) ### Description It was disabled in PR #9065. And the reason was: " api-ms-win-core-kernel32-legacy-*.dll wasn't available in Windows 8 and was added in Windows 10, so cpuinfo breaks our Windows 8 support. I'm disabling it again." We no longer support Windows 8. Therefore we can add CPUINFO back. ### Motivation and Context To make the code simpler. If in any case the library doesn't work as expected, we can submit a PR to their code base and fix it.	2024-03-01 16:23:20 -08:00
Edward Chen	5672cdebdf	Update google benchmark to 1.8.3. (#19734 ) Update google benchmark to 1.8.3. Update deps_update_and_upload.py script to make it easier to use.	2024-03-01 11:01:58 -08:00
Scott McKay	2a857d9a86	Add ML Program support for more operators (#19527 ) ### Description Add support for: - Clip/Relu/Relu6 - Add/Mul/Div/Sub/Pow - GlobalAveragePool/GlobalMaxPool/AveragePool/MaxPool - Reshape - Gemm/MatMul Fix some build issues/warnings from changes. Fix a couple of potential issues with the Resize op as well (noticed due to change to reject inputs with empty data at a higher level). ### Motivation and Context Enable mobilenetv2 with ML Program	2024-03-01 10:23:29 +10:00
Maximilian Müller	c20ced4132	Use CMake's find package for CUDA libs (#19673 ) ### Description Answers issue #19640 More details are in the issue, basically I am changing all the include directory and link directory usage to CMake's `CUDA::*` targets	2024-02-27 11:26:48 -08:00
cloudhan	1e69b61238	Make version string detection more robust (#19615 ) `/opt/rocm/.info/version-dev` is only available if the `rocm-dev` metapackage is installed. This will bring a lot of unused packages which are not needed by the users, they may opt for fine grained control. Fallback to `rocm_version.h` in case `rocm-dev` is not installed.	2024-02-27 16:06:06 +08:00
Changming Sun	9ccdc4961a	Stop using apiset in OneCore build: use onecoreuap.lib instead of onecoreuap_apiset.lib (#19632 ) ### Description Stop using apiset in OneCore build: use onecoreuap.lib instead of onecoreuap_apiset.lib in onecore build. ### Motivation and Context 1. Now all Windows Editions come with Reverse Forwarders. We should just use the normal onecore libs. 2. Many new Windows APIs are only available in [windows umbrella libraries](https://learn.microsoft.com/en-us/windows/win32/apiindex/windows-umbrella-libraries). So these libraries are not specific for Windows CoreOS or Onecore. 3. Going forward we should use "IsApiSetImplemented" to guard our API usages: https://learn.microsoft.com/en-us/windows/win32/apiindex/detect-api-set-availability . After this change, our built binaries can pass apivalidator's check. ``` C:\local\apivalidator>apivalidator.exe -BinaryPath:C:\src\onnxruntime\b\Debug\Debug\onnxruntime.dll -SupportedApiXmlFiles:onecoreuap_DDIs.xml ApiValidation: Summary: "C:\src\onnxruntime\b\Debug\Debug\onnxruntime.dll" is Universal ApiValidation: All binaries are Universal ``` So it will give an easy way to test ONNX Runtime's compatibility to Windows versions.	2024-02-23 22:31:57 -08:00
cao lei	f430600432	Enable streams for DML EP. This change is to revert PR 19481 since the bug 19480 is fixed by PR 19515 (#19609 ) ### Description Enable streams for DML EP. This change is to revert PR 19481 since the bug 19480 is fixed by PR 19515 ### Motivation and Context Enable streams for DML EP. This change is to revert PR 19481 since the bug 19480 is fixed by PR 19515	2024-02-23 06:02:05 -08:00
pengwa	ae92d593c0	ONNX Gelu Op in Opset 20 (#19560 ) ### ONNX Gelu Op in Opset 20 Refactor code to support MSDomain Gelu and ONNX Gelu-opset20 Op 1. Move CPU-GELU implementation from `onnxruntime/contrib_ops/cpu/activations.h/cc` to `onnxruntime/core/providers/cpu/tensor/gelu.h/cc`, as the implementation for approximate attribute to be 'none'. 2. Duplicate some logic from `onnxruntime/contrib_ops/cpu/bert/bias_gelu.cc` to `onnxruntime/core/providers/cpu/tensor/gelu.h/cc`, as the implementation for approximate attribute to be 'tanh'. 3. Register ONNX domain Gelu CPU kernel from opset 20 in `onnxruntime/core/providers/cpu/cpu_execution_provider.cc`. 4. Move `onnxruntime/contrib_ops/cuda/bert/fast_gelu_impl.h/cu` to `onnxruntime/core/providers/cuda/tensor/gelu_impl.h` and `onnxruntime/core/providers/cuda/tensor/gelu_approximate_impl.cu` respectively, as the implementation for approximate attribute to be 'tanh'. 5. Implement the logic for approximate attribute to be 'none' in `onnxruntime/core/providers/cuda/tensor/gelu_impl.cu`. 6. Register ONNX domain Gelu CUDA kernel from opset 20 in `onnxruntime/core/providers/cuda/cuda_execution_provider.cc`. 7. ROCM ep related changes. 8. Enrich the tests for ONNX domain Gelu in `onnxruntime/test/providers/cpu/activation/activation_op_test.cc`.	2024-02-23 11:05:16 +08:00
PeixuanZuo	6226c5f62f	[ROCm] Add SkipGroupNorm for ROCm EP (#19303 ) Add SkipGroupNorm for ROCm EP. --------- Co-authored-by: Peixuan Zuo <peixuanzuo@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>	2024-02-21 11:08:48 +08:00
Jake Mathern	7a5860e490	Fix cmake function duplicate lib (#19547 ) ### Description Fixes cmake function definition in winml.cmake to copy link flags. ### Motivation and Context XFGCheck errors in WindowsAI because this function does not transfer linker flags	2024-02-20 13:41:40 -08:00
pengwa	b55260d076	Minor fix for cmake (#19552 ) ### Minor fix for cmake When building on Linux, we get a warning saying " CMake Warning at CMakeLists.txt:1603 (message): MPI and NCCL disabled on Win build. " This message is not correct, so this fix avoids any misunderstanding by users. ![image](https://github.com/microsoft/onnxruntime/assets/10530022/848c2d77-a538-4e31-8e0d-4b539233e515) ### Motivation and Context	2024-02-19 10:21:19 +08:00
Scott McKay	4e5119760d	Add initial support for CoreML ML Program to the CoreML EP. (#19347 ) ### Description Adds infrastructure to create an ML Package containing the Model using ML Program. Updated coremltools files to v7.1 to bring in new protobuf definitions along with the tools to write the weight.bin file and create an ML Package correctly. Enables building a CoreML Model on all platforms which means all the operator builder code can be debugged anywhere. Execution of the generated CoreML model is obviously limited to Apple platforms. The Conv operator builder has been updated to be able to generate an ML Program Operation. ### Motivation and Context NeuralNetwork is no longer being developed and ML Program is the replacement going forward.	2024-02-15 08:46:03 +10:00
George Wu	5e70c6b3a6	allow protobuf lite build for TRT EP (#19498 ) allow protobuf-lite builds with TensorRT EP as long as it's built with the trt built-in parser and not the oss-parser. This is because trt built-in parser statically links protobuf so there aren't any conflicts for protobuf-lite.	2024-02-12 22:53:04 -08:00
Patrice Vignola	1182b5509b	Disable streams for the DML EP (#19481 ) There's currently a bug in the allocation planner when reusing buffers and more than one stream is used that makes it possible (although rarely) to reach a reference count of 0 for a buffer that is still being used. Since DML doesn't benefit from multiple streams, disabling them is the safest option for now. This is a high priority issue that we need to fix for 1.17.1 since it breaks stable diffusion. Identifying the perfect fix and fixing the underlying issue would be too risky for a patch release, especially given the limited time that we have. https://github.com/microsoft/onnxruntime/issues/19480	2024-02-10 00:34:34 -08:00
Changming Sun	1007d8f3d1	Revert "Revert NeuralSpeed code for x64 MatMulNBits (#19382 )" (#19474 ) This reverts commit `0d10c7f3c1`.	2024-02-09 09:24:54 -08:00
luoyu-intel	0d10c7f3c1	Revert NeuralSpeed code for x64 MatMulNBits (#19382 ) ### Description Revert PR#19016 https://github.com/microsoft/onnxruntime/pull/19016 Revert PR#17669 https://github.com/microsoft/onnxruntime/pull/17669	2024-02-07 13:04:37 -08:00
Maximilian Müller	91b2e660fe	[Build] fix: missing nvcc flags when compiling with unittests (#19308 ) When configured using the following CMake ops Clion is not able to configure due to checking with `nvcc ... --dryrun tmp.cu`: ``` cmake -G Ninja -Donnxruntime_USE_TENSORRT="ON" -Donnxruntime_USE_CUDA="ON" -Donnxruntime_USE_CUDA_NHWC_OPS="ON" -DCMAKE_CUDA_ARCHITECTURES="native" -Donnxruntime_NVCC_THREADS=1 -Donnxruntime_ENABLE_NVTX_PROFILE="ON" -Donnxruntime_USE_TENSORRT_BUILTIN_PARSER="ON" -DCMAKE_CUDA_COMPILER_LAUNCHER="ccache" -Donnxruntime_BUILD_UNIT_TESTS="ON" -Donnxruntime_USE_TRITON_KERNEL=OFF -Donnxruntime_USE_FLASH_ATTENTION=OFF ``` Without building the unittests everything works fine. I believe my changes only follow the logic that is actually desired. If `NVCC_HAS_STRICT_ALIASING` is set to false it should not be possible to add this as a CUDA flag. Same is true for `HAS_NOERROR` as seen in `adjust_global_compile_flags.cmake`	2024-02-06 17:01:26 -08:00
Ye Wang	aaf32fb1b1	phi2 conversion/optimization script (#19338 ) ### Description This PR adds an onnx conversion script for dynamo exported phi2, an optimization script, and an inference example script. A readme file is added as documentation. https://github.com/microsoft/onnxruntime/tree/wangye/phi2_doc/onnxruntime/python/tools/transformers/models/phi2#readme ### Motivation and Context --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2024-02-05 10:15:16 -08:00
Scott McKay	debd1cab10	Add coremltools 7.1 as a dependency (#19389 ) ### Description Setup usage of coremltools via dependencies instead of copying files. Pull in some changes from https://github.com/microsoft/onnxruntime/pull/19347 in preparation for supporting ML Program and enabling building the ML Model on all platforms to make development and testing of CoreML EP code easier. - Update to coremltools 7.1 - Add patch for changes required for cross platform build of ML Program related code - Generate coreml proto files on all platforms - mainly to test these changes work everywhere, as the proto files will be used on all platforms when #19347 is checked in - rename onnxruntime_coreml_proto target to coreml_proto as it contains purely coreml protobuf code with no ORT related changes ### Motivation and Context Improve setup.	2024-02-03 09:42:21 +10:00
He Li	1bdd7d9499	Update oneDNN to v3.0.1 in order to support gcc 13 (#19344 ) ### Description Update the dependency of `oneDNN` to v3.0.1, which fixes a minor bug hindering gcc 13. ### Motivation and Context Referring to [oneDNN-1548](https://github.com/oneapi-src/oneDNN/issues/1548). - When building with `--use_dnnl` using gcc 13.x, it will fail due to this upstream issue. - This is fixed in `v3.0.1` [tag](https://github.com/oneapi-src/oneDNN/tree/v3.0.1) by [this commit](`1d7971ce48`).	2024-02-01 15:39:03 -08:00
Yueqing Zhang	1d6f13fb92	[VitisAI] Refactor the VAIEP to use MSFT's standalone API (#19058 ) ### Description Refactor the VAIEP to use MSFT's standalone API ### Motivation and Context Vitis ONNX RT VAI should switch to using the standalone API for ONNX EPs in order to decouple the EP from onnxruntime.dll and the providers.dll. This will help to simplify customer deployment of applications and use cases that need to share their onnxruntime.dll with other applications. --------- Co-authored-by: Zhenze Wang <zhenzew@xilinx.com> Co-authored-by: zz002 <zhenze.wang@amd.com>	2024-01-31 21:08:26 -08:00
Yi-Hong Lyu	55b60d8fe0	Turn off Neural Speed to avoid slowdowns (#19265 ) Disable Neural Speed to prevent the operation following MatMulNBits from significantly slowing down.	2024-01-31 13:40:25 -08:00
Phoebe Chen	2b361c04d6	Fix Flatbuffer build issue. (#19296 ) ### Description Building on g++ 13.2.0 results in -Wstringop-overread errors on Linux. This commit addresses the flatbuffer build issue with the following changes: 1. Remove the Werror flag in the flatbuffer patch. 2. Add a compilation option to suppress the 'stringop-overflow' error in the Flatbuffers within the xnnpack provider. ### Motivation and Context https://github.com/google/flatbuffers/issues/8119 https://github.com/microsoft/onnxruntime/pull/19239 Signed-off-by: Phoebe Chen <phoebe.chen@sifive.com>	2024-01-31 10:12:43 -08:00
Changming Sun	8dad9d92f4	Move einsum's test data to constexpr variables (#19320 ) ### Description emscripten's C++ compiler has difficulty compiling einsum_test.cc because the file has too many local variables. So I moved them to constexpr.	2024-01-30 15:59:37 -08:00
Changming Sun	a92802f940	Disable a few tests for wasm build (#19316 )	2024-01-30 08:16:57 -08:00
Tianlei Wu	8b4517218b	Remove USE_CUTLASS flag (#19271 ) ### Description Since Cutlass can be built with CUDA 11.4 (The minimum CUDA version for onnxruntime CUDA build), there is no need to have a flag to disable cutlass. Changes: (1) Reverted https://github.com/microsoft/onnxruntime/pull/18761 (2) remove the condition to build cutlass. (3) Fix a few build errors or warnings during testing CUDA 11.4 build. Note that SM 89 and 90 (including fp8) requires CUDA 11.8 or later. Flash attention and cutlass fused multihead attention will not be built for CUDA < 11.6. It is recommended to use CUDA 11.8 or above to build if you want to support latest GPUs. It is better to include it in 1.17.0 (otherwise, the release branch might encounter build failure with CUDA 11.4). Tests: (1) Build with flash attention and efficient attention off: passed (2) Build with CUDA 11.4: passed Example build command used in Ubuntu 20.04: ``` export CUDA_HOME=/usr/local/cuda-11.4 export CUDNN_HOME=/usr/lib/x86_64-linux-gnu/ export CUDACXX=/usr/local/cuda-11.4/bin/nvcc sh build.sh --config Release --build_shared_lib --parallel --use_cuda --cuda_version 11.4 \ --cuda_home $CUDA_HOME --cudnn_home $CUDNN_HOME --build_wheel --skip_tests \ --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=80 \ --disable_types float8 ``` ### Motivation and Context	2024-01-25 16:57:58 -08:00
PeixuanZuo	1c92e56dc0	[Cuda] Refactor GroupNorm (#19146 ) Split the GroupNorm implementation into multiple files so that the ROCm EP can reuse the CUDA code. Related PR: https://github.com/microsoft/onnxruntime/pull/19158 --------- Co-authored-by: Peixuan Zuo <peixuanzuo@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>	2024-01-25 22:28:47 +08:00
Phoebe Chen	4477f57ee3	Enable RISC-V 64-bit Cross-Compiling Support for ONNX Runtime on Linux (#19238 ) ### Description This pull request introduces the necessary changes to enable RISC-V 64-bit cross-compiling support for the ONNX Runtime on Linux. The RISC-V architecture has gained popularity as an open standard instruction set architecture, and this contribution aims to extend ONNX Runtime's compatibility to include RISC-V, thereby broadening the reach of ONNX models to a wider range of devices. ### Motivation and Context RISC-V is a free and open-source instruction set architecture (ISA) based on established RISC principles. It is provided under open licenses without fees. Due to its extensibility and freedom in both software and hardware, RISC-V is poised for widespread adoption in the future, especially in applications related to AI, parallel computing, and data centers. ### Example Build Command ``` ./build.sh --parallel --config Debug --rv64 --riscv_toolchain_root=/path/to/toolchain/root --skip_tests ``` ### Documentation Updates Relevant sections of the documentation will be updated to reflect the newly supported RISC-V 64-bit cross-compilation feature. https://github.com/microsoft/onnxruntime/pull/19239 --------- Signed-off-by: Phoebe Chen <phoebe.chen@sifive.com>	2024-01-24 16:27:05 -08:00
Changming Sun	bc54ad3f03	Update abseil to a release tag and register neural_speed (#19255 ) ### Description Update abseil to a release tag and register neural_speed to CG. ### Motivation and Context Now we are using a non-released version of abseil. Using a tag is better.	2024-01-24 14:37:39 -08:00
Jeff Daily	b2aec41a83	[ROCm] enable hipGraph (#18382 ) This ports the cudaGraph support from the CUDA EP to the ROCM EP's hipGraph.	2024-01-23 11:17:04 +08:00
snadampal	77da2ef278	[aarch64] Add Sbgemm kernel to accelerate fp32 tensor matmul with bfloat16 (#17031 ) ### Description This PR adds SbgemmKernel for aarch64. This includes the Sbgemm kernel to implement matrix multiplication with bfloat16 SIMD instructions (bfmmla) and MatMul operator changes to invoke the Sbgemm kernel. To enable the Sbgemm kernel, set the following session option: "kOrtSessionOptionsGemmFastMathMode" The PR also adds new test cases for mlas and ort. ### Motivation and Context This is to improve MatMul performance on the aarch64 platform. I have run the below benchmarking script (bert, roberta and gpt2 model inference) on an AWS Graviton3 based c7g.4xl instance and observed a 1.2x-1.76x performance improvement compared to the sgemm (fp32) kernel performance. ``` cd onnxruntime/python/tools/transformers python3 benchmark.py ``` And the unit test precision results match the sgemm kernel results. `./build.sh --config RelWithDebInfo --build_shared_lib --parallel --compile_no_warning_as_error --skip_submodule_sync `	2024-01-22 14:43:06 -08:00
Edward Chen	c8ce83967e	Download protoc for all Apple host builds, remove protoc build from iOS packaging pipeline. (#19209 )	2024-01-19 15:30:09 -08:00
luoyu-intel	459c750b03	Update x64 template kernel library for 'sqnbitgemm' (#19016 ) ### Description 1. Make JBLAS codes an external module of ORT. 2. Move q4 gemm code to contrib_ops. 3. Update template kernel library to v0.1 release. ### Motivation and Context We found that the current LLM model performance is far below our expectations. Here is some performance data collected on Mistral-7B model with Xeon-8480: 8 threads \| prompt length=32 past_len=32 \| prompt length=1 past_len=32 -- \| -- \| -- ORT-main \| 1220ms \| 263ms Neural-speed \| 564ms \| 87ms ORT-this PR\|597ms\|120ms Although `Neural-speed` and `ORT-this PR` use the same int4 kernel code, there is a 33ms (87ms vs. 120ms) latency gap between the two frameworks. Through some statistics analysis, the summary latency of `MatMulNBits` is 86.7ms. The summary latency of all int4 GEMMs in `Neural-speed` is 84.8ms. So other OPs introduce an extra 30ms latency. The performance of MatMulNBits in this PR meets our expectations. ### Remaining Issues 1. For hybrid CPUs, like core 12900K, the ONNXRuntime thread pool uses TaskGranularityFactor to scale its number of threads. This is not expected in our code design. It may slow down the hybrid CPU performance by 30~40%. 2. Prepack uses a single thread, which makes session initialization very slow. 3. MatMulNBits with zero points will fall through to COMP_FP32 even when accuracy_level=4. Our COMP_INT8 IGemmCore with zero points process is not optimized for now. It will be updated in the future. So, for an int4 model with zero points, it makes no difference whether the accuracy_level is 0 or 4.	2024-01-18 13:16:34 -08:00
Maximilian Müller	bc219ed553	[TensorRT EP] Enable a minimal CUDA EP compilation without kernels (#19052 ) Addresses https://github.com/microsoft/onnxruntime/issues/18542. I followed the advice given by @RyanUnderhill [here](https://github.com/microsoft/onnxruntime/pull/18731#issuecomment-1848261925) and went with a minimal CUDA EP for now.	2024-01-17 11:33:34 -08:00
Wanming Lin	07d3aed3aa	[WebNN EP] Fixed build issue with disable_rtti (#19173 ) Previously building webnn ep with --disable_rtti will throw unboundTypeError since unbound type names are illegal with RTTI disabled in Embind API, we can fix it by adding a -DEMSCRIPTEN_HAS_UNBOUND_TYPE_NAMES=0 flag.	2024-01-16 21:35:13 -08:00
Changming Sun	e2e488d6f8	Revert "iOS packaging pipeline stability" (#19135 ) Reverts microsoft/onnxruntime#19097 because it broke the Android CI pipeline.	2024-01-16 09:18:35 -08:00
Jeff Bloomfield	8d4369b77e	Update DirectML nuget version to 1.13.1 (#19122 ) ### Description Update DML version to 1.13.1 ### Motivation and Context	2024-01-15 19:04:41 -08:00
Edward Chen	e1e45901e2	iOS packaging pipeline stability (#19097 ) - Remove protoc build step which sometimes times out. Download protoc instead. - Use macOS-12 image in the set variables stage. It seems more stable.	2024-01-13 19:27:44 -08:00
Edward Chen	150c4cb8fe	[MLAS AArch64] SQNBitGemm CompInt8 kernel (#18953 ) Implement ARM NEON SQNBitGemm kernel that first block quantizes A to int8 and then does int8 multiplication.	2024-01-12 17:58:08 -08:00
Guenther Schmuelling	96dbac6e4b	update to emsdk-3.1.51 (#18844 )	2024-01-12 16:04:33 -08:00
Preetha Veeramalai	c340bf08f6	Openvino EP code changes for 1.17 update (#19023 ) ### Description Introduce AppendExecutionProvider_OpenVINO_V2 API and support for OV 2023.3. ### Context - The API is added to facilitate customers in using published official Microsoft onnxruntime libraries with OVEP libraries. - Add support for OpenVINO 2023.3 official release. - Extend operator coverage - GH fixes --------- Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>	2024-01-12 13:20:51 -08:00
Changming Sun	0e8d4c3d21	Enable Address Sanitizer in CI (#19073 ) ### Description 1. Add two build jobs for enabling Address Sanitizer in CI. One for Windows CPU, One for Linux CPU. 2. Set default compiler flags/linker flags in build.py for normal Windows/Linux/MacOS build. This can help control compiler flags in a more centralized way. 3. All Windows binaries in our official packages will be built with "/PROFILE" flag. Symbols of onnxruntime.dll can be found at [Microsoft public symbol server](https://learn.microsoft.com/en-us/windows-hardware/drivers/debugger/microsoft-public-symbols). Limitations: 1. On Linux Address Sanitizer ignores RPATH settings in ELF binaries. Therefore once Address Sanitizer is enabled, before running tests we need to manually set LD_LIBRARY_PATH properly otherwise libonnxruntime.so may not be able to find custom ops and shared EPs. 4. On Linux we also need to set LD_PRELOAD before running some tests (if the main executable, like python, is not built with address sanitizer). On Windows we do not need to. 5. On Windows before running python tests we should manually copy address sanitizer DLL to the onnxruntime/capi directory, because python 3.8 and above has enabled "Safe DLL Search Mode" that wouldn't use the information provided by PATH env. 6. On Linux Address Sanitizer found a lot of memory leaks from our python binding code. Therefore right now we cannot enable Address Sanitizer when building ONNX Runtime with python binding. 7. Address Sanitizer itself uses a lot of memory address space and delays memory deallocations, which can easily cause OOM issues in 32-bit applications. We cannot run all the tests in onnxruntime_test_all in 32-bit mode with Address Sanitizer due to this reason. However, we still can run individual tests in such a way. We just cannot run all of them in one process. ### Motivation and Context To catch memory issues.	2024-01-12 07:24:40 -08:00
PeixuanZuo	5f3113ecd6	[ROCm] Fix hipify error: fast_divmod.h: No such file or directory (#19060 ) Fix error: ``` [ 48%] Built target onnxruntime_optimizer In file included from /onnxruntime_src/onnxruntime/core/providers/rocm/rocm_stream_handle.cc:5: /onnxruntime_src/onnxruntime/core/providers/rocm/rocm_common.h:11:10: fatal error: core/providers/rocm/shared_inc/fast_divmod.h: No such file or directory 11 \| #include "core/providers/rocm/shared_inc/fast_divmod.h" \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ compilation terminated. ``` This error is due to onnxruntime_optimizer missing dependencies on hipify generated files.	2024-01-10 14:49:19 +08:00
Milos Puzovic	37ac9d391c	Enable Arm Compute Library 23.08 (#17672 ) ### Description This PR enables onnxruntime to build with the most recent release of Arm Compute Library ### Motivation and Context The latest version of Arm Compute Library that onnxruntime can build is 20.02 which is more than 3 years old.	2024-01-09 14:10:25 -08:00
Changming Sun	68c29ece23	In a Linux or Android build check if the compiler support bfloat16 and float16 (#18813 ) ### Description Restrict clang version because we have an upcoming change that requires clang version >=16 , which will mainly affect Android build.	2024-01-08 19:46:33 -08:00
Jeff Bloomfield	55a669409a	Merge pull request #18983 from microsoft/WindowsAI Merge WindowsAI to main	2024-01-04 17:21:19 -08:00
Wei-Sheng Chin	658e30eb33	Remove DORT since it's in PyTorch main now (#18996 ) The main code is removed and the tests are modified to use DORT directly from PyTorch.	2024-01-04 12:59:47 -08:00
Xavier Dupré	889b1ef2d1	Fix schema type constraint for custom operators (#17497 ) ### Description onnxruntime may raise an error "type inference failed" when a custom operator sets IsHomogeneous to false in its schema. This change makes sure that TypeInferenceFunction and schema type constraints are aligned to prevent that from happening. --------- Co-authored-by: Xavier Dupre <xadupre@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net> Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>	2024-01-04 20:27:46 +01:00
Yulong Wang	b18abaaa2c	[js/web] wait for threadpool initialization (#18952 ) ### Description a replacement of #18683. try to resolve #18689. By specifying "-s PTHREAD_POOL_SIZE" flag in emscripten, it forces the threadpool to initialize before the webassembly instance is available.	2024-01-04 08:06:55 -08:00
Sheil Kumar	107d7492b9	[DirectML EP] Add DML EP registration for Col2Im (#17786 ) ### Description [DirectML EP] Add DML EP registration for Col2Im operator ### Motivation and Context Add Col2Im support for opset 18. This operator is implemented as the DirectML Fold operator. --------- Co-authored-by: Sheil Kumar <sheilk@microsoft.com> Co-authored-by: Dwayne Robinson <dwayner@microsoft.com>	2024-01-03 16:13:14 -08:00
Jeff Bloomfield	c3d96a7b35	Update DML version to 1.13.0 (#18978 ) Update DML nuget version to 1.13.0	2024-01-03 16:09:55 -08:00
Sheil Kumar	dbb8680bdc	Delay load dxcore.dll in addition to ext-ms-win-dxcore-l1-1-0.dll (#18913 ) Delay load dxcore.dll in addition to ext-ms-win-dxcore-l1-1-0.dll Co-authored-by: Sheil Kumar <sheilk@microsoft.com>	2023-12-26 12:33:42 -08:00
pengwa	37f743680a	Fix build when flash attention and memory efficient attention are disabled (#18761 ) ### Fix build when flash attention and memory efficient attention are disabled On a customer env with a CUDA version lower than 11.6, both flash attention and memory efficient attention are turned OFF according to `e8f33b54ba/cmake/CMakeLists.txt (L701)`. So the `e8f33b54ba/cmake/external/cutlass.cmake (L1)` condition check returns false. No cutlass lib is built. ``` Turn off flash attention since CUDA compiler version < 11.6 ``` However, the kernels in https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/contrib_ops/cuda/moe/ft_moe depend on cutlass for their build, so we get an error like this: ``` [ 77%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_fp16_fp16.cu.o In file included from /tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_fp16_fp16.cu:17: /tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_template.h:23:10: fatal error: cutlass/array.h: No such file or directory 23 \| #include "cutlass/array.h" \| ^~~~~~~~~~~~~~~~~ compilation terminated. In file included from /tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_fp16_fp16.cu:17: /tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_template.h:23:10: fatal error: cutlass/array.h: No such file or directory 23 \| #include "cutlass/array.h" \| ^~~~~~~~~~~~~~~~~ compilation terminated. In file included from /tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_fp16_fp16.cu:17: /tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_template.h:23:10: fatal error: cutlass/array.h: No such file or directory 23 \| #include "cutlass/array.h" \| ^~~~~~~~~~~~~~~~~ compilation terminated. 
In file included from /tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_fp16_fp16.cu:17: /tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_template.h:23:10: fatal error: cutlass/array.h: No such file or directory 23 \| #include "cutlass/array.h" \| ^~~~~~~~~~~~~~~~~ compilation terminated. fatal : Could not open input file /tmp/tmpxft_00044da3_00000000-11_moe_gemm_kernels_fp16_fp16.compute_60.cpp1.ii make[2]: * [CMakeFiles/onnxruntime_providers_cuda.dir/build.make:6290: CMakeFiles/onnxruntime_providers_cuda.dir/tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_fp16_fp16.cu.o] Error 1 make[2]: * Waiting for unfinished jobs.... make[1]: * [CMakeFiles/Makefile2:2210: CMakeFiles/onnxruntime_providers_cuda.dir/all] Error 2 make: * [Makefile:166: all] Error 2 Traceback (most recent call last): File "/tmp/onnxruntime/tools/ci_build/build.py", line 2746, in <module> sys.exit(main()) File "/tmp/onnxruntime/tools/ci_build/build.py", line 2639, in main build_targets(args, cmake_path, build_dir, configs, num_parallel_jobs, args.target) File "/tmp/onnxruntime/tools/ci_build/build.py", line 1527, in build_targets run_subprocess(cmd_args, env=env) File "/tmp/onnxruntime/tools/ci_build/build.py", line 824, in run_subprocess return run(*args, cwd=cwd, capture_stdout=capture_stdout, shell=shell, env=my_env) File "/tmp/onnxruntime/tools/python/util/run.py", line 49, in run completed_process = subprocess.run( File "/opt/conda/lib/python3.8/subprocess.py", line 516, in run raise CalledProcessError(retcode, process.args, ``` ### Motivation and Context To summarize, there are two cases in which the Linux CUDA build will fail: 1. The user uses a CUDA version < 11.6 2. The user explicitly disabled flash attention and memory efficient attention with onnxruntime_USE_FLASH_ATTENTION and onnxruntime_USE_MEMORY_EFFICIENT_ATTENTION	2023-12-26 08:57:58 +08:00
luoyu-intel	5f00bc9931	Integrate high-performance x64 gemm library to MLAS (#17669 ) ### Description Improve MLAS to support high-performance x64 INT4 kernels ### Motivation and Context 1. improve LLM inference performance on Intel CPUs. 2. support more 4bit quantization types: nf4, fp4 3. support dynamic block size: block size aligned with kernel's tiling size(e.g. 4 for VNNI kernel), per channel on N dimension 4. support most Intel ISAs: avx2, avx_vnni, avx512f, avx512_vnni, amx_bf16, amx_int8, avx512_fp16 5. support MatMulNBits' data format ### Tasks - [x] support block_size: 32, 128, -1(per channel) - [x] get weight pack size without memory allocation - [x] use ort's thread pool for parallelism - [x] support ISAs: avx2, avx512f, avx_vnni, avx512_vnni, amx_int8 ### Benchmark Ubuntu 20.22 + Intel(R) Xeon(R) Platinum 8480+ 56 cores Benchmark \| Time \| CPU \| Iterations -- \| -- \| -- \| -- Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:4096/Threads:56/real_time \| 47613 \| 47401 \| 12970 Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:4096/Threads:56/real_time \| 6347792 \| 6317562 \| 109 Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:4096/Threads:56/real_time \| 11814014 \| 11757847 \| 59 Q4GEMM_Jblas/Q4G128SymInt8/M:1/N:4096/K:4096/Threads:56/real_time \| 50222 \| 50031 \| 13759 Q4GEMM_Jblas/Q4G128SymInt8/M:1024/N:4096/K:4096/Threads:56/real_time \| 2038222 \| 2028743 \| 341 Q4GEMM_Jblas/Q4G128SymInt8/M:2048/N:4096/K:4096/Threads:56/real_time \| 3792832 \| 3774485 \| 191 Q4GEMM_Jblas/Q4GPerNSymInt8/M:1/N:4096/K:4096/Threads:56/real_time \| 58717 \| 58501 \| 11467 Q4GEMM_Jblas/Q4GPerNSymInt8/M:1024/N:4096/K:4096/Threads:56/real_time \| 1360846 \| 1354598 \| 543 Q4GEMM_Jblas/Q4GPerNSymInt8/M:2048/N:4096/K:4096/Threads:56/real_time \| 2564232 \| 2551365 \| 266 Q4GEMM_Jblas/Q4G32SymFp32/M:1/N:4096/K:4096/Threads:56/real_time \| 57929 \| 57694 \| 12047 Q4GEMM_Jblas/Q4G32SymFp32/M:1024/N:4096/K:4096/Threads:56/real_time \| 5495330 \| 5465810 \| 126 
Q4GEMM_Jblas/Q4G32SymFp32/M:2048/N:4096/K:4096/Threads:56/real_time \| 10676240 \| 10617817 \| 66 Q4GEMM_Jblas/Q4G128SymFp32/M:1/N:4096/K:4096/Threads:56/real_time \| 68305 \| 68047 \| 10026 Q4GEMM_Jblas/Q4G128SymFp32/M:1024/N:4096/K:4096/Threads:56/real_time \| 5504862 \| 5476215 \| 126 Q4GEMM_Jblas/Q4G128SymFp32/M:2048/N:4096/K:4096/Threads:56/real_time \| 11758623 \| 11697337 \| 66 Q4GEMM_Jblas/Q4GPerNSymFp32/M:1/N:4096/K:4096/Threads:56/real_time \| 67713 \| 67451 \| 10298 Q4GEMM_Jblas/Q4GPerNSymFp32/M:1024/N:4096/K:4096/Threads:56/real_time \| 5508325 \| 5480237 \| 126 Q4GEMM_Jblas/Q4GPerNSymFp32/M:2048/N:4096/K:4096/Threads:56/real_time \| 10738528 \| 10681656 \| 64 Q4GEMM_Jblas/Q4G32AsymFp32/M:1/N:4096/K:4096/Threads:56/real_time \| 60708 \| 60486 \| 11321 Q4GEMM_Jblas/Q4G32AsymFp32/M:1024/N:4096/K:4096/Threads:56/real_time \| 5523784 \| 5495736 \| 126 Q4GEMM_Jblas/Q4G32AsymFp32/M:2048/N:4096/K:4096/Threads:56/real_time \| 10829633 \| 10772161 \| 67 Reference: Benchmark \| Time \| CPU \| Iterations -- \| -- \| -- \| -- Q4GEMM/Q4Sym/M:1/N:4096/K:4096/Threads:56/real_time \| 53088 \| 52911 \| 13364 Q4GEMM/Q4Sym/M:1024/N:4096/K:4096/Threads:56/real_time \| 6268981 \| 6230335 \| 110 Q4GEMM/Q4Sym/M:2048/N:4096/K:4096/Threads:56/real_time \| 11701237 \| 11632339 \| 59 Win11+12900K 8 cores: Benchmark \| Time \| CPU \| Iterations -- \| -- \| -- \| -- Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:4096/Threads:8/real_time \| 215976 \| 211295 \| 2884 Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:4096/Threads:8/real_time \| 60960590 \| 60937500 \| 10 Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:4096/Threads:8/real_time \| 1.18E+08 \| 1.19E+08 \| 5 Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:11008/K:4096/Threads:8/real_time \| 470377 \| 453059 \| 1414 Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:11008/K:4096/Threads:8/real_time \| 1.54E+08 \| 1.53E+08 \| 5 Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:11008/K:4096/Threads:8/real_time \| 3.18E+08 \| 3.13E+08 \| 2 
Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:11008/Threads:8/real_time \| 569072 \| 559398 \| 1229 Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:11008/Threads:8/real_time \| 1.54E+08 \| 1.52E+08 \| 4 Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:11008/Threads:8/real_time \| 3.22E+08 \| 3.28E+08 \| 2 Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:11008/K:11008/Threads:8/real_time \| 1486055 \| 1473325 \| 403 Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:11008/K:11008/Threads:8/real_time \| 4.14E+08 \| 4.14E+08 \| 2 Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:11008/K:11008/Threads:8/real_time \| 8.88E+08 \| 8.59E+08 \| 1 --------- Signed-off-by: Mengni Wang <mengni.wang@intel.com> Co-authored-by: Mengni Wang <mengni.wang@intel.com>	2023-12-19 09:36:31 -08:00
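The `block_size` options listed in the commit above (32, 128, or -1 for per channel) can be illustrated with a small sketch of symmetric 4-bit block quantization. This is not the MLAS or jblas kernel code: the function names, the choice of mapping the largest magnitude in a block to 7, and the treatment of a 1-D list as a single channel are all assumptions made for illustration.

```python
from typing import List, Tuple


def quantize_blockwise_sym4(values: List[float], block_size: int) -> Tuple[List[int], List[float]]:
    """Symmetric 4-bit quantization with one scale per block (illustrative only).

    block_size == -1 is treated as "one block for the whole vector", which is how
    this sketch models the per-channel case for a single channel.
    """
    if not values:
        return [], []
    if block_size == -1:
        block_size = len(values)
    if block_size <= 0:
        raise ValueError("block_size must be positive or -1")

    quantized: List[int] = []
    scales: List[float] = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        max_abs = max(abs(v) for v in block)
        # Assumption: map the largest magnitude in the block to the value 7.
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        # Clamp to the signed 4-bit range [-8, 7].
        quantized.extend(max(-8, min(7, round(v / scale))) for v in block)
    return quantized, scales


def dequantize_blockwise(quantized: List[int], scales: List[float], block_size: int) -> List[float]:
    """Invert quantize_blockwise_sym4 up to rounding error."""
    if block_size == -1:
        block_size = max(len(quantized), 1)
    return [q * scales[i // block_size] for i, q in enumerate(quantized)]
```

A smaller block size stores more scales but tracks local value ranges more closely, which is the usual trade-off behind offering several block sizes.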
Changming Sun	f52668cc68	Disable mlas unit test in ARM64EC build (#18747 ) ### Description Disable mlas unit test in ARM64EC build because the program has some link errors. We will fix the errors later. This PR only impacts Windows ARM64EC build. It has no impact on the existing build pipelines.	2023-12-15 09:17:47 -08:00
Changming Sun	d795fc636c	FIX: Our cmake script didn't check googletest's hash (#18826 )	2023-12-15 08:48:15 -08:00
Changming Sun	cbad4fe49b	Update absl and googletest (#18827 ) ### Description Update absl and googletest to their latest version to include some cmake changes: 1. A googletest's cmake change that will allow using external absl and re2. 2. Nullability enhancements that will allow our clang-based static analysis detecting many kinds of null pointer errors. ### Motivation and Context To fix a C4744 link warning in our Windows pipelines. ``` LINK : warning C4744: 'static char const absl::lts_20230802::base_internal::FastTypeTag<bool>::dummy_var' has different type in 'd:\a\_work\_temp\abseil_cpp\abseil-cpp-20230802.0\absl\flags\parse.cc' and 'd:\a\_work\1\b\relwithdebinfo\_deps\googletest-src\googletest\src\gtest-all.cc': 'signed char' and 'unsigned char' [D:\a\_work\1\b\RelWithDebInfo\onnxruntime_mlas_test.vcxproj] LINK : warning C4744: 'static char const absl::lts_20230802::base_internal::FastTypeTag<class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> > >::dummy_var' has different type in 'd:\a\_work\_temp\abseil_cpp\abseil-cpp-20230802.0\absl\flags\parse.cc' and 'd:\a\_work\1\b\relwithdebinfo\_deps\googletest-src\googletest\src\gtest-all.cc': 'signed char' and 'unsigned char' [D:\a\_work\1\b\RelWithDebInfo\onnxruntime_mlas_test.vcxproj] LINK : warning C4744: 'static char const absl::lts_20230802::base_internal::FastTypeTag<class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> > >::dummy_var' has different type in 'd:\a\_work\_temp\abseil_cpp\abseil-cpp-20230802.0\absl\flags\internal\usage.cc' and 'd:\a\_work\1\b\relwithdebinfo\_deps\googletest-src\googletest\src\gtest-all.cc': 'signed char' and 'unsigned char' [D:\a\_work\1\b\RelWithDebInfo\onnxruntime_mlas_test.vcxproj] LINK : warning C4744: 'static char const absl::lts_20230802::base_internal::FastTypeTag<bool>::dummy_var' has different type in 'd:\a\_work\_temp\abseil_cpp\abseil-cpp-20230802.0\absl\flags\internal\flag.cc' and 
'd:\a\_work\1\b\relwithdebinfo\_deps\googletest-src\googletest\src\gtest-all.cc': 'signed char' and 'unsigned char' [D:\a\_work\1\b\RelWithDebInfo\onnxruntime_mlas_test.vcxproj] LINK : warning C4744: 'static char const absl::lts_20230802::base_internal::FastTypeTag<class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> > >::dummy_var' has different type in 'd:\a\_work\_temp\abseil_cpp\abseil-cpp-20230802.0\absl\flags\internal\flag.cc' and 'd:\a\_work\1\b\relwithdebinfo\_deps\googletest-src\googletest\src\gtest-all.cc': 'signed char' and 'unsigned char' [D:\a\_work\1\b\RelWithDebInfo\onnxruntime_mlas_test.vcxproj] LINK : warning C4744: 'static char const absl::lts_20230802::base_internal::FastTypeTag<int>::dummy_var' has different type in 'd:\a\_work\_temp\abseil_cpp\abseil-cpp-20230802.0\absl\flags\internal\flag.cc' and 'd:\a\_work\1\b\relwithdebinfo\_deps\googletest-src\googletest\src\gtest-all.cc': 'signed char' and 'unsigned char' [D:\a\_work\1\b\RelWithDebInfo\onnxruntime_mlas_test.vcxproj] ```	2023-12-14 16:15:07 -08:00
Yueqing Zhang	b42d4b8ea6	[VitisAI] 1. api compatbile 2. dynamic load onnx (#18470 ) ### Description 1. Add a backward-compatible API for compiling models. 2. Load vitisai-ep.dll at run time. --------- Co-authored-by: Yueqing Zhang <yueqingz@amd.com> Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>	2023-12-14 14:43:41 -08:00
Suryaprakash Shanmugam	0723dcb8b5	OpenVINO Execution Provider with 2023.2 support (#18596 ) - Add support for OpenVINO 2023.2 - num_of_threads provider option is mapped to the CPU device property inference_num_threads of the CPU plugin, so users can control the #threads used for inference by the CPU - Logging in Debug mode now includes the runtime properties set for devices - Fix issue in using external weights through OpenVINO --------- Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>	2023-12-13 15:56:43 -08:00
Adrian Lizarraga	81796a3081	[QNN EP Quantization] Add fusion preprocessing to QNN quantization (#18719 ) ### Description - Adds graph fusions to a preprocessing step that can be called before creating a QDQ model for QNN EP. - Fuse Erf sequence to Gelu (adapted from [optimizer.py](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/fusion_gelu.py)). Required by QNN EP. - Fuse ReduceMean sequence to LayerNormalization (adapted from [optimizer.py](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/fusion_layernorm.py)). Not required by QNN EP. - Fuse ReduceL2 sequence to LpNormalization (new, specific to QNN EP). Required by QNN EP. Example use:
```python3
from quantization import quantize, QuantType
from quantization.execution_providers.qnn import get_qnn_qdq_config, qnn_preprocess_model

# Added by this PR:
model_updated = qnn_preprocess_model("model.fp32.onnx", "model.fp32.preprocessed.onnx", fuse_layernorm=True)
model_to_quantize = "model.fp32.preprocessed.onnx" if model_updated else "model.fp32.onnx"

# Quantize model ...
# data_reader: a calibration data reader, assumed to be defined elsewhere
qnn_config = get_qnn_qdq_config(model_to_quantize, data_reader, activation_type=QuantType.QUInt16)
quantize(model_to_quantize, "model.qdq.onnx", qnn_config)
```
### Motivation and Context Allow more models to be quantized for use with QNN EP --------- Signed-off-by: adrianlizarraga <adlizarraga@microsoft.com>	2023-12-12 08:43:04 -08:00
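To see why an Erf sequence can be replaced by a single Gelu node, the sketch below evaluates one common ordering of the unfused pattern and compares it with a direct Gelu formula. The exact node order matched by the fusion is not stated in the commit, so the `Div -> Erf -> Add -> Mul -> Mul` ordering here is an assumption, and the function names are hypothetical.

```python
import math


def erf_sequence(x: float) -> float:
    """Unfused pattern, one node per statement (assumed ordering)."""
    t = x / math.sqrt(2.0)   # Div
    t = math.erf(t)          # Erf
    t = t + 1.0              # Add
    t = x * t                # Mul
    return 0.5 * t           # Mul


def gelu(x: float) -> float:
    """Exact Gelu written as x * Phi(x), with Phi the standard normal CDF."""
    return x * 0.5 * math.erfc(-x / math.sqrt(2.0))


if __name__ == "__main__":
    for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
        print(x, erf_sequence(x), gelu(x))
```

Both functions compute the same value up to floating-point rounding, which is what makes the fusion safe from a numerical point of view.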
Changming Sun	c7799d7058	Build fixes for Windows ARM32 desktop build (#18752 ) ### Description Fix a link error: ``` onnxruntime_common.lib(cpuid_info.obj) : error LNK2019: unresolved external symbol __imp_RegGetValueA referenced in function "privat e: void __cdecl onnxruntime::CPUIDInfo::ArmWindowsInit(void)" (?ArmWindowsInit@CPUIDInfo@onnxruntime@@AAAXXZ) [C:\Users\snnn\src\on nxruntime\build\ARM32\RelWithDebInfo\onnx_test_runner.vcxproj] onnxruntime_common.lib(telemetry.cc.obj) : error LNK2019: unresolved external symbol __imp_EventRegister referenced in function "pub lic: __cdecl onnxruntime::WindowsTelemetry::WindowsTelemetry(void)" (??0WindowsTelemetry@onnxruntime@@QAA@XZ) [C:\Users\snnn\src\on nxruntime\build\ARM32\RelWithDebInfo\onnx_test_runner.vcxproj] onnxruntime_common.lib(telemetry.cc.obj) : error LNK2019: unresolved external symbol __imp_EventUnregister referenced in function "p ublic: virtual __cdecl onnxruntime::WindowsTelemetry::~WindowsTelemetry(void)" (??1WindowsTelemetry@onnxruntime@@UAA@XZ) [C:\Users\y ilyu\src\onnxruntime\build\ARM32\RelWithDebInfo\onnx_test_runner.vcxproj] onnxruntime_common.lib(telemetry.cc.obj) : error LNK2019: unresolved external symbol __imp_EventSetInformation referenced in functio n "public: __cdecl onnxruntime::WindowsTelemetry::WindowsTelemetry(void)" (??0WindowsTelemetry@onnxruntime@@QAA@XZ) [C:\Users\snnn\ src\onnxruntime\build\ARM32\RelWithDebInfo\onnx_test_runner.vcxproj] onnxruntime_common.lib(telemetry.cc.obj) : error LNK2019: unresolved external symbol __imp_EventWriteTransfer referenced in function _tlgWriteTransfer_EventWriteTransfer [C:\Users\snnn\src\onnxruntime\build\ARM32\RelWithDebInfo\onnx_test_runner.vcxproj] C:\Users\snnn\src\onnxruntime\build\ARM32\RelWithDebInfo\RelWithDebInfo\onnx_test_runner.exe : fatal error LNK1120: 5 unresolved ex ternals [C:\Users\snnn\src\onnxruntime\build\ARM32\RelWithDebInfo\onnx_test_runner.vcxproj] ```	2023-12-08 12:45:06 -08:00
Changming Sun	bf33919afb	Update absl and gtest to fix an ARM64EC build error (#18735 ) ### Description Update absl and gtest to fix an ARM64EC build error ### Motivation and Context We need to get an important fix into ORT. The fix is: `8028a87c96`	2023-12-07 15:55:17 -08:00
junchao-loongson	4abec9749e	[mlas] add loongarch lsx and lasx optimize code (#17937 ) ### Description Hello, we (@lixing-star) are developers on the Loongson team. We add 128-bit (lsx) and 256-bit (lasx) vector optimization code for the LoongArch architecture. [100% tests passed, 0 tests failed out of 7](https://cloud.a-boat.cn:2021/api/public/dl/6831z1Bi?inline=true) ### Development Environment ``` CPU: Loongson-3C5000L uname -a: Linux localhost.localdomain 4.19.190-6.4.lns8.loongarch64 #1 SMP Thu Jul 14 12:08:04 CST 2022 loongarch64 loongarch64 loongarch64 GNU/Linux ``` ### LoongArch Documents - [LoongArch Reference Manual - Volume 1: Basic Architecture: This manual describes the basic part of the LoongArch architecture.](https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-EN.html) - [LoongArch ELF psABI: This manual describes the LoongArch ELF psABI.](https://loongson.github.io/LoongArch-Documentation/LoongArch-ELF-ABI-EN.html) - [more](https://loongson.github.io/LoongArch-Documentation/README-EN.html)	2023-12-07 11:15:59 -08:00
moyo1997	9479ba525b	Build onnxruntime.dll as arm64x (#18633 ) Added a .cmake file to generate a link repro of onnxruntime.dll during the arm64 build. This provides a directory containing all the arm64 objs, the def file, and the libs to link against when building the arm64x onnxruntime.dll during the arm64ec build, which is done by passing the /machine:arm64x flag to the linker along with the arm64 artifacts. If other DLLs need to be built as arm64x, setting the ARM64X_TARGETS variable in the top-level CMakeLists.txt to include those targets is all that is needed. Added build_arm64x.bat as a wrapper for the multiple (arm64, then arm64ec) cmake calls needed to build as arm64x. AB#22533	2023-12-06 16:49:00 -08:00
Ye Wang	c012e41f93	MoE with Expert Slicing (#18565 ) ### Description Registered the Sharded MoE op under contrib_op/cuda/collective with expert slicing. The broadcast happens just before adding the second bias (if present) and undoing the permutation. Tensor slicing is planned but not included in this PR.	2023-12-05 16:56:38 -08:00
Adrian Lizarraga	e066fca777	[Quantization] Tensor quant overrides and QNN EP quantization configuration (#18465 ) ### Description #### 1. Adds `TensorQuantOverrides` extra option Allows specifying a dictionary of tensor-level quantization overrides:
```
TensorQuantOverrides = dictionary :
    Default is {}. Set tensor quantization overrides. The key is a tensor name and the value is a list of dictionaries.
    For per-tensor quantization, the list contains a single dictionary.
    For per-channel quantization, the list contains a dictionary for each channel in the tensor.
    Each dictionary contains optional overrides with the following keys and values.
        'quant_type' = QuantType : The tensor's quantization data type.
        'scale' = Float : The scale value to use. Must also specify `zero_point` if set.
        'zero_point' = Int : The zero-point value to use. Must also specify `scale` if set.
        'symmetric' = Bool : If the tensor should use symmetric quantization. Invalid if `scale` or `zero_point` is also set.
        'reduce_range' = Bool : If the quantization range should be reduced. Invalid if `scale` or `zero_point` is also set.
        'rmax' = Float : Override the maximum real tensor value in calibration data. Invalid if `scale` or `zero_point` is also set.
        'rmin' = Float : Override the minimum real tensor value in calibration data. Invalid if `scale` or `zero_point` is also set.
```
- All of the options are optional.
- Some combinations are invalid.
  - Ex: `rmax` and `rmin` are unnecessary if the `zero_point` and `scale` are also specified.
Example for per-tensor quantization overrides:
```Python3
extra_options = {
    "TensorQuantOverrides": {
        "SIG_OUT": [{"scale": 1.0, "zero_point": 127}],
        "WGT": [{"quant_type": quantization.QuantType.QInt8, "symmetric": True, "reduce_range": True}],
        "BIAS": [{"quant_type": quantization.QuantType.QInt8, "symmetric": True, "reduce_range": True}],
    },
}
```
Example for per-channel quantization overrides (Conv weight and bias):
```Python3
extra_options = {
    "TensorQuantOverrides": {
        "WGT": [
            {
                "quant_type": quantization.QuantType.QUInt8,
                "rmin": 0.0,
                "rmax": 2.5,
                "reduce_range": True,
            },
            {
                "quant_type": quantization.QuantType.QUInt8,
                "rmin": 0.2,
                "rmax": 2.55,
                "reduce_range": False,
            },
        ],
        "BIAS": [
            {"zero_point": 0, "scale": 0.000621},
            {"zero_point": 0, "scale": 0.23},
        ],
    },
}
```
#### 2. Adds utilities to get the default QDQ configs for QNN EP Added a `quantization.execution_providers.qnn.get_qnn_qdq_config` method that inspects the model and returns suitable quantization configurations. Example usage:
```python3
from quantization import quantize, QuantType
from quantization.execution_providers.qnn import get_qnn_qdq_config

qnn_config = get_qnn_qdq_config(input_model_path, data_reader, activation_type=QuantType.QUInt16, weight_type=QuantType.QUInt8)
quantize(input_model_path, output_model_path, qnn_config)
```
### Motivation and Context Make it possible to create more QDQ models that run on QNN EP. --------- Signed-off-by: adrianlizarraga <adlizarraga@microsoft.com>	2023-12-04 17:54:58 -08:00
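The validity rules stated for `TensorQuantOverrides` (scale and zero-point must come together, and the range-related keys cannot be combined with them) can be sketched as a standalone checker. This is an illustrative sketch, not the validation code in the quantization tool: `check_overrides` and the string stand-ins for `QuantType` values are hypothetical.

```python
from typing import Any, Dict, List

ALLOWED_KEYS = {"quant_type", "scale", "zero_point", "symmetric", "reduce_range", "rmax", "rmin"}
# Keys the description marks as invalid when scale or zero_point is also set.
RANGE_KEYS = {"symmetric", "reduce_range", "rmax", "rmin"}


def check_overrides(overrides: Dict[str, List[Dict[str, Any]]]) -> List[str]:
    """Return human-readable problems found in a TensorQuantOverrides-style dict."""
    errors: List[str] = []
    for tensor_name, entries in overrides.items():
        if not isinstance(entries, list):
            errors.append(f"{tensor_name}: value must be a list of dictionaries")
            continue
        for index, entry in enumerate(entries):
            where = f"{tensor_name}[{index}]"
            unknown = sorted(set(entry) - ALLOWED_KEYS)
            if unknown:
                errors.append(f"{where}: unknown keys {unknown}")
            has_scale = "scale" in entry
            has_zero_point = "zero_point" in entry
            if has_scale != has_zero_point:
                errors.append(f"{where}: 'scale' and 'zero_point' must be set together")
            if has_scale or has_zero_point:
                clash = sorted(RANGE_KEYS & set(entry))
                if clash:
                    errors.append(f"{where}: {clash} invalid with 'scale'/'zero_point'")
    return errors


if __name__ == "__main__":
    sample = {
        "SIG_OUT": [{"scale": 1.0, "zero_point": 127}],
        "WGT": [{"quant_type": "QInt8", "symmetric": True, "reduce_range": True}],
        "BAD": [{"scale": 0.5, "rmax": 2.5}],
    }
    for problem in check_overrides(sample):
        print(problem)
```

Returning a list of messages rather than raising on the first problem makes it easier to report every invalid tensor entry in one pass.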
snadampal	05a9c95764	[DNNL] add Arm Compute Library (ACL) backend for dnnl execution provider (#15847 ) ### Description Add ACL as the DNNL runtime option for aarch64 platforms. Update the makefile and the Python wheel build script. ### Motivation and Context This enables the optimized ACL gemm kernels for the dnnl execution provider on the aarch64 platform.	2023-12-01 09:16:44 -08:00
George Wu	5c67a00d8e	Revert "remove full protobuf requirement for tensorrt ep" (#18626 ) Reverts microsoft/onnxruntime#18413. There's a timing issue here: we eventually want to get this change merged in, but we need to update OSS onnx-tensorrt first.	2023-11-29 22:27:51 -08:00
Edward Chen	14a343441d	Fix Objective-C static analysis build (#18606 ) - Patch abseil to fix a compile error about not finding `cxxabi.h`. - Fix some static analysis warnings.	2023-11-28 17:14:20 -08:00
Rachel Guo	288b80d363	Add MacOS build to ORT C Pod (#18550 ) ### Description As title. 1. Add the macOS build as an optionally enabled arch for the pod, with changes to the existing build_ios_framework/assemble_c_pod scripts. 2. Enable the macOS build arch in the iOS packaging pipeline (currently for variants other than Mobile) and check that the output artifacts are correct. 3. Write a macOS test target scheme in the test app and integrate it into the iOS packaging CI testing pipeline. Currently the changes only apply to the onnxruntime-c pod, as the original request was from ORT SPM, which consumes only the onnxruntime-c pod as the binary target. TODO: could look into adding the macOS platform to the objc pod as well. ### Motivation and Context Enable macOS platform support in CocoaPods, and potentially produce a binary target for enabling the macOS platform in SPM as well. Replaces https://github.com/microsoft/onnxruntime/pull/18334 --------- Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local> Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net> Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2023-11-28 10:11:53 -08:00
Chen Fu	05046e5452	Adding unit test for sm80 prepack (#18514 ) ### Description Prepacking code for the block q4 x fp16 GEMM CUDA kernel, for SM80 hardware ### Motivation and Context Preparing for the addition of a Q4 x FP16 GEMM kernel on Nvidia Ampere GPUs. This kernel requires sophisticated rearrangement of quantized weights to speed up loading data into the tensor cores. To facilitate the addition, this change includes the following: 1. matrix_layout.h A new layout lib that facilitates iterating over matrix elements and tiles, balancing memory safety and performance. 2. prepack_sm80.h Code for rearranging quantized weights, scales and offsets (a.k.a. prepacking). 3. blkq4_fp16_sm80_prepack_test.cc Unit tests that explicitly test the memory safety and correctness of the prepacking code. Currently the prepacking code runs on the CPU as single-threaded code. We run it on the CPU in order to minimize GPU memory fragmentation. We hope to parallelize this part of the code later, which should be straightforward with the unit tests in place.	2023-11-28 10:01:09 -08:00
Sheil Kumar	0b7048e7d6	Update winml to use #cores - #soc cores by Default as the number of intraopthreads (#18384 ) --------- Co-authored-by: Sheil Kumar <sheilk@microsoft.com>	2023-11-28 09:26:48 -08:00
cloudhan	6f3c1f9dc9	[ROCm] Update ck for GemmFloat8 (#18487 )	2023-11-23 12:06:19 +08:00
pengwa	43a5147e01	Memory optimization refactor and refinement (#17481 ) ### Memory optimization refactor and refinement Currently the memory optimizer runs graph transformations and prints recompute opportunities at INFO level, but the ORT backend emits so many INFO-level logs that users have a hard time finding that information. So we want a Python binding API to retrieve the memory optimization opportunities instead of depending on the MemoryOptimizer's default logging. Then we can print ORTModule feature statistics using this information. Also, with such an API, the analysis can run after the ORT session is created, when the allocation plan is done, so it considers buffer reuse as well. This avoids suggesting recomputation subgraphs that reuse other subgraphs' output buffers. Check https://github.com/microsoft/onnxruntime/blob/pengwa/add_devinfo_level/docs/Memory_Optimizer.md for the new flow using `MemoryOptimizer`. This pull request makes the following refactorings: 1. Print the log in the ORTModule Python script, along with the ORTModule feature enabling stats. This is implemented by exposing an API `get_serialized_ortmodule_memory_stat` to retrieve the memory optimization opportunities. 2. Analyze memory optimization opportunities taking ORT memory planning into account. This is done by first creating the execution graph without enabling MemoryOptimizer, then calling `execution_agent.get_serialized_ortmodule_memory_stat`, which internally considers the session memory allocation planner when analyzing memory optimization opportunities. As a direct result, the memory optimization opportunities can show the stashed activations that reuse other buffers. 3. Move the recompute analysis logic from memory_optimizer.h/cc to recompute_analysis.h/cc. 4. Abstract optimization strategies so that each has its own implementation. This makes introducing new strategies (for example, compression and decompression) easier. 
New logging matrix (INFO Level), in WARNING level, the details will NOT show. ``` 2023-09-13 13:25:09,249 orttraining.rank-0 [WARNING] - *** ONNX Runtime Training (ORTModule) is accelerating your model *** ORTModule is enabled with following features ON/OFF for [training] mode: ATen Executor : ON : Dispatch ATen operators to ORT's ATen executor Cast Propagation : ON : Level 1 enabled Custom Function : ON : Support custom torch.autograd.Function export and execution Memory Optimizer : ON : RecomputeConfig: Reshape+Where+BiasSoftmax+:1:-1,Cast+:1:-1, ProbeLevel: 1, available configs: Config Freq Saving(B) Saving Symbolic(Bytes) - Plan 1 : ON : Reshape+Where+BiasSoftmax+:1:-1 5 671,088,640 640.0inputs_input_ids_dim0inputs_input_ids_dim1*2 - Plan 2 : ON : Cast+:1:-1 6 402,587,648 inputs_input_ids_dim0inputs_input_ids_dim1(384.0inputs_input_ids_dim1 - 64.0) - Plan 3 : OFF : Reshape+Where+:1:-1 1 134,217,728 128.0inputs_input_ids_dim0inputs_input_ids_dim1*2 - Plan 4 : OFF : BiasSoftmax+:1:-1 1 134,086,656 128.0inputs_input_ids_dim0inputs_input_ids_dim1(inputs_input_ids_dim1 - 1) - Plan 5 : OFF : BiasGelu+:1:-1 6 125,808,640 inputs_input_ids_dim0(122880.0inputs_input_ids_dim1 - 20480.0) - Plan 6 : OFF : FusedMatMul+:1:-1 6 125,808,640 inputs_input_ids_dim0(122880.0inputs_input_ids_dim1 - 20480.0) - Plan 7 : OFF : FusedMatMul+Add+FusedMatMul+Add+Add+Add+:1:-1 5 26,214,400 25600.0inputs_input_ids_dim0inputs_input_ids_dim1 - Plan 8 : OFF : Add+:1:-1 1 5,237,760 5120.0inputs_input_ids_dim0(inputs_input_ids_dim1 - 1) - Plan 9 : OFF : Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+:1:-1 1 4,096 4.0inputs_input_ids_dim0inputs_input_ids_dim1 - Plan 10 : OFF : Cast+:2:-1 1 2,048 2.0inputs_input_ids_dim0inputs_input_ids_dim1 Compute Optimizer : ON : Enable/Disable with env ORTMODULE_ENABLE_COMPUTE_OPTIMIZER=1/0 - FLOPReduction : ON : Reduce FLOPs by upstreaming shrinking-sized ops Auto Fallback : ON : Fallback to PyTorch when encountering unsupported ops TritonOp Enabled : OFF : ORT 
will switch to Triton for executing some ops to further accelerate training. ZeRO Stage3 Support : OFF : Enable/Disable with env ORTMODULE_ENABLE_ZERO_STAGE3=1/0 Total ORT initialization overhead is 10.73s where export takes 8.39s. Other overhead details: graph builder init takes 0.06s, runtime detection takes 0.01s, graph building takes 0.31s, session creation takes 1.96s Versions: ONNX Runtime - 1.16.0+cu118, ONNX - 1.11.0 Note 1: use comma to enable multiple plans at the same time. export ORTMODULE_MEMORY_OPT_CONFIG=<plan1 config>,<plan2 config>,... Note 2: saving is calculated based on the 1st batch symbolic dim values: inputs_input_ids_dim0=1, inputs_input_ids_dim1=1024, inputs_attention_mask_dim0=1, inputs_attention_mask_dim1=1024, inputs_labels_dim0=1, inputs_labels_dim1=1024, ************************************************************************ ``` If DEVINFO level is enabled, then more details about the memory optimizations are printed. ``` MemoryInsight Summary - User config: BiasGelu+:1:-1,Cast+:2:-1 ========================================================================================================================================== \|Freq \| Memory Optimization Opportunities (Clustered by node-level activation patterns) \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|3 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph FusedMatMul+Add+Reshape+ \| \| \| Status : Disabled. 
Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+Add+Reshape+:1:-1 \| \| \| Stashed Activations: \| \| \| - ReuseFreq : Output 0(3), \| \| \| - Output 0 : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 32 x 240 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Reshape+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+:1:-1 \| \| \| Stashed Activations: \| \| \| - ReuseFreq : Output 0(2), \| \| \| - Output 0 : [ x 2560 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph FusedMatMul+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 10240 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Cast+ \| \| \| Status : Disabled. 
Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Cast+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Reshape+Where+BiasSoftmax+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Where+BiasSoftmax+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph BiasGelu+ \| \| \| Status : Enabled, requested count=-1, actual applied count=2 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 10240 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph FusedMatMul+Add+FusedMatMul+Add+Add+Add+ \| \| \| Status : Disabled. 
Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+Add+FusedMatMul+Add+Add+Add+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 2560 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Reshape+Where+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Where+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph FusedMatMul+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0(inputs_input_ids_dim1 - 1) x 10240 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Cast+ \| \| \| Status : Disabled. 
Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Cast+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 - 1 x inputs_input_ids_dim1 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 1 x 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved \| \| \| \| \| \|>>Option 2 : RecomputeWithCompromise subgraph Cast+ \| \| \| Status : Enabled, requested count=-1, actual applied count=1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 1 x 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 50% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph BiasSoftmax+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=BiasSoftmax+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 - 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. 
\| \| \| \| \| \|>>Option 1 : Recompute subgraph BiasGelu+ \| \| \| Status : Enabled, requested count=-1, actual applied count=1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0(inputs_input_ids_dim1 - 1) x 10240 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Add+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Add+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0(inputs_input_ids_dim1 - 1) x 2560 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| ========================================================================================================================================== Note: use comma as a separator for enabling more than one subgraphs. *********************************************************************** ``` ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-11-23 11:39:00 +08:00
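The log above prints one `ORTMODULE_MEMORY_OPT_CONFIG` value per recompute subgraph and notes that several subgraphs are enabled by separating entries with commas. A minimal sketch of assembling such a value follows. The helper name `build_memory_opt_config` is hypothetical, and the two numeric fields are copied from the log as-is (`1` and `-1`); their exact meaning is an assumption not spelled out in this excerpt.

```python
import os


def build_memory_opt_config(subgraphs, strategy=1, count=-1):
    """Join one `<subgraph>:<strategy>:<count>` entry per subgraph with commas.

    The field layout mirrors the strings printed in the log
    (for example `Cast+:1:-1`); the defaults are copied from it.
    """
    return ",".join(f"{name}:{strategy}:{count}" for name in subgraphs)


# Subgraph names taken from the log output above.
config = build_memory_opt_config(["Cast+", "BiasSoftmax+"])
os.environ["ORTMODULE_MEMORY_OPT_CONFIG"] = config
print(config)
```

Setting the variable from Python only has an effect if it happens before the training module reads it; exporting it in the shell, as the log suggests, avoids that ordering question.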
Dmitri Smirnov	cc542024ce	Create edges with arg positions correctly accounting for non-existing args (#18462 ) ### Description Truncate trailing non-existing arguments. Make sure we do not skip the non-existing arguments in the middle, because shape inference relies on their proper position. This also affects the argument position in the Edges, which must be properly rebuilt each time an If node branch is inlined. Make sure that when we rename Defs in subgraphs, new renamed defs are created in those subgraphs instead of pointing to outer scope defs. Add unit test. ### Motivation and Context This is a follow up for https://github.com/microsoft/onnxruntime/pull/18105 Currently, the non-trailing arguments are simply ignored and the edges are created with potentially incorrect positions.	2023-11-20 14:49:09 -08:00
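The positional rule in this commit can be illustrated with a small sketch. It assumes the ONNX convention that an empty string names a missing optional argument; `trim_trailing_missing` and `build_edges` are hypothetical helpers for illustration, not functions from the ONNX Runtime code base.

```python
def trim_trailing_missing(args):
    """Drop missing args ("") at the end, keep missing args in the middle."""
    end = len(args)
    while end > 0 and args[end - 1] == "":
        end -= 1
    return args[:end]


def build_edges(args):
    """Map each existing arg name to its original slot index."""
    trimmed = trim_trailing_missing(args)
    # Enumerate over the trimmed list WITHOUT filtering first, so a gap
    # in the middle does not shift the slots of later arguments.
    return {name: slot for slot, name in enumerate(trimmed) if name}


inputs = ["X", "", "scale", "", ""]
print(trim_trailing_missing(inputs))  # ['X', '', 'scale']
print(build_edges(inputs))            # {'X': 0, 'scale': 2}
```

Filtering out the empty names before enumerating would place `scale` in slot 1, which is the kind of incorrect position the commit message describes.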
Akshay Sonawane	97cc40d75a	Add fusion patterns for conformer-transducer model (#18461 ) ### Description Add conformer-transducer model type to optimizer. This PR adds pattern matches for attention shown below: Unfused attention: ![ct_unfused](https://github.com/microsoft/onnxruntime/assets/111780983/46c71ed8-67e0-4607-85b1-bcadba5a2956) Fused attention: ![ct_fused](https://github.com/microsoft/onnxruntime/assets/111780983/fbb91c96-0d4b-4f0b-8674-1ae3b9b9a92e)	2023-11-18 23:39:04 -08:00
Ashwini Khade	02333293de	Removed all the deprecated python training code and related tests and utils (#18333 ) ### Description Motivation for this PR is code cleanup. 1. Remove all deprecated python code related to orttrainer, old checkpoint, related tests and utils 2. Cleanup orttraining_pybind_state.cc to remove all deprecated bindings.	2023-11-17 18:19:21 -08:00
George Wu	d73073d491	remove full protobuf requirement for tensorrt ep (#18413 ) tensorrt can work with protobuf lite.	2023-11-16 20:44:27 -08:00
Yulong Wang	6f9f653ada	[wasm] increase test max memory from 2G to 4G (#18459 ) ### Description increase max memory from 2G to 4G for onnxruntime_test_all in WebAssembly build.	2023-11-15 17:51:04 -08:00
Edward Chen	0a4d76d98b	MLAS AArch64 quantized int4 Gemm kernel (#18031 ) - Implement MLAS function for quantized 4-bit int Gemm (Gemm with float A and quantized 4-bit int B) for ARM NEON. This is an initial implementation. Only the M=1 path (with M being number of rows of A and C) has any optimization attempted so far. More optimization to come in future PRs. - Connect MatMulNBits contrib op to MLAS function.	2023-11-15 09:31:54 -08:00
Ye Wang	f9af94009b	onboard MoE (#18279 ) ### Description 1. Introduce MoE CUDA op to ORT based on FT implementation. 2. Upgrade cutlass to 3.1.0 to avoid some build failures on Windows. Remove patch file for cutlass 3.0.0. 3. Sharded MoE implementation will come with another PR. Limitation: __CUDA_ARCH__ >= 700	2023-11-14 16:48:51 -08:00
PeixuanZuo	a62a500ae1	[ROCm] Update CK version (#17628 ) update ck version	2023-11-13 15:43:38 -08:00
Scott McKay	8d298f6f78	Fix xnnpack compile error on arm32 (#18291 ) ### Description Use a different march flag to work around what appears to be a clang issue. See https://github.com/tensorflow/tensorflow/issues/59970 for links to various relevant pieces of info/discussions.	2023-11-12 08:59:20 +10:00
Scott McKay	64c91d790b	Fix ability to use patch on Windows CI machines (#18356 ) ### Description Add 32-bit patch binary and infra to fall back to it. The Azure devops Windows CIs are missing patch.exe from their git install for some reason so the default `find_package(Patch)` fails as that is where it expects to find it. Remove Eigen patch. Underlying issue was fixed in source 3 years ago by `c6c84ed961` and the patch command is invalid (args are for git apply not patch). ### Motivation and Context Make usage of patch consistent across all CIs. Fix https://github.com/microsoft/onnxruntime/issues/15248	2023-11-11 07:32:14 +10:00
Bart Verhagen	87744e55fa	fix reference to Microsoft.GSL::GSL in CMake build scripts when enabling cuda (#17843 ) ### Description Some CMake scripts reference Microsoft.GSL::GSL. Most of the time, the GSL package that is found on the system is used. However, when cuda is enabled, it is downloaded and patched. Most CMake scripts rely on the first case and forget about the second. This patch makes the second case behave like the first case. ### Motivation and Context This is an issue that occurs 'in the wild'. For example, I had to patch this to be able to enable the CUDA provider for the onnxruntime conan package (see https://github.com/conan-io/conan-center-index/pull/20392).	2023-11-10 10:46:45 -08:00
Changming Sun	812532592e	Add a build validation for Linux ARM64 cross-compile (#18200 ) ### Description 1. Add a build validation for Linux ARM64/ARM32 cross-compile to catch issues listed in #18195 . 2. Revert eigen's commit id back to what we had before. ### Motivation and Context To catch cross-compile issues. Added a TODO item for fixing the compile warnings in Linux ARM32 build: AB#21639	2023-11-08 13:03:18 -08:00
Dmitri Smirnov	a37e6a503b	Update Abseil raw_flat_hash visualization (#18329 ) ### Description Fix the broken pieces due to the latest Abseil update. ### Motivation and Context Make the debugging bearable.	2023-11-08 11:19:45 -08:00
Wei-Sheng Chin	fb6737e893	Distributed Squeeze and Distributed Unsqueeze (#18269 ) Implement DistributedSqueeze & DistributedUnsqueeze for llama 2.	2023-11-06 20:11:35 -08:00
Yi Zhang	b7b8b5b2ce	Fix Eigen-3.4.0 URL and hash (#18290 ) ### Description Add CI changes for #18287 Install onnx explicitly to pass windows GPU+dml stage. ### Motivation and Context 'eigen-3.4' was referring to a branch, not to a tag. There is now an Eigen 3.4.1 on that branch, and thus the hash has changed. See https://github.com/microsoft/onnxruntime/issues/18286#issuecomment-1793683416	2023-11-06 09:19:51 -08:00
Chi Lo	dfafcb58aa	[TensorRT EP] Properly set CUDA_INCLUDE_DIR for onnx-tensorrt (#18274 ) https://github.com/microsoft/onnxruntime/pull/17468 The above PR didn't fully fix the issue for some environments. This PR fixes this.	2023-11-03 20:04:10 -07:00
Scott McKay	4f2096be38	Update XNNPACK to latest version (#18038 ) ### Description <!-- Describe your changes. --> Update XNNPACK to latest version - adds fp16 kernels and various other improvements - requires pthreadpool update as well Most code updates in the XNNPACK EP are to adjust to the new XNNPACK API - 'setup' is split into 'reshape' and 'setup' - some ops use a workspace buffer - copied workspace allocation from XNNPACK unit test code - some suffixes changed Added wrapper for XNNPACK caches to base XNNPACK EP kernel - simplifies usage - XNNPACK split out the code and weights caches, but the code cache isn't currently usable via the public API - we could use the internal types if we think it's required for performance reasons. non-trivial though as we'd need to propagate ifdef values from the XNNPACK build up to the ORT build. - using XNNPACK internals would also mean we would not be able to support using a pre-build XNNPACK package - not an issue currently Fixed opset registration for internal NHWC domain - was not being tied to the ONNX version, so nodes inserted by layout transformation had the incorrect opset - a number of other places needed updating once this issue was fixed Remove support for NCHW Resize from XNNPACK EP so it's NHWC only - we only supported NCHW for fp32, - doing so adds complexity in multiple places (XNNPACK EP kernel implementation, layout transformation and transpose optimization) - unclear if that complexity provides any benefit. can add back if required by production scenario ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> We're looking at enabling fp16 support for CoreML and NNAPI. If we do that we need a good fallback story if the CPU EP will be used. The XNNPACK fp16 kernels will hopefully provide that. NOTE: This PR doesn't add fp16 support to the XNNPACK EP kernels. 
That can be done as required in separate EPs and should be relatively simple to do.	2023-11-03 09:04:28 -07:00
Scott McKay	016b75260b	Pre-link when creating static library for apple framework (#18241 ) ### Description Pre-link with `ld -r` to apply symbol visibility when the static library is created to replicate XCode's Single Object Pre-link. Current builds set the visibility flags but that doesn't get applied until the static library is linked into something else, which can be too late. Pre-linking fixes this. The pre-link uses the .o files from the ORT static libraries and the .a files from external libraries. This combination limits the symbols included from the .a files to things required by the ORT .o files. In order to minimize changes elsewhere in the build we extract the .o files from the ORT static libraries using `ar -x`. Re-ordered the pieces used to build the Apple framework to make it a little more readable. Fixed a couple of misc issues with missing symbols from the minimal build that show up when pre-linking is applied. ### Motivation and Context Will hopefully address #17722	2023-11-03 23:38:29 +10:00
aciddelgado	178f7caaeb	GQA Memory Efficient Kernel (#17920 ) Implement Cutlass Memory Efficient Attention Kernel into Group Query Attention Operator. ### Motivation and Context Before this change, Group Query Attention Operator was supported only by Flash-Attention. While this is the most efficient kernel for the operation, it only supports sm >= 80. Cutlass Memory Efficient Attention Kernel supports sm >= 53, allowing us to support a broader range of GPU hardware.	2023-11-01 20:04:22 -07:00
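The compute-capability thresholds in this commit can be read as a simple fallback chain. The sketch below only illustrates that reading; `pick_gqa_kernel` is a hypothetical name, and the real operator may weigh other factors (data type, build flags, environment settings) that this excerpt does not mention.

```python
def pick_gqa_kernel(sm: int) -> str:
    """Choose an attention kernel from the GPU compute capability (sm)."""
    if sm >= 80:
        # Flash-Attention is described as the most efficient option.
        return "flash_attention"
    if sm >= 53:
        # Cutlass memory efficient attention covers older GPUs.
        return "memory_efficient_attention"
    return "unsupported"


for sm in (86, 75, 50):
    print(sm, pick_gqa_kernel(sm))
```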
Wei-Sheng Chin	9e8ad39847	Distributed Reduction (#18206 ) This PR implements distributed reduction for llama 2. This version doesn't consider any cases requiring re-sharding because we haven't seen any use cases. Intuitive examples: - [supported] [2,4,6]-tensor with spec=RRS[0] and device_mesh=[0,1] -> Reduce(axes=[0]) -> [1,4,6]-tensor with spec=RRS[0] and device_mesh=[0,1] - [supported] [2,4,6]-tensor with spec=RRS[0] and device_mesh=[0,1] -> Reduce(axes=[1]) -> [2,1,6]-tensor with spec=RRS[0] and device_mesh=[0,1] - [not supported] [2,4,6]-tensor with spec=RRS[0] and device_mesh=[0,1] -> Reduce(axes=[2]) -> [2,4,1]-tensor with spec=RRS[0] and device_mesh=[0,1] Algorithm: When the reduced axes are not sharded, each device can call reduction directly. The output sharding spec will be identical to the input sharding spec. We currently throw when input and output sharding specs are different. Review guideline: - Check 97b8d2f for the new op's schema and how the new op is registered. - Read tests in 2450f93 to get familiar with the behavior of these ops. - Check the implementation details in 753d9af.	2023-11-01 08:49:33 -07:00
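The supported and unsupported cases listed in this commit follow from one rule: a reduction is local only when none of the reduced axes is sharded. A shape-level sketch of that rule is given below. `parse_spec` and `distributed_reduce` are hypothetical helpers, the reduction keeps reduced dimensions (as in the examples), and no tensor data or communication is modeled.

```python
import re


def parse_spec(spec):
    """'RRS[0]' -> [None, None, 0]; None marks a replicated tensor axis."""
    return [int(g) if g else None for g in re.findall(r"R|S\[(\d+)\]", spec)]


def distributed_reduce(shape, spec, axes):
    """Return (output logical shape, output spec) for a local reduction."""
    mesh_axes = parse_spec(spec)
    sharded = [a for a in axes if mesh_axes[a] is not None]
    if sharded:
        # Reducing a sharded axis would need cross-device communication
        # or re-sharding, which this version of the op does not handle.
        raise NotImplementedError(f"reduced axes {sharded} are sharded")
    out_shape = [1 if i in axes else d for i, d in enumerate(shape)]
    return out_shape, spec  # spec is unchanged


print(distributed_reduce([2, 4, 6], "RRS[0]", [0]))  # ([1, 4, 6], 'RRS[0]')
print(distributed_reduce([2, 4, 6], "RRS[0]", [1]))  # ([2, 1, 6], 'RRS[0]')
```

Calling it with `axes=[2]` raises, matching the "not supported" example.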
Preetha Veeramalai	d87216bcb1	Openvino ep ort 23.1 (#17911 ) ### Description Integration to OpenVINO 2023.1 ### Motivation and Context - Alignment with latest OpenVINO Version. - Device name change from VPUX to NPU and Remove from supported list until official public support is available. --------- Co-authored-by: Sahar Fatima <sfatima.3001@gmail.com> Co-authored-by: Saurabh Kale <saurabh1.kale@intel.com> Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com> Co-authored-by: sfatimar <sahar.fatima@intel.com>	2023-11-01 08:39:39 -07:00
liqun Fu	20f2dd8b6b	use onnx rel-1.15.0, update cgman, cmake/external and requirement hash (#18177 )	2023-10-31 14:58:21 -07:00
Wei-Sheng Chin	24f9c1afe3	Distributed Expand (#18126 ) This PR implements DistributedExpand for llama 2. Representative Examples of DistributedExpand: - [shard on non-expanded axis] `input tensor (shape=[8, 1], spec=S[0]R, device_mesh=[0,1]) -> Expand(target_shape=[8, 2]) -> output tensor (shape=[8, 2], spec=S[0]R, device_mesh=[0,1])` - [sharding expanded axis is invalid since it must have dim=1 and axis with dim=1 cannot be sharded] `input tensor (shape=[1, 8], spec=S[0]R, device_mesh=[0,1]) -> Expand(target_shape=[2, 8]) -> output tensor (shape=[2, 8], spec=S[0]R, device_mesh=[0,1])` From those examples, we observe a few important behaviors. - The output sharding spec is always the same as the input sharding spec. - Expanding always happens on an axis with dimension=1. Otherwise, it will violate the broadcasting rule. - No communication is needed since all computation can happen locally. Let's consider the first example again. If you put the first half tensor (shape: [4, 1]) on device 0 and the second half (shape: [4, 1]) on device 1, then `Expand` them with target shape [4, 2], these two local tensors (shape: [4, 2]) are exactly the same as the ones described by the output sharding spec. Algorithm: - Compute logical (i.e., unsharded) shapes of input and output. - Compute sharded output shape from logical output. - Call Expand to broadcast local input to sharded output shape. How to review? - Start with [changes in onnxruntime_test_distributed.py](`ea33392f37`). Those tests are good examples for using this op. - [Read expand.h/expand.cc](`e4c49987f5`). Those changes are for exposing functionalities in Expand to DistributedExpand. - Read distributed_expand.h/distributed_expand.cc. It follows the algorithm described above. The commit `68ac301bba` first sketches the definition of DistributedExpand. The next commit `0eb9330c3b` adds real implementation.	2023-10-28 00:44:02 -07:00
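The three-step algorithm in this commit can be sketched at the shape level. The helpers below (`parse_spec`, `local_shape`, `distributed_expand_shapes`) are hypothetical and only compute shapes; they assume each sharded dimension divides evenly across its mesh axis and that input and target have the same rank.

```python
import re


def parse_spec(spec):
    """'S[0]R' -> [0, None]; None marks a replicated tensor axis."""
    return [int(g) if g else None for g in re.findall(r"R|S\[(\d+)\]", spec)]


def local_shape(logical_shape, spec, mesh_shape):
    """Per-device shape of a tensor with the given logical shape and spec."""
    out = []
    for dim, mesh_axis in zip(logical_shape, parse_spec(spec)):
        out.append(dim if mesh_axis is None else dim // mesh_shape[mesh_axis])
    return out


def distributed_expand_shapes(input_shape, target_shape, spec, mesh_shape):
    """Return (local input shape, local output shape) for a local Expand."""
    for i, mesh_axis in enumerate(parse_spec(spec)):
        # An expanded axis has input dim 1, and such an axis cannot be sharded.
        if input_shape[i] == 1 and target_shape[i] != 1 and mesh_axis is not None:
            raise ValueError(f"axis {i} is expanded but sharded")
    return (local_shape(input_shape, spec, mesh_shape),
            local_shape(target_shape, spec, mesh_shape))


# First example from the commit: each of 2 devices expands [4, 1] to [4, 2].
print(distributed_expand_shapes([8, 1], [8, 2], "S[0]R", [2]))
```

The second example from the commit, `[1, 8]` with spec `S[0]R`, is rejected by the check because the expanded axis is also the sharded one.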
Xavier Dupré	b5f242e978	GemmFloat8 as a contrib ops (#16051 ) ### Description Add support for Gemm with float 8 as a contrib op. --------- Co-authored-by: Randy Shuai <rashuai@microsoft.com> Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com> Co-authored-by: Scott McKay <Scott.McKay@microsoft.com> Co-authored-by: Xavier Dupre <xadupre@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>	2023-10-27 14:33:55 +02:00
Wei-Sheng Chin	9c32310673	Distributed Reshape Implementation (#18068 ) This DistributedReshape aims at supporting all sharding patterns encountered in llama 2. All patterns found are tested in `TestDistributedReshape` in `onnxruntime_test_distributed.py`. This PR implements algorithms to compute the categories below. - All inputs and outputs are replica, so it's computed like a normal Reshape. - Two-axis fusion (if any of the inputs and outputs are sharded). This category covers, e.g., `[batch, seq, hidden] -> [batch x seq, hidden]`. - Two-axis decomposition (if any of the inputs and outputs are sharded). This category covers, e.g., `[batch x seq, hidden] -> [batch, seq, hidden]`. Review guideline: - Ignore the changes in sharding_spec.h and sharding_spec.cc since they come from another PR #18025. - First, read onnxruntime_test_distributed.py to get familiar with the input/output of DistributedReshape. - Second, check the new APIs in reshape.h/reshape.cc to expose CUDA Reshape kernel to DistributedReshape. - For DistributedReshape, check its `ComputeInternal` for the 3 categories mentioned above.	2023-10-26 22:33:42 -07:00
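To make the two sharded categories concrete, the sketch below classifies a reshape purely from its logical shapes. `classify_reshape` is a hypothetical helper for illustration; it ignores sharding specs entirely and is not how `ComputeInternal` decides between the categories.

```python
def classify_reshape(src, dst):
    """Return 'fusion', 'decomposition', or 'other' for a logical reshape."""

    def fuses(longer, shorter):
        # True if merging one adjacent pair of axes in `longer` gives `shorter`.
        return any(
            longer[:i] + [longer[i] * longer[i + 1]] + longer[i + 2:] == shorter
            for i in range(len(longer) - 1)
        )

    src, dst = list(src), list(dst)
    if len(src) == len(dst) + 1 and fuses(src, dst):
        return "fusion"          # e.g. [batch, seq, hidden] -> [batch x seq, hidden]
    if len(dst) == len(src) + 1 and fuses(dst, src):
        return "decomposition"   # e.g. [batch x seq, hidden] -> [batch, seq, hidden]
    return "other"


print(classify_reshape([2, 3, 4], [6, 4]))  # fusion
print(classify_reshape([6, 4], [2, 3, 4]))  # decomposition
print(classify_reshape([6, 4], [4, 6]))     # other
```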
Vincent Wang	b7408f7389	[ORTModule] ATen Efficient Attention and Triton Flash Attention (#17959 ) This PR is to support efficient attention and flash attention in ORTModule, including: - Use ATen to call efficient attention, which requires PyTorch 2.2.0 dev or newer. ORTMODULE_USE_EFFICIENT_ATTENTION=1 to enable. - Integrate Triton Flash attention, which requires triton==2.0.0.dev20221202. Need A100 or H100. ORTMODULE_USE_FLASH_ATTENTION=1 to enable. - A python transformer tool to match sub-graph by config and write transformer quickly. The current transformers support attention mask for both efficient attn and flash attn, and dropout for efficient attn only. To support more training scenarios (such as causal mask in GPT2), more transformers need to be added. The feature is guarded by system environment variables, so it won't affect any current behavior if not enabled. Since it requires specific PyTorch/Triton versions, related tests are not added for now.	2023-10-27 10:29:27 +08:00
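Both features in this commit are opt-in through environment variables. A minimal sketch of switching them on from Python is shown below; `attention_env` is a hypothetical helper, and the assumption that the variables must be set before the model is wrapped by ORTModule is not stated in the commit.

```python
import os


def attention_env(efficient=False, flash=False):
    """Build the environment settings named in the commit message."""
    env = {}
    if efficient:
        env["ORTMODULE_USE_EFFICIENT_ATTENTION"] = "1"
    if flash:
        env["ORTMODULE_USE_FLASH_ATTENTION"] = "1"
    return env


# Assumption: apply the settings before the training module is created.
os.environ.update(attention_env(efficient=True))
print(os.environ["ORTMODULE_USE_EFFICIENT_ATTENTION"])
```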
Chi Lo	455a9ce614	[TensorRT EP] Use latest onnx-tensorrt parser (#18067 ) Use latest onnx-tensorrt to fix compile error. Please see the issue https://github.com/microsoft/onnxruntime/issues/18029	2023-10-26 13:55:12 -07:00
Jambay Kinley	d30d4d372a	Add MatMul FP4 and NF4 Support (#18066 ) ### Description Add a contrib op MatMulBnb4 (FP4 and NF4) and related toolchain to support quantization on weight. This PR adds: - schema for contrib op MatMulBnb4 which can support FP4 (4-bit floating point) and NF4 (4-bit NormalFloat) quantization on weight. - a naive implementation for MatMulBnb4 on CPU and GPU, i.e., implemented like MatMul(A, Dequantize(B)). - a special implementation for GemV for MatMulBnb4 and related benchmark tool. - tool to quantize model to FP4 or NF4.	2023-10-25 15:34:58 -07:00
snadampal	d88d52eead	[aarch64] Remove mmla kernel support from apple (#18082 ) ### Description The mmla kernels require additional ISA flags and are currently supported only on Linux. ### Motivation and Context More context is in https://github.com/microsoft/onnxruntime/pull/15270 cc: @skottmckay , @chenfucn , @snnn	2023-10-25 11:34:57 -07:00
snadampal	780ee186d7	[aarch64] Implement QGEMM kernels with UMMLA/SMMLA instructions (#17160 ) ### Description This PR adds UMMLA and SMMLA based QGEMM kernels for aarch64. This covers (i) symmetric quantization (zero point is zero) (ii) asymmetric quantization (zero point is non-zero) (iii) per channel as well as per tensor quantization (iv) Signed weights (U8S8 Gemm) (v) Unsigned weights (U8U8 Gemm) and (vi) Signed activations and weights (S8S8 Gemm) scenarios. I've enabled the ummla/smmla kernels based on a cpuinfo check for `I8MM` support. MMLA QGEMM kernels are enabled for all the devices that support I8MM instructions. ### Motivation and Context This is to improve INT8 quantized MatMul performance on aarch64 platform. I have run the below benchmarking script (bert, roberta and gpt2 model inference) on AWS Graviton3 based c7g.4xl instance and observed up to 1.33x performance improvement compared to the optimized UDOT qgemm kernel performance. ``` cd onnxruntime/python/tools/transformers python3 benchmark.py ``` I have also run the unit tests, and made sure all are passing ``` ./build.sh --config RelWithDebInfo --build_shared_lib --parallel --compile_no_warning_as_error --skip_submodule_sync ```	2023-10-24 07:49:04 +10:00
liqun Fu	020824ed50	Update ONNX to 1.15.0rc1 (#17914 )	2023-10-20 15:08:25 -07:00
Hariharan Seshadri	9356986730	Fix AMD builds and enable testing NHWC CUDA ops in one GPU CI (#17972 ) ### Description This PR: (1) Fixes AMD builds after #17200 broke them (Need to remember to run AMD builds while trying to merge external CUDA PRs next time) (2) Turn on the NHWC CUDA feature in the Linux GPU CI. The extra time spent in building a few more files and running a few more tests will not be much. Test Linux GPU CI run : https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1170770 ### Motivation and Context Keep the NHWC CUDA ops tested (https://github.com/microsoft/onnxruntime/pull/17200) and guard against regressions	2023-10-17 09:23:52 -07:00
Maximilian Müller	7c17e33c07	Make CUDA a NHWC EP (#17200 ) ### Description CUDA inference speed heavily relies on Tensor Cores. To have tensor cores achieve the optimal throughput they require the data layout to be NHWC rather than NCHW. ### Motivation and Context Especially for convolutional networks this is very important. I will illustrate this using a very simple network: ``` import torch import torch.nn as nn class Net1(nn.Module): def __init__(self): super(Net1, self).__init__() # 1 input image channel, 6 output channels, 5x5 square convolution # kernel self.m = nn.ModuleList([ nn.Conv2d(in_channels=8, out_channels=32, kernel_size=5, stride=1), nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, stride=1), nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, stride=1), nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, stride=1, bias=False), nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, stride=1, bias=False), ]) def forward(self, x): for module in self.m: x = module(x) return x if __name__ == "__main__": dtype = torch.half device = "cuda" dummy_input = torch.randn(8, 8, 512, 512, dtype=dtype, device=device) model = Net1().to(dtype=dtype, device=device) input_names = ["input1"] output_names = ["output1"] torch.onnx.export(model, dummy_input, "test.onnx", input_names=input_names, output_names=output_names) ``` I profiled the launch of `./build/RelWithDebInfo/onnxruntime_perf_test -e cuda -I -q -t 5 test.onnx` using sys and nvtx ranges. Current master launches below kernels: ![image](https://github.com/microsoft/onnxruntime/assets/44298237/81655fce-0f8e-4f78-9335-b858a8c8977b) If I add the introduced `-l` flag we see below kernels: ![image](https://github.com/microsoft/onnxruntime/assets/44298237/fceb5d6f-c12d-442b-b15a-948797630008) Notice the missing NCHW<>NHWC kernels per operation. The layout optimizer introduced a transpose op as first and last op of the whole network. 
The `op_generic_tensor_kernel` shows the bias used which should also be optimized out next. Measured across some very basic models:

| CUDA EP | NCHW [ms] | NHWC [ms] | Speedup |
|:------------------------|----------------:|-------------------:|--------:|
| | -e cuda -t 5 -q | -e cuda -t 5 -q -l | |
| resnet101-v2-7_bs8_fp16 | 18.33 | 13.07 | 1.4 |
| resnet101-v2-7_bs8 | 21.8 | 12.06 | 1.81 |
| test | 102.07 | 73.62 | 1.39 |

Average speedup: 1.53 ## Outlook Next the mission will be to first write a templated unit test to check for correctness of NHWC vs NCHW ops. After that we have to transition more ops to measure perf improvements on a broader range of models. Currently this is not easily possible as we do not support all ops in the NHWC domain. --------- Co-authored-by: Tianlei Wu <tlwu@microsoft.com>	2023-10-16 10:16:37 -07:00
Chi Lo	8abaa7b753	[TensorRT EP] Fix cmake install (#17923 ) We removed tensorrt_provider_factory.h in the [PR](https://github.com/microsoft/onnxruntime/pull/17617). Need to remove the copy of this file when cmake install.	2023-10-16 09:16:24 -07:00
Yufeng Li	11af34440a	Add MatMul 4bits support on GPU (#17890 ) ### Description Add a contrib op MatMulNBits and related toolchain to support quantization on weight. This PR only adds support for 4bits. It adds: - a schema for contrib op MatMulNBits which can support 1-7 bits quantization on weight. - a naive implementation for 4bits MatMulNBits on CPU and GPU, i.e., implemented like MatMul(A, Dequantize(B)). - a special implementation for GemV for 4bits MatMulNBits and related benchmark tool. - a tool to quantize a model to 4 bits. Next: - add general and more efficient kernels for 4bits MatMulNBits on CPU and GPU	2023-10-13 16:55:30 -07:00
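The "naive implementation" in this commit is described as MatMul(A, Dequantize(B)). The pure-Python sketch below illustrates that idea with a generic block-wise 4-bit affine quantizer. The scale and zero-point scheme, the unpacked storage, and all function names are assumptions made for illustration; they are not the MatMulNBits schema or its packed weight layout.

```python
def quantize_block(values, bits=4):
    """Affine-quantize one block of floats to integers in [0, 2**bits - 1]."""
    qmax = (1 << bits) - 1
    lo = min(min(values), 0.0)  # include 0 so the zero point stays in range
    hi = max(max(values), 0.0)
    scale = (hi - lo) / qmax or 1.0
    zero_point = round(-lo / scale)
    q = [min(qmax, max(0, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize_block(q, scale, zero_point):
    """Map quantized integers back to approximate float values."""
    return [(x - zero_point) * scale for x in q]


def matmul_4bit(a, b_columns, block_size):
    """Compute A @ Dequantize(Quantize(B)); B is given as a list of columns."""
    deq_columns = []
    for col in b_columns:
        deq = []
        for start in range(0, len(col), block_size):
            block = col[start:start + block_size]
            deq += dequantize_block(*quantize_block(block))
        deq_columns.append(deq)
    return [[sum(x * w for x, w in zip(row, col)) for col in deq_columns]
            for row in a]


a = [[1.0, 2.0, 3.0, 4.0]]
b_columns = [[0.0, 0.5, 1.0, 7.5], [-1.0, 0.0, 1.0, 2.0]]
print(matmul_4bit(a, b_columns, block_size=4))  # close to [[34.0, 10.0]]
```

The example weights were chosen so that they land on the 4-bit grid; arbitrary weights would show a rounding error of up to half a scale step per element.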
Jeff Daily	07317316cc	CUDA EP vs ROCM EP hipify audit (#17776 ) Migrate most CUDA EP improvements and changes to ROCM EP. The process involves using hipify against all CUDA EP files (i.e. do not exclude any files from onnxruntime_rocm_hipify.cmake) then vimdiff compare them against the ROCM EP files that are under source control and pull in most changes. These changes include functional as well as formatting and makes comparing CUDA EP and ROCM EP easier, though it makes the PR diff somewhat less obvious due to formatting changes. - hipify audit of onnxruntime/core/providers/rocm, enable ops - Loop - Scan - hipify audit of onnxruntime/contrib_ops/rocm - fix contrib ops search implementation - enable more contrib ops - Affine - ComplexMul - ConvTransposeWithDynamicPads - Crop - DynamicSlice - FFT [Rfft, Irfft] - GreedySearch - ImageScaler - ParametricSoftplus - ScaledTanh - ThresholdRelu --------- Co-authored-by: cloudhan <cloudhan@outlook.com>	2023-10-13 10:13:53 +08:00
Tang, Cheng	ca8cab29cd	distributed slice (#17761 ) ### Description Support DistributedSlice kernel in Cuda EP. mainly support following cases: 1. input data is sharded or replica for all axes (including slice axes) 2. slice axes is sharded across different devices. starts / ends / steps sharded across different devices are not supported yet. --------- Co-authored-by: Wei-Sheng Chin <wschin@outlook.com> Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net> Co-authored-by: Cheng Tang <chenta@microsoft.com>	2023-10-12 14:28:00 -07:00
Maximilian Müller	74a8acf405	Set default value for NVCC threads (#17866 ) Without doing this CMake gives a miscellaneous error on windows when checking if NVCC is functional. It will be missing a number after `--threads`. Currently it is only possible to configure through the python build scripts and not CMake only configure - which is what I am usually doing through CLion.	2023-10-11 22:46:40 -07:00
Numfor Tiapo	b8f373b0ae	Add API for NPU Device Selection in the DML EP (#17612 ) Co-authored-by: Sheil Kumar <sheilk@microsoft.com>	2023-10-11 14:53:00 -07:00
pengwa	0e2782438a	Support inplace update for PythonOp/Grad (#17687 ) ### Support inplace update for PythonOp/Grad This PR is based on the branch of another PR, https://github.com/microsoft/onnxruntime/pull/17685, to make it easier to review. With https://github.com/microsoft/onnxruntime/pull/17685, all PythonOp inputs/outputs are assumed by default not to be inplaced. If, during a run, we find that an inplace update happened (by comparing each output's data address with all input data addresses), we add a clone before setting it as one of PythonOp/Grad's outputs. In this case the results are correct, but implicit copy overhead is introduced. This PR allows users to define an output-to-input reuse map that tells ORT which buffers are reused, avoiding such unnecessary copies.	2023-10-10 21:36:45 -07:00
Changming Sun	05ac9f6f2a	Split onnxruntime_providers.cmake to multiple (#17853 ) ### Description Split onnxruntime_providers.cmake to multiple files, for easier editing. No other change was made in this PR.	2023-10-09 20:33:44 -07:00
Baiju Meswani	9c716f4557	Add noexcep_operators to onnxruntime internal libraries (#17850 )	2023-10-09 16:29:41 -07:00
cloudhan	c2bd5b70b2	Fix enable_training and use_migraphx (#17827 )	2023-10-08 11:43:27 +08:00
MistEO	faf9a0f6c7	Fix runtime installation error (#17828 )	2023-10-07 11:50:02 -07:00
JiCheng	3878011ce2	Remove MPI dependency (#17624 ) ### Description Support multi-GPU launch without MPI.	2023-10-06 15:33:18 +08:00
George Wu	b306b02a86	[QNN EP] fixed input for InstanceNormU8 unit test and update copy lib paths (#17806 ) -update InstanceNormU8 with fixed input. With this input, it fails consistently using QNN 2.15.1 -update QNN lib paths (target is deprecated) and additionally copy V73 skel file	2023-10-05 22:17:15 -07:00
Justin Chu	be7541ef4a	[Linter] Bump ruff and remove pylint (#17797 ) Bump ruff version and remove pylint from the linter list. Fix any new error detected by ruff. ### Motivation and Context Ruff covers many of the pylint rules. Since pylint is not enabled in this repo and runs slow, we remove it from the linters	2023-10-05 21:07:33 -07:00
Wei-Sheng Chin	faef9c32fa	ONNX-Native Tensor Parallel: Using Distributed MatMul as Example (#17695 ) This PR introduces - New data structure to represent kernel-level (aka node-level or op-level) tensor sharding information. I consider it the foundation of ONNX distributed inference. - Building blocks for distributed kernel implementations, especially stateless implementations of communication ops. - Implementation of DistributedMatMul and its tests. Code structure: - sharding.h/.cc: Function to shard and reshard tensors (calling into NCCL). - sharding_spec.h/.cc: Representation of how a tensor is sharded. - distributed_matmul.h/.cc: Implementation of tensor parallel MatMul. Inputs and outputs are sharded across devices. - onnxruntime_test_distributed.py: distributed operator tests. Example of specifying sharding information ```python @onnxscript.script() def matmul_rs_sr_rr(tensor_x: FLOAT, tensor_w: FLOAT) -> FLOAT: # Run MatMul by sharding x along column axis and w along row axis on # 2 GPUs. return MICROSOFT_OPSET.DistributedMatMul( tensor_x, tensor_w, device_mesh_shape=[2], device_mesh_elements=[0, 1], input_shard_specs=["RS[0]", "S[0]R"], output_shard_specs=["RR"], ) onnx_model = matmul_rs_sr_rr.to_model_proto( input_types=[FLOAT[2, "s"], FLOAT["s", 2]], output_types=[FLOAT[2, 2]], ) ``` In this example, the device mesh can be visualized as a 1-D tensor, `[0, 1]`. The 2nd axis of `tensor_x` is sharded across `[0, 1]` (i.e., the 0-axis of the device mesh). Similarly, the 1st axis of `tensor_w` is sharded across `[0, 1]` as well. C++ classes to represent tensor sharding (copied from sharding_spec.h): ```cpp class DeviceMesh { public: // [Device Mesh and Tensor Sharding for Tensor Parallel] // Device mesh is a tensor of device indices. // A tensor can then be partitioned along specific mesh axes. // // Assume we have 4 GPUs indexed by 0, 1, 2, and 3. // Let's consider some examples. // 1. 1D device mesh [0, 1, 2, 3]. 
In this case, // device_mesh_shape is [4] and device_mesh_elements // is [0, 1, 2, 3]. // If we want to shard a 2-D tensor along its axis 1, the // corresponding sharding spec is a string "RS[0]". // 2. 2D device mesh [[0, 1], [2, 3]]. In this case, // device_mesh_shape is [2, 2] and device_mesh_elements // is [0, 1, 2, 3]. // If we want to shard a 2-D tensor's // rows along mesh axis 1 and // columns along mesh axis 0, the // corresponding sharding spec is a string "S[1]S[0]". // If that 2-D tensor's value is np.array([[5, 6], [7, 8]]), // GPU 0/1/2/3 owns 5/7/6/8. Below is a visualization of the sharding // process. // - Start with a 2-D device mesh [[0, 1], [2, 3]] and // a 2-D tensor [[5, 6], [7, 8]] // - GPU: [[0, 1], [2, 3]], Tensor: [[5, 6], [7, 8]] // - Split GPU mesh along axis 1 and tensor along // axis 0 for "S[1]" in "S[1]S[0]" // - GPU: [[0], [2]], Tensor: [[5, 6]] // GPU: [[1], [3]], Tensor: [[7, 8]] // - Split GPU mesh along axis 0 and tensor along // axis 1 for "S[0]" in "S[1]S[0]" // - GPU: [[0]], Tensor: [[5]] // - GPU: [[2]], Tensor: [[6]] // - GPU: [[1]], Tensor: [[7]] // - GPU: [[3]], Tensor: [[8]] // Actual shape of device mesh represented by `device_mesh_elements`. std::vector<int64_t> device_mesh_shape; // Flattened device mesh. std::vector<int64_t> device_mesh_elements; }; class AxisPartitionSpec { // [Device Mesh and Tensor Sharding for Tensor Parallel] // This class is the in-memory representation of // 1. if a tensor is sharded or not (aka replica), and // 2. which tensor axis is shard by which device mesh axis. // Let's consider sharding 2-D tensor along column axis on // device mesh [0, 1] as an example. // The required sharding spec RS[0] can be represented by // - AxisPartitionSpec(Condition::Replica, -1) // - AxisPartitionSpec(Condition::Shard, 0) public: // Status of a tensor axis. // A tensor axis can be either sharded or replicated // along a device mesh axis. 
enum class Condition { Replica, Shard }; // This field tells if a tensor axis is sharded or not. Condition cond; // If a tensor axis is sharded, this field tells which device // mesh axis to distribute the shards along. // If a tensor axis is not sharded, this field is ignored. int device_mesh_axis; // A helper to construct a replica spec for a tensor axis. static AxisPartitionSpec CreateReplica() { return AxisPartitionSpec(Condition::Replica, -1); } // A helper to construct a sharding spec for a tensor axis. // This tensor axis is sharded along `device_mesh_axis` in device mesh. static AxisPartitionSpec CreateShard(int device_mesh_axis) { return AxisPartitionSpec(Condition::Shard, device_mesh_axis); } }; class TensorPartitionSpec { // [Device Mesh and Tensor Sharding for Tensor Parallel] // TensorPartitionSpec holds a collection of AxisPartitionSpec and an // associated DeviceMesh. It is responsible for determining how a tensor // should be partitioned across a device mesh. // // Example 1: RS[0] // In this scenario, `axis_specs` would contain two `AxisPartitionSpec` objects. // - The first object is a Replica, denoting that the first axis of the tensor is // not sharded but is instead replicated. // - The second object is a Shard along the 0-th axis of the device mesh. It denotes // that the second axis of the tensor is sharded along the first axis of the // device mesh. // // Example 2: S[0]RR // In this scenario, `axis_specs` would contain three `AxisPartitionSpec` objects. // - The first object is a Shard along the 0-th axis of the device mesh, indicating // that the first axis of the tensor is sharded along the first axis of the // device mesh. // - The second and third objects are Replicas, indicating that the second and third // axes of the tensor are not sharded but are instead replicated. public: // axis_specs[i]: AxisPartitionSpec for tensor axis i. For a 2-D tensor, // axis_specs[0] is for row axis and axis_specs[1] is for // column axis. 
axis_specs[i].device_mesh_axis = j means that // tensor axis i is sharded along device mesh axis j. std::vector<AxisPartitionSpec> axis_specs; // device_mesh: DeviceMesh for sharding the associated tensor. // Read [Device Mesh and Tensor Sharding for Tensor Parallel] in DeviceMesh's comment. DeviceMesh device_mesh; }; ```	2023-10-05 14:22:25 -07:00
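The sharding-spec strings described in the DistributedMatMul commit above ("RS[0]", "S[1]S[0]", and so on) can be illustrated with a small NumPy sketch. This is only an illustration of the semantics stated in the `DeviceMesh` comment, not the ONNX Runtime implementation: the helper names `parse_spec` and `local_shard` are hypothetical, and the sketch assumes each sharded tensor axis divides evenly across its mesh axis.

```python
import re

import numpy as np


def parse_spec(spec):
    """Parse a spec such as "RS[0]" into one entry per tensor axis.

    None means the axis is replicated; an int is the device mesh axis
    the tensor axis is sharded along. (Hypothetical helper.)
    """
    if not re.fullmatch(r"(?:R|S\[\d+\])+", spec):
        raise ValueError(f"invalid sharding spec: {spec!r}")
    return [int(m) if m else None for m in re.findall(r"R|S\[(\d+)\]", spec)]


def local_shard(tensor, spec, mesh_shape, mesh_elements, device_id):
    """Return the piece of `tensor` owned by `device_id`. (Hypothetical helper.)"""
    mesh = np.array(mesh_elements).reshape(mesh_shape)
    # Position of this device inside the device mesh.
    coords = np.argwhere(mesh == device_id)[0]
    shard = tensor
    for tensor_axis, mesh_axis in enumerate(parse_spec(spec)):
        if mesh_axis is None:
            continue  # replicated axis: every device keeps it whole
        pieces = np.split(shard, mesh.shape[mesh_axis], axis=tensor_axis)
        shard = pieces[coords[mesh_axis]]
    return shard


if __name__ == "__main__":
    tensor = np.array([[5, 6], [7, 8]])
    for device in range(4):
        piece = local_shard(tensor, "S[1]S[0]", [2, 2], [0, 1, 2, 3], device)
        # Reproduces the 5/7/6/8 ownership listed in the DeviceMesh comment.
        print(f"GPU {device} owns {piece.tolist()}")
```

Walking the tensor axes in order and splitting along the mesh axis named by each `S[j]` mirrors the step-by-step visualization in the `DeviceMesh` comment.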
Edward Chen	1bc115719c	Unify handling of public headers in onnxruntime.cmake. (#17779 ) The changes in PR #8919 overwrote the PUBLIC_HEADER property value of the `onnxruntime` target with a list that did not include EP-specific headers. We should probably be using a consistent set of header files across packages anyway.	2023-10-04 08:55:08 -07:00
Changming Sun	14d349e290	Enable backtrace in unit tests (#17655 ) ### Description Google test can be built either with absl/re2 or not. This PR enables the build option so that the google test framework can print out a nice stacktrace when something goes wrong. It helps locate test errors in CI build pipelines. Also, Google test will remove the build option and make it always ON. So sooner or later we must make this change.	2023-09-29 12:32:56 -07:00
MistEO	870b0bc305	Fix typo of cmake (#17715 ) This caused a cmake configuration error.	2023-09-27 11:48:46 -07:00
Mustafa Ateş Uzun	13b0f8a6ce	fix: supported typo (#17216 )	2023-09-27 10:45:27 -07:00
liqun Fu	2be4dc6d04	ONNX 1.15 integration (#17125 ) ### Description this is for ORT 1.17.0 - make ORT to use ONNX release 1.15.0 branch. Eventually will update to the release tag once ONNX 1.15.0 is released ### Motivation and Context Prepare for ORT 1.17.0 release. People can start work on new and updated ONNX ops in ORT. --------- Signed-off-by: Liqun Fu <liqfu@microsoft.com>	2023-09-26 14:44:48 -07:00
Jian Chen	0141e27ca1	Enabling c++ 20 in MacOS build (#16187 )	2023-09-26 11:27:02 -07:00
Tianlei Wu	730fab3050	Refactor Attention cuda kernel (#17578 ) * Break QkvToContext into small functions. Each fused and unfused kernel will have separated function. * Move DecoderAttention kernel to separated file * Move KV cache related kernel to attention_kv_cache.cu ### Motivation and Context To make the code easier to maintain.	2023-09-19 09:49:21 -07:00
Tianlei Wu	adb0be45d3	Refactoring of attention cuda kernel: move prepare qkv and concat_past_to_present (#17559 ) To avoid a huge cu file and make code more readable: - Move PrepareQKV to separate cu file (attention_prepare_qkv.cu) - Move ConcatPastToPresent to attention_concat.cu - Add default value for AttentionData - Add a data structure QkvData to track Q, K and V pointers and track QKV format.	2023-09-15 10:57:29 -07:00
Changming Sun	5af6279440	Fix Android build (#17540 ) ### Description The new cpuinfo library doesn't use clog on Android. Newer XNNPack versions have removed the dependency on clog, but the one we use still has it. So I cherry-picked the XNNPack change into our patch file.	2023-09-14 07:36:01 -07:00
Changming Sun	24a3c740c0	Revert "[ROCm][MIGraphX] for googletest dep, set OVERRIDE_FIND_PACKAGE (#16715 )" (#17523 ) This reverts commit `bb136f86c8`, then re-implements it in a different way. I reverted the original change, then added a version constraint to the find_package args. If you still find that it picks up the wrong gtest version after this change, you may disable `find_package` by setting 'FETCHCONTENT_TRY_FIND_PACKAGE_MODE' to NEVER. For example, the latest gtest version is 1.14.0. If at a later time Google releases a new version of gtest that is incompatible with the ONNX Runtime source code you have today, your dev environment already has the new version installed, and you do not want to create a new clean build environment without the package, you can add `--cmake_extra_defines FETCHCONTENT_TRY_FIND_PACKAGE_MODE=NEVER` to your build command to solve the problem.	2023-09-12 22:39:31 -07:00
Chi Lo	b827ab0efc	[TRT EP] Fix build error for building oss onnx-tensorrt parser (#17468 ) If building ORT TRT with `--use_tensorrt_oss_parser` (meaning ORT will include the [oss onnx-tensorrt parser](https://github.com/onnx/onnx-tensorrt/blob/main/CMakeLists.txt#L82) and build it from source), the cmake CUDA_INCLUDE_DIR variable is needed. If not, you will encounter the following [build error](https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1133937&view=logs&j=7536d2cd-87d4-54fe-4891-bfbbf2741d83&t=39e3f98f-7fe5-578c-20bd-5ae5a4590bda): CMake Error: The following variables are used in this project, but they are set to NOTFOUND. Please set them or make sure they are set and tested correctly in the CMake files: /build/Release/_deps/onnx_tensorrt-src/CUDA_INCLUDE_DIR Note: Not quite sure why this issue was not hit in the past, when CI still tested with the oss parser. Probably CUDA_INCLUDE_DIR was defined somewhere back then.	2023-09-08 20:34:57 -07:00
Caroline Zhu	dcc93909b4	Add training WASM generation to Web CI pipeline (#17319 ) ### Description [Successful pipeline run](https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1123141&view=results) Added flag to build the training artifacts & updated the pull-wasm-artifacts script to pull the training artifacts as well. Bundled into this PR are minor formatting fixes + naming fixes. ### Motivation and Context [This PR](https://github.com/microsoft/onnxruntime/pull/16521) extended the WASM API wrapper to build training WASM artifacts as well. The ORT training WASM artifacts are required to support ORT training web bindings.	2023-09-08 15:49:47 -07:00
Changming Sun	bc84f52633	Update C/C++ dependencies: abseil, date, nsync, googletest, wil, mp11, cpuinfo and safeint (#15470 ) ### Description Update C/C++ dependencies abseil, date, nsync, googletest, wil, mp11, cpuinfo and safeint to newer versions per request of @mayeut. He created the following PRs to update the deps: https://github.com/microsoft/onnxruntime/pull/15432 https://github.com/microsoft/onnxruntime/pull/15434 https://github.com/microsoft/onnxruntime/pull/15435 https://github.com/microsoft/onnxruntime/pull/15436 https://github.com/microsoft/onnxruntime/pull/15437 However, our build system needs to fetch the dependencies from an internal mirror that only Microsoft employees have write access to. So I closed his PRs and created this one. This PR also updates abseil to a newer version. This is to prepare for upgrading re2.	2023-09-08 13:35:04 -07:00
Yulong Wang	110a2d0b73	[build][wasm] add js_internal_api.js to link dependency (#17407 ) ### Description add js_internal_api.js to link dependency. Now changes to js_internal_api.js will correctly trigger re-link of ort-wasm.wasm	2023-09-05 20:40:40 -07:00
Changming Sun	c6b0d185b4	Update cmake to 3.27 and upgrade Linux CUDA docker files from CentOS7 to UBI8 (#16856 ) ### Description 1. Update docker files and their build instructions. ARM64 and x86_64 can use the same docker file. 2. Upgrade Linux CUDA pipeline's base docker image from CentOS7 to UBI8 AB#18990	2023-09-05 18:12:10 -07:00
Lennart Hannink	e3bb2a0cdd	Fix git working dir for ORT_BUILD_INFO (fixes #17197 ) (#17198 ) ### Description Git commands producing `git-commit-id` and `git-branch` are always run in `CMAKE_CURRENT_SOURCE_DIR` (i.e. `onnxruntime/cmake`) ### Motivation and Context Please refer to corresponding issue [#17197](https://github.com/microsoft/onnxruntime/issues/17197).	2023-09-05 09:20:49 -07:00
cloudhan	6ea3908db4	Add ck's streamk and splitk gemm impl (#17280 )	2023-09-04 11:49:07 +08:00
aciddelgado	44101e8771	Flash Attention v2 MHA (#17227 ) ### Description Integrate Flash Attention V2 into the PackedMultiHeadAttention, MultiHeadAttention and Attention operators. Flash Attention v2 source code is from https://github.com/Dao-AILab/flash-attention/tree/main/csrc/flash_attn/src. We made some changes to remove the dependency on Torch, then removed backward and bfloat16 related code. Add benchmark script (see benchmark_mha.sh) to compare different attention kernels for MultiHeadAttention operator. Current limitations for Flash Attention in PackedMultiHeadAttention, MultiHeadAttention and Attention operators: * Relative Position Bias is not supported * Different hidden size for Q and V is not supported * Only float16 is supported * Padding/attention mask is not supported * For MultiHeadAttention, when there is past or present input, bias shall be provided to activate flash attention * For Attention, past or present inputs will deactivate flash attention * Causal is not supported Some limitations (like attention mask and causal) might be removed later. Currently, Flash Attention v2 only works on Linux. For Windows, we will enable it later with Cutlass 3.2. Two environment variables can be used for testing purposes: (1) `ORT_DISABLE_FLASH_ATTENTION` to disable flash attention. Default value is 0 (enable). Set it to "1" to disable it. (2) `ORT_MIN_SEQ_LEN_FLASH_ATTENTION_PACKED_QKV`. Default value is "513", which means that we only enable flash attention when sequence length is larger than 512 for packed QKV format. Set it to "0" if you want to use flash attention v2 whenever possible. ### Speedup The following result is from Standard_ND96amsr_A100_v4 VM (A100-SXM4-80GB GPU) using benchmark_mha.sh. The metric is TFLOPs per second for MultiHeadAttention operator. 
There are 3 input formats: * `Q,K,V` means separated inputs query, key and value of BxSxNH * `Q,KV` means packed KV, where key is 5D: BxSxNx2xH * `QKV` means packed QKV, where query is 5D: BxSxNx3xH Note that flash attention cannot use packed QKV format, so extra Transpose is needed. We found that TensorRT kernel is faster for sequence length <= 512 for packed QKV. The reason might be no transpose is needed for TensorRT kernel in this format. We also notice that, TensorRT kernel is faster for stable diffusion 512x512 image (see seq_len=4096, heads=8, head_dim=40 below), while flash attention v2 is faster for 1024x1024 image (see seq_len=16384, heads=8, head_dim=40 below). input format \| batch size \| sequence length \| heads \| head dim \| flash_v2 (TFLOPs/s) \| TensorRT (TFLOPs/s) \| Memory Efficient Attention (TFLOPs/s) -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- Q,K,V \| 32 \| 512 \| 64 \| 32 \| 78.1 \| 60.0 \| 39.3 Q,K,V \| 32 \| 512 \| 128 \| 16 \| 46.8 \| 44.1 \| 21.7 Q,K,V \| 16 \| 1024 \| 64 \| 32 \| 99.0 \| 72.8 \| 44.3 Q,K,V \| 16 \| 1024 \| 128 \| 16 \| 54.7 \| 49.2 \| 23.4 Q,K,V \| 8 \| 2048 \| 64 \| 32 \| 113.8 \| 81.2 \| 47.8 Q,K,V \| 8 \| 2048 \| 128 \| 16 \| 59.7 \| 51.9 \| 24.7 Q,K,V \| 4 \| 4096 \| 64 \| 32 \| 122.5 \| 85.6 \| 49.7 Q,K,V \| 4 \| 4096 \| 128 \| 16 \| 62.5 \| 53.3 \| 25.3 Q,K,V \| 2 \| 8192 \| 64 \| 32 \| 127.4 \| 87.5 \| 50.7 Q,K,V \| 2 \| 8192 \| 128 \| 16 \| 64.0 \| 54.2 \| 25.6 Q,K,V \| 1 \| 16384 \| 64 \| 32 \| 129.5 \| 91.0 \| 51.2 Q,K,V \| 1 \| 16384 \| 128 \| 16 \| 64.7 \| 54.5 \| 25.8 Q,K,V \| 1 \| 4096 \| 8 \| 40 \| 51.0 \| 43.6 \| 36.8 Q,K,V \| 1 \| 4096 \| 8 \| 80 \| 97.7 \| 77.0 \| 55.5 Q,K,V \| 1 \| 4096 \| 8 \| 160 \| 120.0 \| 39.7 \| 57.8 Q,K,V \| 4 \| 4096 \| 8 \| 40 \| 89.0 \| 84.4 \| 49.2 Q,K,V \| 4 \| 4096 \| 8 \| 80 \| 133.0 \| 92.2 \| 63.2 Q,K,V \| 4 \| 4096 \| 8 \| 160 \| 164.8 \| 42.7 \| 63.8 Q,K,V \| 1 \| 16384 \| 8 \| 40 \| 96.9 \| 91.3 \| 52.1 Q,K,V \| 1 \| 16384 \| 8 \| 80 \| 142.9 \| 101.5 \| 65.6 
Q,K,V \| 1 \| 16384 \| 8 \| 160 \| 177.4 \| 44.2 \| 65.7 Q,K,V \| 128 \| 128 \| 12 \| 64 \| 29.0 \| 26.9 \| 25.7 Q,K,V \| 64 \| 128 \| 12 \| 64 \| 23.1 \| 10.8 \| 21.3 Q,K,V \| 128 \| 384 \| 12 \| 64 \| 83.5 \| 60.8 \| 55.7 Q,K,V \| 64 \| 384 \| 12 \| 64 \| 72.6 \| 40.5 \| 52.8 Q,K,V \| 128 \| 512 \| 12 \| 64 \| 98.9 \| 77.9 \| 62.1 Q,K,V \| 64 \| 512 \| 12 \| 64 \| 94.7 \| 75.6 \| 60.4 Q,KV \| 32 \| 512 \| 64 \| 32 \| 85.9 \| 41.1 \| 41.1 Q,KV \| 32 \| 512 \| 128 \| 16 \| 47.1 \| 21.6 \| 21.6 Q,KV \| 16 \| 1024 \| 64 \| 32 \| 104.4 \| 45.8 \| 45.8 Q,KV \| 16 \| 1024 \| 128 \| 16 \| 54.7 \| 23.6 \| 23.6 Q,KV \| 8 \| 2048 \| 64 \| 32 \| 116.8 \| 48.5 \| 48.5 Q,KV \| 8 \| 2048 \| 128 \| 16 \| 59.8 \| 24.7 \| 24.7 Q,KV \| 4 \| 4096 \| 64 \| 32 \| 124.2 \| 50.1 \| 50.1 Q,KV \| 4 \| 4096 \| 128 \| 16 \| 62.6 \| 25.3 \| 25.3 Q,KV \| 2 \| 8192 \| 64 \| 32 \| 128.5 \| 50.8 \| 50.9 Q,KV \| 2 \| 8192 \| 128 \| 16 \| 64.1 \| 25.6 \| 25.6 Q,KV \| 1 \| 16384 \| 64 \| 32 \| 129.4 \| 51.2 \| 51.2 Q,KV \| 1 \| 16384 \| 128 \| 16 \| 64.8 \| 25.8 \| 25.8 Q,KV \| 1 \| 4096 \| 8 \| 40 \| 67.5 \| 37.7 \| 37.5 Q,KV \| 1 \| 4096 \| 8 \| 80 \| 101.3 \| 56.7 \| 56.6 Q,KV \| 1 \| 4096 \| 8 \| 160 \| 124.0 \| 58.6 \| 58.6 Q,KV \| 4 \| 4096 \| 8 \| 40 \| 90.8 \| 49.8 \| 49.8 Q,KV \| 4 \| 4096 \| 8 \| 80 \| 135.6 \| 63.8 \| 63.8 Q,KV \| 4 \| 4096 \| 8 \| 160 \| 166.3 \| 64.5 \| 64.5 Q,KV \| 1 \| 16384 \| 8 \| 40 \| 97.5 \| 52.3 \| 52.3 Q,KV \| 1 \| 16384 \| 8 \| 80 \| 143.5 \| 65.9 \| 65.8 Q,KV \| 1 \| 16384 \| 8 \| 160 \| 178.4 \| 65.9 \| 65.8 Q,KV \| 128 \| 128 \| 12 \| 64 \| 26.8 \| 48.1 \| 30.9 Q,KV \| 64 \| 128 \| 12 \| 64 \| 28.0 \| 38.9 \| 25.0 Q,KV \| 128 \| 384 \| 12 \| 64 \| 97.7 \| 61.1 \| 61.0 Q,KV \| 64 \| 384 \| 12 \| 64 \| 89.5 \| 57.8 \| 57.9 Q,KV \| 128 \| 512 \| 12 \| 64 \| 111.9 \| 66.7 \| 66.9 Q,KV \| 64 \| 512 \| 12 \| 64 \| 107.2 \| 64.9 \| 64.8 QKV \| 32 \| 512 \| 64 \| 32 \| 77.2 \| 84.7 \| 39.3 QKV \| 32 \| 512 \| 128 \| 16 \| 43.4 \| 53.1 \| 20.9 QKV \| 16 \| 1024 \| 
64 \| 32 \| 98.8 \| 87.4 \| 44.6 QKV \| 16 \| 1024 \| 128 \| 16 \| 52.0 \| 54.1 \| 23.2 QKV \| 8 \| 2048 \| 64 \| 32 \| 113.1 \| 89.0 \| 47.9 QKV \| 8 \| 2048 \| 128 \| 16 \| 58.2 \| 54.6 \| 24.5 QKV \| 4 \| 4096 \| 64 \| 32 \| 120.6 \| 89.7 \| 49.7 QKV \| 4 \| 4096 \| 128 \| 16 \| 61.7 \| 54.6 \| 25.2 QKV \| 2 \| 8192 \| 64 \| 32 \| 125.9 \| 89.5 \| 50.7 QKV \| 2 \| 8192 \| 128 \| 16 \| 63.6 \| 54.8 \| 25.5 QKV \| 1 \| 16384 \| 64 \| 32 \| 128.5 \| 92.0 \| 51.2 QKV \| 1 \| 16384 \| 128 \| 16 \| 64.6 \| 54.8 \| 25.7 QKV \| 1 \| 4096 \| 8 \| 40 \| 60.2 \| 69.8 \| 38.1 QKV \| 1 \| 4096 \| 8 \| 80 \| 101.6 \| 75.2 \| 56.7 QKV \| 1 \| 4096 \| 8 \| 160 \| 130.2 \| 41.2 \| 58.4 QKV \| 4 \| 4096 \| 8 \| 40 \| 90.6 \| 91.0 \| 49.5 QKV \| 4 \| 4096 \| 8 \| 80 \| 133.6 \| 98.1 \| 62.8 QKV \| 4 \| 4096 \| 8 \| 160 \| 165.3 \| 43.7 \| 63.9 QKV \| 1 \| 16384 \| 8 \| 40 \| 97.2 \| 92.8 \| 52.1 QKV \| 1 \| 16384 \| 8 \| 80 \| 143.0 \| 103.1 \| 65.6 QKV \| 1 \| 16384 \| 8 \| 160 \| 177.6 \| 44.5 \| 65.7 QKV \| 128 \| 128 \| 12 \| 64 \| 31.1 \| 65.9 \| 27.6 QKV \| 64 \| 128 \| 12 \| 64 \| 26.1 \| 49.8 \| 23.5 QKV \| 128 \| 384 \| 12 \| 64 \| 84.6 \| 88.5 \| 56.1 QKV \| 64 \| 384 \| 12 \| 64 \| 79.1 \| 80.3 \| 53.5 QKV \| 128 \| 512 \| 12 \| 64 \| 97.3 \| 114.2 \| 62.2 QKV \| 64 \| 512 \| 12 \| 64 \| 95.9 \| 110.7 \| 60.6 QKV \| 4 \| 2048 \| 32 \| 128 \| 125.26 \| 44.72 \| 78.15 QKV \| 4 \| 4096 \| 32 \| 128 \| 141.62 \| 46.29 \| 85.84 QKV \| 8 \| 2048 \| 32 \| 128 \| 127.40 \| 45.49 \| 78.75 QKV \| 8 \| 4096 \| 32 \| 128 \| 144.24 \| 46.60 \| 86.95 ### Known Issues NVCC uses a large amount of memory while compiling the flash attention CUDA kernel. A Linux build with CUDA might fail when the machine has limited memory and a large number of CPUs. The workaround is to use a build machine with more memory, or to use an argument like `--nvcc_threads 1` to limit nvcc threads in the build. ### Motivation and Context Increases speed and efficiency of MHA or Packed MHA. 
--------- Co-authored-by: Tianlei Wu <tlwu@microsoft.com> Co-authored-by: tlwu@microsoft.com <tlwu@a100.crj0ad2y1kku1j4yxl4sj10o4e.gx.internal.cloudapp.net>	2023-08-31 13:52:21 -07:00
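The three input formats named in the Flash Attention v2 commit above (`Q,K,V` as BxSxNH, `Q,KV` with a 5D BxSxNx2xH key, and `QKV` with a 5D BxSxNx3xH query) can be made concrete with a short NumPy sketch. It only shows how the stated shapes relate to each other; it is not how ONNX Runtime lays out or converts these tensors internally, and the function names `pack_kv`, `pack_qkv`, and `unpack_qkv` are hypothetical.

```python
import numpy as np

# Small illustrative sizes: batch, sequence length, heads, head dim.
B, S, N, H = 2, 4, 3, 8

rng = np.random.default_rng(0)
# "Q,K,V" format: three separate tensors of shape B x S x (N*H).
q = rng.standard_normal((B, S, N * H)).astype(np.float16)
k = rng.standard_normal((B, S, N * H)).astype(np.float16)
v = rng.standard_normal((B, S, N * H)).astype(np.float16)


def _split_heads(x, num_heads):
    """Reshape B x S x (N*H) into B x S x N x H."""
    b, s, nh = x.shape
    return x.reshape(b, s, num_heads, nh // num_heads)


def pack_kv(k, v, num_heads):
    """"Q,KV" format: key and value packed as B x S x N x 2 x H."""
    return np.stack([_split_heads(k, num_heads), _split_heads(v, num_heads)], axis=3)


def pack_qkv(q, k, v, num_heads):
    """"QKV" format: query, key, value packed as B x S x N x 3 x H."""
    return np.stack([_split_heads(t, num_heads) for t in (q, k, v)], axis=3)


def unpack_qkv(qkv):
    """Recover separate B x S x (N*H) tensors from packed QKV."""
    b, s, n, three, h = qkv.shape
    assert three == 3
    return tuple(qkv[:, :, :, i, :].reshape(b, s, n * h) for i in range(3))


if __name__ == "__main__":
    print(pack_kv(k, v, N).shape)      # packed KV shape
    print(pack_qkv(q, k, v, N).shape)  # packed QKV shape
```

Unpacking the interleaved layout requires gathering every third slice along axis 3, which gives a rough sense of why a kernel that cannot consume packed QKV directly needs an extra data-movement step, as the commit message notes.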
Wanming Lin	3a53836836	[WebNN EP] Fix compilation with newer flatbuffers (#17367 )	2023-08-31 10:22:15 -07:00
Artem Shilkin	6e60dba726	Fix compilation with newer flatbuffers (#17164 ) flatbuffers@v23.5.9 broke the forward declaration of FlatBufferBuilder. Compiling onnxruntime fails with the following error: ``` flatbuffers/include/flatbuffers/flatbuffer_builder.h:1420:38: error: typedef redefinition with different types ('FlatBufferBuilderImpl<false>' vs 'flatbuffers::FlatBufferBuilder') typedef FlatBufferBuilderImpl<false> FlatBufferBuilder; ^ onnx_runtime/include/onnxruntime/core/graph/graph.h:47:11: note: previous definition is here class FlatBufferBuilder; ``` This PR removes these declarations and uses includes instead	2023-08-29 10:28:26 -07:00
Caroline	228db24317	Add training API functions to WASM API (#16521 ) ### Description * Created `wasm/training_api` source and header files & modified WebAssembly CMake to include training flags * The `wasm/training_api` files use an `OrtTrainingManager` handle which is a struct of an OrtCheckpointState and an OrtTrainingSession, rather than creating a CheckpointState handle & a separate TrainingSession handle. * This is so that the TypeScript side only has to manage one handle that will be passed between TrainingSession & CheckpointState representations, rather than the TypeScript side managing separate CheckpointStateHandle and TrainingSessionHandle. ### Motivation and Context WASM API needs to be updated with ORT training API function calls so that ORT training web bindings can be added for on-device training. --------- Co-authored-by: Baiju Meswani <bmeswani@microsoft.com> Co-authored-by: carzh <carolinezhu@microsoft.com> Co-authored-by: Ashwini Khade <askhade@microsoft.com>	2023-08-28 11:05:02 -07:00
Arthur Islamov	c262879214	Added DML and CUDA provider support in onnxruntime-node (#16050 ) ### Description I've added changes to support CUDA and DML (only on Windows, on other platforms it will throw an error) ### Motivation and Context It fixes this feature request https://github.com/microsoft/onnxruntime/issues/14127 which is tracked here https://github.com/microsoft/onnxruntime/issues/14529 I was working on StableDiffusion implementation for node.js and it is very slow on CPU, so GPU support is essential. Here is a working demo with a patched and precompiled version https://github.com/dakenf/stable-diffusion-nodejs ---------	2023-08-25 16:57:06 -07:00
Yulong Wang	79c4ed9a45	[js/webgpu] support error pop and kernel name (#17260 ) ### Description This PR contains changes to support error pop and kernel name. - Add a function `JsepGetNodeName` to allow reading kernel name from JS to C++ - When in debug mode ( `env.debug = true;` ) or in profiling mode ( `env.webgpu.profilingMode = 'default';` ), kernel name will be read from ORT; otherwise use the kernel pointer ( a number ) as kernel name to save calls from JS to C++. - When in debug mode, WebGPU validation errors will be recorded and if any error occurs, `inferenceSession.run()` will fail (the Promise is rejected). Behavior when not in debug mode is not changed. This is because recording errors is not zero-overhead, and GPU validation errors should occur consistently whether or not debug mode is on. - Add `jsepOnRunStart()` and `jsepOnRunEnd()` hooks to: - allow implementation of the features mentioned above. - pass session ID to backend.	2023-08-25 08:08:15 -07:00
mindest	735cc8e6c8	[ROCm] enable If op for ROCm EP. (#17279 ) ### Description Enable If op for ROCm EP.	2023-08-25 17:49:49 +08:00
Baiju Meswani	fca81cc5d5	ConvTransposeGrad CUDA Kernel (#17201 )	2023-08-24 09:08:06 -07:00
cloudhan	87bef1f3f2	Move composable_kernel to deps.txt (#17245 )	2023-08-23 17:39:16 -07:00
kunal-vaishnavi	edac3ef150	Add LLaMA scripts (#17020 ) ### Description This PR adds the following scripts for LLaMA: - LLaMA conversion (support for TorchScript and Dynamo exporters) - LLaMA parity - LLaMA benchmark - LLaMA quantization - LLaMA integration with [Hugging Face Optimum](https://github.com/huggingface/optimum) ### Motivation and Context This PR adds scripts for using LLaMA. There is a [follow-up PR](https://github.com/microsoft/onnxruntime/pull/17043) for adding scripts for Whisper.	2023-08-22 18:05:11 -07:00
Edward Chen	bd8a488f4b	Enable verbose logging in unit test program with environment variable. (#17133 ) Enable verbose logging in unit test program with environment variable. E.g., `ORT_UNIT_TEST_MAIN_LOG_LEVEL=0 ./onnxruntime_test_all --gtest_filter="<test that I want to see more logs for>"`.	2023-08-22 12:13:52 -07:00
cloudhan	4e6cec4d09	Update ck and enable test (#16383 ) Apply the fix in https://github.com/ROCmSoftwarePlatform/composable_kernel/issues/728 Introduce more kernel instances and allow the introduction of streamk and splitk.	2023-08-22 11:08:55 +08:00
Sheil Kumar	cbaa008391	Bump DirectML version from 1.12.0 to 1.12.1 (#17225 ) Bump DirectML version from 1.12.0 to 1.12.1 Co-authored-by: Sheil Kumar <sheilk@microsoft.com>	2023-08-20 09:55:38 -07:00
Changming Sun	3cec88bd12	FIX: memory leak checker is incompatible with std::stacktrace (#17209 ) ### Description When I worked on PR #17173, I didn't notice that onnxruntime\core\platform\windows\debug_alloc.cc also needs to call dbghelp functions like SymInitialize. So, if we use vc runtime's stacktrace functionality, vc runtime will initialize/uninitialize the dbghelp library independently and vc runtime's stacktrace helper DLLs get unloaded before our memory leak checker starts to work. Then, when we call SymSetOptions, it crashes. More details: In VC runtime the C++23 stacktrace functions are implemented on top of dbgeng.dll. In C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Tools\MSVC\14.37.32822\crt\src\stl\stacktrace.cpp, you can see it has: ``` dbgeng = LoadLibraryExW(L"dbgeng.dll", nullptr, LOAD_LIBRARY_SEARCH_SYSTEM32); ``` The dbgeng.dll is a wrapper around dbghelp.dll. It calls SymInitialize and SymCleanup. dbgeng.dll gets unloaded before our memory leak check starts to run. In theory we should be able to call SymInitialize again if the previous user who called SymInitialize has also called SymCleanup. However, users can use SymRegisterCallback/SymRegisterCallback64/SymRegisterCallbackW64 to register callback functions to dbghelp.dll. These callback functions need to be alive when SymSetOptions (and some other dbghelp APIs) get called.	2023-08-18 17:10:33 -07:00
Changming Sun	ee09a5ff35	Add DISABLE_CUSPARSE_DEPRECATED flag to CUDA build (#17207 ) This is to suppress a warning and make Windows CUDA 12.2 build work.	2023-08-18 10:25:49 -07:00
Chi Lo	2fb148dd88	Temporarily enforce "Debug build" TRT EP with trt oss parser on Windows (#17059 ) This PR handles two changes: 1. There is an issue when running "Debug build" TRT EP with "Release build" TRT builtin parser on Windows. Enforce use of the oss parser for Debug builds. Note: args.config in build.py is an array, for example ["Debug", "Release"...]. The code would be much messier if we made the change there. 2. Update to use the latest commit of the oss parser. Please see https://github.com/microsoft/onnxruntime/issues/16273	2023-08-17 12:17:25 -07:00
Changming Sun	5249b7ab7c	Re-implement stacktrace (#17173 ) ### Description Re-implement stacktrace. The new implementation doesn't directly use the Windows API, and hence avoids problems with initializing/uninitializing the dbghelp library.	2023-08-16 16:07:49 -07:00
Dmitri Smirnov	f45eef399e	Fix visualization issues with Attribute/Tensor protos (#17188 ) ### Description Protobuf Natvis	2023-08-16 13:56:51 -07:00
RandySheriffH	3dd2c1b4d7	EP context for custom op (#16454 ) Implement infrastructures to allow EP resources surfaced to custom ops. --------- Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2023-08-16 13:03:40 -07:00
Maximilian Müller	7b9d1f18c7	NVTX windows include and link fixes (#16831 ) ### Description For windows headers are not duplicated to the normal cuda include. For linux they are: ``` (base) maximilianm@maximilianm-dt-linux:~$ ls /usr/local/cuda/include/nvtx3 \| grep nvTool nvToolsExt.h nvToolsExtCuda.h nvToolsExtCudaRt.h nvToolsExtOpenCL.h nvToolsExtSync.h (base) maximilianm@maximilianm-dt-linux:~$ ls /usr/local/cuda/include \| grep nvTool nvToolsExt.h nvToolsExtCuda.h nvToolsExtCudaRt.h nvToolsExtOpenCL.h nvToolsExtSync.h ``` Is the preference via those added defines or should the include just be changed to be `nvtx3/` ? Also there is no library linking needed on Windows and the library is not even present.	2023-08-16 11:53:58 -07:00
Changming Sun	8e203efc69	Cleanup cmake file (#17154 ) ### Description 1. Clean up cmake files. Remove some unused code 2. Remove the "Semmle" task from tools/ci_build/github/azure-pipelines/templates/win-ci.yml. Semmle is deprecated and replaced by CodeQL.	2023-08-15 10:51:33 -07:00
Matthieu Darbois	5e971bc51a	Rework WIL dependency retrieval/usage (#17130 ) ### Description 1. `onnxruntime_fetchcontent_makeavailable` works around unconditional install commands so that it can be used instead of `FetchContent_Populate` 2. This dependency is Windows specific, mark it as such. ### Motivation and Context 1. This simplifies `cmake/external/wil.cmake` not to do anything specific whether WIL was fetched or found 2. Given it's specific to Windows, it might not be available on other OSes in specific air-gapped environments such as [conan-center-index](https://github.com/conan-io/conan-center-index). This allows downstream builds not to require specific patches for something not required by the build in the first place.	2023-08-15 09:11:46 -07:00
Wenbing Li	d052c8a45c	Remove the extensions submodule (#17097 ) ### Description Remove the onnxruntime-extensions submodule since it is now used via cmake FetchContent ### Motivation and Context The submodule relies on an outdated version of the extensions, and the build instructions should be updated to eliminate any confusion.	2023-08-14 10:16:33 -07:00
Yulong Wang	5704e71b89	update onnx.patch to apply wasm build break fix (#17104 ) ### Description This PR fixes build break for WebAssembly introduced in `6986981482` (`435ad2b1d8`). This change updates onnx.patch in onnxruntime repo. The corresponding PR in onnx repo is: https://github.com/onnx/onnx/pull/5495. It may take a while for the next onnx version bump.	2023-08-11 15:00:39 -07:00
Changming Sun	4728f20f9a	Fix CI build (#17118 ) ### Description Some pipelines are failing because PR #16325 set the ONNX version to `rel-1.14.1`. That is a branch name, not a commit or tag name, which means that whenever the branch gets a new commit, we automatically pick it up and use it.	2023-08-11 10:56:38 -07:00
Yulong Wang	9cd4e5af68	[wasm] upgrade emsdk to 3.1.44 (#17069 ) ### Description This change upgrades emsdk to 3.1.44. Because the backend is upgraded to LLVM 16, a lot of build failures caused by "-Wshorten-64-to-32" need to be fixed. Most of the build failures come from the generated `onnx.pb.h`, and this can be fixed by including "core/graph/onnx_protobuf.h", which detects and ignores shorten-64-to-32 warnings.	2023-08-10 16:08:36 -07:00
Bowen Bao	6986981482	Bump ONNX version (#16325 ) ### Description Bump ONNX version to https://github.com/onnx/onnx/tree/rel-1.14.1 to include a fix for segfault when shape inferencing nested onnx functions. ### Motivation and Context Resolves #16170	2023-08-10 11:27:28 -07:00
Jeff Daily	dbbfc249f7	[ROCm] update header and binary search paths used by cmake (#17083 ) This is in preparation for planned ROCm 6.0 changes that are not backward compatible. However, the adjustments made by this PR to the current onnxruntime cmake files will work with ROCm 5.x and 6.x.	2023-08-10 11:05:21 +08:00


1776 commits