Commit graph

1776 commits

Author SHA1 Message Date
sfatimar
4d1963c2a2
OpenVINO EP Rel 1.18 Changes (#20337)
### Description
These changes include
Support to OpenVINO 2024.1 
Import PreCompiled Blobs with EPContext Blob 
Separate Device/Precision as input
Deprecate CPU_FP32 , GPU_FP32 terminology , introduce CPU, GPU 
AUTO GPU, CPU will only create GPU Blob and not CPU Blob. 



### Motivation and Context
- OpenVINO 2024.1 will be out soon
- Import Precompiled Blob can greatly reduce FEIL/FIL Time. 
- Separating Device/Precision will make the input cleaner
-

---------

Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
2024-04-19 00:31:38 -07:00
Patrice Vignola
12569626cb
Update DML to 1.14.1 (#20380)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-04-18 22:43:41 -07:00
Patrice Vignola
745b426c60
[DML] Update DML to 1.14 (#20304)
I am prefiring this change to pre-run the non-dml checks, and also to
give folks the time to review it before DML gets released. When DML 1.14
officially releases, we'll only need to run the DML pipeline to
automatically pick up the nuget package. This should save us some
valuable time.

Note that DML 1.14 is the release needed for ORT 1.17.4, and DML 1.15
will come soon after.
2024-04-18 16:22:57 -07:00
Adam Louly
ee74fb6908
Introducing ORTPipelineModule - DeepSpeed Parallel Pipeline Support. (#20287)
### Description
Introducing a new class ORTPipelineModule to handle wrapping layers in
DeepSpeed pipeline parallel.


### Motivation and Context
To support pipeline parallelism on ORTModule.

This PR will include an initial support of deepspeed Pipeline
parallelism.

- [x] Support Pipeline parallel where layers are nn Modules in
Sequential.
- [ ] Support LayerSpec and TiedLayerSpec
- [ ] Enable partitioning to accept List
- [ ] Full-GPU Graph Consolidation
- [ ] Subgraph Merging for Inference
2024-04-18 11:30:15 -07:00
Sumit Agarwal
f664f91298
[DML EP] Expose NPU macro via build command (#20306)
### Description
This fixes following things:
- Expose `ENABLE_NPU_ADAPTER_ENUMERATION` macro via build command, so
that a user can enable NPU support for DML EP seamlessly.
- Add keyword `_dmlEp_` as part of the node name, which would be useful
for debugging purpose.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-04-18 11:23:13 -07:00
Patrice Vignola
76434907fb
[DML EP] Add graph capture (#20257)
This adds a new "Graph Capture" option to the DML ep, similar to the
cuda graph functionality. Here's how graph capture works:

- A user can enable graph capture in the session options by setting
`ep.dml.enable_graph_capture` to `true`
- When they want to capture a run, they set `gpu_graph_id` in their
`RunOptions` to a number bigger than 0 (0 is reserved for internal use
according to the cuda graph documentation).
- Then, when they start the inference, the graph will be captured and
stored in the DML EP for future use
- When they execute the run for a second time with the same id, the
`ReplayGraph` function in the DML EP will be called instead of executing
the kernels, resulting in very low overhead and avoiding kernel
recompilation.

This feature can give up-to-par or even better performance than
specifying the static dimensions at session creation time, but is also
much more flexible.
2024-04-18 10:15:00 -07:00
Adrian Lizarraga
0a1902525f
Add patch for ONNX 1.16.0 shape inference bug (#20316)
### Description
- Adds a patch that fixes a shape inference bug that caused a segfault:
https://github.com/onnx/onnx/pull/6080
- Fix documentation describing why QLinearMatMul tests are currently
being skipped.



### Motivation and Context
The [PR for integrating with ONNX
1.16.0](https://github.com/microsoft/onnxruntime/pull/19745) disabled
various python quantization tests due to a shape inference bug. This PR
applies the ONNX fix as a patch. We still can't enable the tests because
some of our CIs pip install onnx-1.16.0, which doesn't include the fix.
2024-04-17 10:23:22 -07:00
Yi-Hong Lyu
6b6a62fb40
Add vectorized AVX512F kernel for ReduceMaximumF32Kernel (#20268)
### Description
<!-- Describe your changes. -->

This commit introduces a new vectorized AVX512F kernel,
MlasReduceMaximumF32KernelAvx512F, which efficiently computes the
maximum value of the supplied buffer. Additionally, microbenchmarks have
been added for MlasComputeSoftmax (inplace),
MlasReduceMaximumF32KernelAvx, MlasComputeSumExpF32KernelAvx512F, and
MlasComputeSoftmaxOutputF32KernelAvx.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

The goal of this commit is to enhance the performance of
ReduceMaximumF32Kernel on CPUs with AVX512F instruction support.
  | AVX |   |   | AVX512 |   |   |  
-- | -- | -- | -- | -- | -- | -- | --
name | iterations | real_time | cpu_time | iterations | real_time |
cpu_time | time_unit
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:3/real_time | 271277304 |
2.58095 | 2.58091 | 263338132 | 2.65661 | 2.65661 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:3/real_time | 271220477 |
2.58095 | 2.58095 | 263509929 | 2.65652 | 2.65649 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:3/real_time | 271240587 |
2.58064 | 2.58064 | 263479542 | 2.65671 | 2.65665 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:3/real_time | 271227745 |
2.58083 | 2.58079 | 263402506 | 2.65657 | 2.65657 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:3/real_time | 271255069 |
2.58073 | 2.58071 | 263463858 | 2.65682 | 2.65682 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:3/real_time | 271257174 |
2.58058 | 2.58052 | 263460120 | 2.65682 | 2.65682 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:4/real_time | 174395051 |
4.01401 | 4.01401 | 197330481 | 3.5465 | 3.54636 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:4/real_time | 174645502 |
3.99691 | 3.99691 | 197474831 | 3.54298 | 3.54278 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:4/real_time | 174523308 |
4.01391 | 4.01386 | 197389981 | 3.54518 | 3.54506 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:4/real_time | 174779200 |
3.99874 | 3.99874 | 197519075 | 3.54227 | 3.54209 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:4/real_time | 174642874 |
4.00645 | 4.00641 | 197642101 | 3.54195 | 3.54188 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:4/real_time | 174546754 |
4.0061 | 4.00608 | 197621033 | 3.54296 | 3.54281 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:5/real_time | 162752651 |
4.30119 | 4.30114 | 215552503 | 3.24767 | 3.24752 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:5/real_time | 162717463 |
4.30123 | 4.30116 | 215541082 | 3.24711 | 3.24695 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:5/real_time | 162718819 |
4.3016 | 4.30153 | 215589239 | 3.24725 | 3.24708 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:5/real_time | 162719596 |
4.30151 | 4.30145 | 215563846 | 3.24956 | 3.24949 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:5/real_time | 162753333 |
4.30125 | 4.30125 | 215537315 | 3.24924 | 3.24908 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:5/real_time | 162752258 |
4.3014 | 4.30141 | 215526482 | 3.24744 | 3.24735 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:7/real_time | 143579660 |
4.87526 | 4.87516 | 100000000 | 5.25767 | 5.25752 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:7/real_time | 143585097 |
4.87476 | 4.87467 | 100000000 | 5.41583 | 5.41567 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:7/real_time | 143571011 |
4.87506 | 4.87503 | 182359467 | 3.83773 | 3.83764 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:7/real_time | 143587142 |
4.87487 | 4.8748 | 182397261 | 3.83807 | 3.8379 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:7/real_time | 143578465 |
4.87525 | 4.87521 | 182428602 | 3.83777 | 3.83768 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:7/real_time | 143588555 |
4.87491 | 4.87488 | 125280452 | 5.59791 | 5.59766 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:9/real_time | 284851058 |
2.43476 | 2.43476 | 156879863 | 4.42895 | 4.42884 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:9/real_time | 270700898 |
2.59031 | 2.59024 | 157953114 | 4.42995 | 4.42968 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:9/real_time | 282871172 |
2.45385 | 2.45385 | 157801156 | 4.42817 | 4.42804 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:9/real_time | 285307738 |
2.47009 | 2.47005 | 158058507 | 4.4279 | 4.42786 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:9/real_time | 285709536 |
2.45481 | 2.45476 | 158070961 | 4.42809 | 4.42799 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:9/real_time | 285449733 |
2.47495 | 2.47491 | 158069718 | 4.45026 | 4.45017 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:11/real_time | 189213618 |
3.79684 | 3.79676 | 139459497 | 5.01882 | 5.01871 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:11/real_time | 185600468 |
3.76394 | 3.76376 | 139444892 | 5.01922 | 5.01905 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:11/real_time | 184968668 |
3.80636 | 3.80636 | 139470834 | 5.01948 | 5.01936 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:11/real_time | 183867226 |
3.80432 | 3.80427 | 139481986 | 5.01975 | 5.01944 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:11/real_time | 184301650 |
3.81634 | 3.81634 | 139452846 | 5.01983 | 5.01972 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:11/real_time | 186215795 |
3.82659 | 3.82654 | 139497736 | 5.02119 | 5.02113 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:13/real_time | 135622415 |
5.16256 | 5.16252 | 124661337 | 5.61227 | 5.61194 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:13/real_time | 135618907 |
5.15967 | 5.1596 | 124805224 | 5.6088 | 5.60854 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:13/real_time | 135612192 |
5.15506 | 5.15501 | 124803221 | 5.60901 | 5.60869 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:13/real_time | 135906082 |
5.15818 | 5.15818 | 124776601 | 5.60898 | 5.60886 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:13/real_time | 135369523 |
5.15709 | 5.15682 | 124790370 | 5.60927 | 5.60902 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:13/real_time | 135596827 |
5.1603 | 5.1603 | 124792145 | 5.61637 | 5.61614 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:15/real_time | 110947137 |
5.96511 | 5.96495 | 112861522 | 6.20035 | 6.20014 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:15/real_time | 118004792 |
6.22645 | 6.22628 | 112909900 | 6.20073 | 6.20073 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:15/real_time | 112630319 |
6.25564 | 6.25552 | 112874563 | 6.19932 | 6.19924 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:15/real_time | 117403034 |
6.17263 | 6.17258 | 112927318 | 6.19866 | 6.19842 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:15/real_time | 108921863 |
6.48624 | 6.48612 | 112927746 | 6.20057 | 6.20026 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:15/real_time | 110358148 |
6.66805 | 6.66789 | 112907312 | 6.19938 | 6.19908 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:16/real_time | 203419574 |
3.4415 | 3.44137 | 237134525 | 2.95649 | 2.95638 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:16/real_time | 203414035 |
3.4411 | 3.44099 | 237129564 | 2.95178 | 2.95171 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:16/real_time | 203404068 |
3.44157 | 3.44151 | 236981704 | 2.9518 | 2.95167 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:16/real_time | 203391471 |
3.44146 | 3.44137 | 237108807 | 2.95203 | 2.95196 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:16/real_time | 203393801 |
3.44131 | 3.44127 | 237126460 | 2.95278 | 2.95272 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:16/real_time | 203407476 |
3.44181 | 3.44162 | 237154444 | 2.95293 | 2.9528 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:500/real_time | 37551439 |
18.6407 | 18.6407 | 39222534 | 17.858 | 17.8571 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:500/real_time | 37544097 |
18.6404 | 18.6401 | 39174151 | 17.8539 | 17.8536 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:500/real_time | 37549837 |
18.6391 | 18.6391 | 39233956 | 17.8507 | 17.8505 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:500/real_time | 45996345 |
15.2157 | 15.2153 | 39285929 | 17.848 | 17.8474 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:500/real_time | 46012429 |
15.2184 | 15.2179 | 65664865 | 10.7366 | 10.7364 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:500/real_time | 45912375 |
15.2349 | 15.2346 | 65205908 | 10.8498 | 10.8492 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:2000/real_time | 9493955 |
73.7232 | 73.7203 | 10188090 | 68.7931 | 68.7908 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:2000/real_time | 9495562 |
73.7173 | 73.7173 | 10180895 | 68.7533 | 68.7511 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:2000/real_time | 9487371 |
73.7852 | 73.7831 | 10164473 | 68.7279 | 68.725 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:2000/real_time | 10816047 |
64.7322 | 64.7287 | 10168481 | 68.8109 | 68.8096 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:2000/real_time | 10808802 |
64.7232 | 64.721 | 19478320 | 36.1471 | 36.1461 | ns
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:2000/real_time | 10818192 |
64.7304 | 64.728 | 19419672 | 35.9635 | 35.9635 | ns
2024-04-16 13:52:43 -07:00
George Wu
08d208b969
[QNN EP] refactor QNN deps/copy logic. start copying deps to target python loc… (#20317)
copy QNN deps when building python bindings as well.
tweak the wildcard to only copy QNN related files. latest sdk from
Qualcomm (>= 2.21) also include SNPE dll's which we don't want to
include.
2024-04-15 22:33:12 -07:00
liqun Fu
cd7112f800
Integration with ONNX 1.16.0 (#19745)
### Description
update with ONNX 1.16.0 branch according to
https://github.com/microsoft/onnxruntime/blob/main/docs/How_To_Update_ONNX_Dev_Notes.md

ONNX 1.16.0 release notes:
https://github.com/onnx/onnx/releases/tag/v1.16.0

#### Updated ops for CPU EP:
- DequantizeLinear(21)
  - Added int16 and uint16 support + various optimizer tests
  - Missing int4 and uint4 support
  - Missing block dequantization support
- QuantizeLinear(21)
  - Added int16 and uint16 support + various optimizer tests
  - Missing int4 and uint4 support
  - Missing block quantization support
- Cast(21)
  - Missing int4 and uint4 support
- CastLike(21)
  - Missing int4 and uint4 support
- ConstantOfShape(21)
  - Missing int4 and uint4 support
- Identity(21)
  - Missing int4 and uint4 support
- If(21)
  - Missing int4 and uint4 support
- Loop(21)
  - Missing int4 and uint4 support
- Reshape(21)
  - Missing int4 and uint4 support
- Scan(21)
  - Missing int4 and uint4 support
- Shape(21)
  - Missing int4 and uint4 support
- Size(21)
  - Missing int4 and uint4 support
- Flatten(21)
- Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4
support
- Pad(21)
- Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4
support
- Squeeze(21)
- Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4
support
- Transpose(21)
- Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4
support
- Unsqueeze(21)
- Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4
support

#### Unimplemented opset 21 features/ops
- int4 and uint4 data type
- QLinearMatMul(21)
- GroupNormalization(21)
- ai.onnx.ml.TreeEnsemble(5)

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

### Disabled tests
#### ORT Training

orttraining/orttraining/test/python/orttraining_test_ort_apis_py_bindings.py
- test_ort_custom_ops: Potential shape inference bug for custom ops

#### Python quantization unit tests
test/onnx/python/quantization (shape inference bug)
- test_op_conv_transpose.py: test_quantize_conv_transpose_u8u8_fp16
- test_op_conv_transpose.py: test_quantize_conv_transpose_s8s8_fp16
- test_op_gemm.py: test_quantize_qop_gemm_s8s8
- test_op_gemm.py: test_quantize_qop_gemm_e4m3fn_same
 - test_op_gemm.py: test_quantize_qop_gemm_e4m3fn_p3
- test_op_matmul.py: test_quantize_matmul_u8u8_f16
- test_op_matmul.py: test_quantize_matmul_s8s8_f16
- test_op_matmul.py: test_quantize_matmul_s8s8_f16_entropy
- test_op_matmul.py: test_quantize_matmul_s8s8_f16_percentile
- test_op_matmul.py: test_quantize_matmul_s8s8_f16_distribution
- test_op_relu.py: test_quantize_qop_relu_s8s8

#### ONNX tests
- test_maxpool_2d_ceil_output_size_reduce_by_one: ONNX 1.16.0 fixed a
maxpool output size bug and added this test. Enable this test when [ORT
PR](https://github.com/microsoft/onnxruntime/pull/18377) is merged.
Refer to original [ONNX PR](https://github.com/onnx/onnx/pull/5741).
- test_ai_onnx_ml_tree_ensemble_set_membership_cpu: new unimplemented op
ai.onnx.ml.TreeEnsemble
- test_ai_onnx_ml_tree_ensemble_single_tree_cpu: same
- test_ai_onnx_ml_tree_ensemble_set_membership_cuda: same
- test_ai_onnx_ml_tree_ensemble_single_tree_cuda: same
- test_cast_INT4_to_FLOAT_cpu: ORT Cast(21) impl doesn't support int4
yet
- test_cast_INT4_to_INT8_cpu: same
- test_cast_UINT4_to_FLOAT_cpu: same
- test_cast_UINT4_to_UINT8_cpu: same
- test_cast_INT4_to_FLOAT_cuda
- test_cast_INT4_to_INT8_cuda
- test_cast_UINT4_to_FLOAT_cuda
- test_cast_UINT4_to_UINT8_cuda
- test_constantofshape_float_ones_cuda: ConstantOfShape(21) not
implemented for cuda
- test_constantofshape_int_shape_zero_cuda: same
- test_constantofshape_int_zeros_cuda: same
- test_flatten_axis0_cuda: Flatten(21) not implemented for cuda
- test_flatten_axis1_cuda: same
- test_flatten_axis2_cuda: same
- test_flatten_axis3_cuda: same
- test_flatten_default_axis_cuda: same
- test_flatten_negative_axis1_cuda: same
- test_flatten_negative_axis2_cuda: same
- test_flatten_negative_axis3_cuda: same
- test_flatten_negative_axis4_cuda: same
- test_qlinearmatmul_2D_int8_float16_cpu: QLinearMatMul(21) for onnx not
implemented in ORT yet
- test_qlinearmatmul_2D_int8_float32_cpu: same
- test_qlinearmatmul_2D_uint8_float16_cpu: same
- test_qlinearmatmul_2D_uint8_float32_cpu: same
- test_qlinearmatmul_3D_int8_float16_cpu: same
- test_qlinearmatmul_3D_int8_float32_cpu: same
- test_qlinearmatmul_3D_uint8_float16_cpu: same
- test_qlinearmatmul_3D_uint8_float32_cpu: same
- test_qlinearmatmul_2D_int8_float16_cuda: same
- test_qlinearmatmul_2D_int8_float32_cuda: same
- test_qlinearmatmul_2D_uint8_float16_cuda: same
- test_qlinearmatmul_2D_uint8_float32_cuda: same
- test_qlinearmatmul_3D_int8_float16_cuda: same
- test_qlinearmatmul_3D_int8_float32_cuda: same
- test_qlinearmatmul_3D_uint8_float16_cuda: same
- test_qlinearmatmul_3D_uint8_float32_cuda: same
- test_size_cuda: Size(21) not implemented for cuda
- test_size_example_cuda: same
- test_dequantizelinear_blocked: Missing implementation for block
dequant for DequantizeLinear(21)
- test_quantizelinear_blocked_asymmetric: Missing implementation for
block quant for QuantizeLinear(21)
- test_quantizelinear_blocked_symmetric: Missing implementation for
block quant for QuantizeLinear(21)

---------

Signed-off-by: liqunfu <liqun.fu@microsoft.com>
Signed-off-by: Ganesan Ramalingam <grama@microsoft.com>
Co-authored-by: Ganesan Ramalingam <grama@microsoft.com>
Co-authored-by: George Wu <jywu@microsoft.com>
Co-authored-by: adrianlizarraga <adlizarraga@microsoft.com>
2024-04-12 09:46:49 -07:00
Andrew Fantino
7303a90f49
Fix build errors from date/date.h C++20 compatibility (#20139)
### Description
For C++ standards >= 20, use `std::chrono::operator<<` in place of
`date::operator<<` to fix ambiguous operator compile error.

### Motivation and Context
The external dependency HowardHinnant/date has a conflict with
std::chrono for >=C++20.
Solves #20137
2024-04-02 22:10:25 -07:00
Yi Zhang
dae77e6014
Support building Windows CUDA with Ninja (#20176)
### How to run it locally
1. conda install ninja
2. "C:\Program Files\Microsoft Visual
Studio\2022\Enterprise\VC\Auxiliary\Build\vcvarsall.bat" x64
3. python.exe {ort_repo}\tools\ci_build\build.py --config RelWithDebInfo
--build_dir {ort_repo}\build_cuda --skip_submodule_sync --build_csharp
--update --parallel --cmake_generator "Ninja" --build_shared_lib
--enable_onnx_tests --enable_pybind --build_java --build_nodejs
--use_cuda "--cuda_home=C:\Program Files\NVIDIA GPU Computing
Toolkit\CUDA\v11.8" --enable_cuda_profiling --cmake_extra_defines
CMAKE_CUDA_ARCHITECTURES=60
4. cd build_cuda\RelWithDebInfo
5.  cmake --build . j16

### Motivation and Context
In packaging pipelines, we often come across a random issue that the
building with CUDA on Windows takes too much time.
Although it has been reduced much by moving the building to the CPU
machine.
We're planning to build with Ninja instead of msbuild in Packaging
pipelines, thus, nvcc can run parallelly.
It's the first step to support it locally.
2024-04-03 11:19:31 +08:00
Jeff Bloomfield
2f31560430
Enable generic feature level devices in DML EP (#20114)
### Description
Enable NPUs supporting DXCORE_ADAPTER_ATTRIBUTE_D3D12_GENERIC_ML and
D3D_FEATURE_LEVEL_1_0_GENERIC with DML EP. This also begins ingesting DX
headers through the DirectX-Headers repo.

Note that this includes an update to cgamanifest.json for onnx-tensorrt
which is triggered during re-generation due to a prior changes to
deps.txt.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-03-29 14:37:30 -07:00
Ye Wang
17919717b5
add QMoE (#20108)
### Description
<!-- Describe your changes. -->
1. Introduce latest cutlass extension from TRTLLM that gives us cutlass
upgrade(to 3.4) opportunity from MoE side.
2. Fix Windows build issue
3. Add Int4 MoE op and ut



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-03-29 10:24:19 -07:00
Dmitri Smirnov
b95fd4e644
Enable CUDA EP unit testing on Windows (#20039)
### Description
Address build issues and source code discrepancies.
Fix cuda_test_provider gtest argument stack corruption.

### Motivation and Context
`OpTester` class that is widely used for kernel testing is not
suitable for testing internal classes for EPs that are built as shared
objects.
Currently, CUDA EP tests run only on Linux.
We want to enable testing and developments on Windows,
and create a usable pattern for testing of other EPs internals.

Alternatives considered: 
Abstracting EP unit tests into separate test executable such as
`onnxruntime_test_all`.
This alternative was rejected as it would create a lot more changes in
the established patterns,
and potentially interfere with CUDA functionality with more complex
source code maintanence.
2024-03-27 13:32:36 -07:00
Dmitri Smirnov
3076b56947
Make MS Debug engine SymInitialize() called as needed. (#20036)
### Description
<!-- Describe your changes. -->
Initialize Symbol engine as needed with no duplicate calls.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
  Currently absel library may call SymInitialize more than once
  when shared libraries are involved. However, this can only be
  called only once per process. Our debug_alloc also may call it
  when enabled. This change enables intialization to proceed
  only when needed with no duplicate effort.
2024-03-22 16:17:47 -07:00
sfatimar
eab35c20fc
Ort openvino npu 1.17 master (#19966)
### Description
Add NPU to list of device supported. 
Added changes for Support to OV 2024.0
Nuget packages removes packaging of OpenVINO DLL 
Bug Fixes with Python API 
Reverted Dockerfiles not being maintained. 



### Motivation and Context
NPU Device has been introduced by Intel in latest client systems
OpenVINO 2024.0 release is out.

---------

Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: Ubuntu <ubuntu@ubuntu-118727.iind.intel.com>
Co-authored-by: hmamidix <hemax.sowjanya.mamidi@intel.com>
Co-authored-by: vthaniel <vishnudas.thaniel.s@intel.com>
Co-authored-by: saurabhkale17 <saurabh1.kale@intel.com>
2024-03-21 18:44:00 -07:00
Changming Sun
dafbef3a21
CMake: support reading dependency zip files from a local mirror (#20005)
### Description
To test this feature, run 
```bat
python cmake\deps_update_and_upload.py --root-path mirror
```
Then run build.py as usual. 

The zip files will be cached local. To avoid being downloaded again and
again.
2024-03-21 17:58:59 -07:00
Yufeng Li
15219e2e71
turn on neural_speed by default (#19627)
### Description
<!-- Describe your changes. -->
the crash caused by the neural_speed turns out to be a very corn case.
Turn it on by default.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-03-20 12:49:58 -07:00
Rachel Guo
6b305f95e0
Support xcframework for mac catalyst builds. (#19534)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

MAUI on macOS uses mac-catalyst which requires a different native
binary.

---------

Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
2024-03-20 10:55:19 -07:00
mindest
3dfe4a5e6d
[ROCm] Remove MPI dependency and collectives to use NCCL (#19830)
### Description
* Remove MPI dependency to use NCCL AllReduce, etc.
* Exclude unsupported collectives in hipify
2024-03-19 17:35:18 -07:00
Ted Themistokleous
6bb64683f8
Use version instead of version-dev for ROCm (#19967) 2024-03-19 10:40:40 +08:00
Adam Louly
32558134a9
[On-Device-Training] Upgrade Flatbuffers to Support 2GB+ Checkpoints. (#19770)
### Description
Modifications to support 2GB+ checkpoint & Upgrading Flatbuffers


### Motivation and Context
This PR includes changes that will make ort handle 2GB+ checkpoints.
To do that we need to upgrade flatbuffers to 23.5.9 -
https://github.com/google/flatbuffers/pull/7945

- Modified the commitHash and the hash for the new version
- Removed the patch for rust generator's unused variable warning as it
is no longer producing this - [Check it out
here](d121e09d89/src/idl_gen_rust.cpp)
- Updated the VerifyField calls with alignment values that were
introduced in the new version.

---------

Co-authored-by: Sumit Agarwal <sumitagarwal@microsoft.com>
2024-03-14 16:36:24 -07:00
Changming Sun
1fb6cbddee
Add a build patch for Windows ARM64EC (#19898)
### Description
Add a patch for Windows ARM64EC


### Motivation and Context
Will need more changes in onnxruntime/core/common/cpuid_arch_definition.h and onnxruntime/core/common/cpuid_info.cc
2024-03-14 08:50:42 -07:00
Jeff Daily
9443366009
[ROCm] fix build failure when nccl is enabled (#19900)
Building onnxruntime ROCm EP with --enable_nccl --use_mpi fails due to
inclusion of MOE source files but MOE is not supported. The error
observed is

`error: contrib_ops/rocm/moe/ft_moe/moe_kernel.h: No such file or
directory`

The fix is to exclude collective/sharded_moe.* files when nccl is
requested.
2024-03-13 21:16:54 -07:00
Adrian Lizarraga
9c3242ab70
[QNN EP] Copy security catalog file for HtpV73Skel.so from QNN SDK (#19903)
### Description
Copies the `QNN_HOME/lib/hexagon-v73/unsigned/libqnnhtpv73.cat` file
from QNN SDK to the unittest build directory. This is necessary in order
to be able to load the `libQnnHtpV73Skel.so` file on Windows for modern
versions of QNN SDK.

### Motivation and Context
A [digitally-signed catalog
file](https://learn.microsoft.com/en-us/windows-hardware/drivers/install/catalog-files)
(.cat) can be used as a digital signature for an arbitrary collection of
files.
2024-03-13 20:52:59 -07:00
Jake Mathern
18ad8587a6
[CP] Fix for xfgcheck and Fix WAI ARM64 build (#19634) (#19644)
### Description
Fix WAI build by only conditionally copying linker flags



### Motivation and Context
I broke the WAI build that contains ORT on ARM64
2024-03-13 17:54:06 -07:00
Edward Chen
860eb762c2
[Apple framework] Fix minimal build with training enabled. (#19858)
Fix some linker errors that come up when integrating the onnxruntime-training-c pod into another Xcode project. The problematic configuration is a minimal build with training APIs enabled.
- training_op_defs.o had some unresolved references to ONNX functions. It should not be included at all in a minimal build.
- tree_ensemble_helper.o also had unresolved references to ONNX ParseData. The containing function is unused in a minimal build.

Added a test to cover this configuration.
2024-03-12 11:33:30 -07:00
Scott McKay
978c40d853
Make partitioning utils QDQ aware so it does not break up QDQ node units (#19723)
### Description
<!-- Describe your changes. -->
If the EP handles QDQ node units, we need to make sure we do not split
those into different partitions.

Update the partitioning utils to be QDQ aware. If there are node units
we process the logical nodes they represent instead of individual nodes.
This ensure we process all nodes in a QDQ node unit at the same time so
that they are always in the same partition.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fix one of the issues in #19590

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2024-03-12 10:55:49 +10:00
Changming Sun
efad5bbc5a
Replace some old file system calls with C++17 std::filesystem APIs. (#19196)
### Description
1. Replace some old file system calls to use C++17 std::filesystem APIs.
2. Remove tensorflow_C_PACKAGE_PATH cmake option, which was only used in
onnxruntime_perf_test and the code is out of maintain.
3. Excludes onnx_test_runner and onnxruntime_perf_test from iOS build
because C++17 filesystem library is not available there
2024-03-09 09:17:36 -08:00
Scott McKay
db59cec82f
Don't reduce warning level for CUDA build on Windows (#19663)
### Description
<!-- Describe your changes. -->
Address warnings so all the ORT projects build with /W4 on Windows.

Mainly 
- unused parameters
- variables shadowing other ones

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
#19588 started on this.
2024-03-06 15:03:55 +10:00
Chi Lo
d9730c7f43
[TensorRT EP] Fix bug for DDS output handling for empty tensor (#19575)
When the DDS output is empty tensor (i.e. any of the dimension is 0),
TRT EP won't perform either cudaMemcpyAsync() nor cuda::Impl_Cast(), to
prevent accidentally overwriting other location that might belong to
other tensors.

This PR also refactors the code to only allocate single bytes for all
empty tensors.

#TODO: add unit tests to cover the DDS code paths or doing more testing
with concurrent,sequential, threaded faster-rcnn using onnx_test_runner
and verifying outputs

---------

Co-authored-by: Chi Lo <lochi@microsoft.com>
2024-03-05 14:39:36 -08:00
Chen Fu
06e684c9f2
Adding cuda kernel (optimized for sm80) for block-wise 4b quantized float 16 GEMM. (#18619)
### Description
Adding CUDA kernel for block-wise 4b quantized float 16 GEMM, this is
specially optimized for Nvidia Ampere GPUs.


### Motivation and Context
Trying to improve quantized LLM inference performance on Nvidia Ampere
GPUs

### Note:
This is implemented by extending CUTLASS, so it has a hard dependency on
CUTLASS. However, in current build system, loading of CUTLASS dependency
is guarded with:

(onnxruntime_USE_FLASH_ATTENTION OR
onnxruntime_USE_MEMORY_EFFICIENT_ATTENTION)

If both of these options are turned off, then compilation will fail.

Why CUTLASS dependency is guarded at all? It's a header file only
library that does not introduce any binary if not instantiated. What's
the downside of removing all the guards and just include CUTLASS
unconditionally?
2024-03-05 09:37:45 -08:00
Changming Sun
a0521f899e
Enable CPUINFO for all Windows build (#19655)
### Description
It was disabled in PR #9065. And the reason was:
" api-ms-win-core-kernel32-legacy-*.dll wasn't available in Windows 8
and was added in Windows 10, so cpuinfo breaks our Windows 8 support.
I'm disabling it again."

We no longer support Windows 8.  Therefore we can add CPUINFO back.

### Motivation and Context
To make the code simpler. If in any case the library doesn't work as
expected, we can submit a PR to their code base and fix it.
2024-03-01 16:23:20 -08:00
Edward Chen
5672cdebdf
Update google benchmark to 1.8.3. (#19734)
Update google benchmark to 1.8.3.
Update deps_update_and_upload.py script to make it easier to use.
2024-03-01 11:01:58 -08:00
Scott McKay
2a857d9a86
Add ML Program support for more operators (#19527)
### Description
<!-- Describe your changes. -->

Add support for:
- Clip/Relu/Relu6
- Add/Mul/Div/Sub/Pow
- GlobalAveragePool/GlobalMaxPool/AveragePool/MaxPool
- Reshape
- Gemm/MatMul

Fix some build issues/warnings from changes.

Fix a couple of potential issues with the Resize op as well (noticed due
to change to reject inputs with empty data at a higher level).

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Enable mobilenetv2 with ML Program
2024-03-01 10:23:29 +10:00
Maximilian Müller
c20ced4132
Use CMake's find package for CUDA libs (#19673)
### Description
Answers issue #19640 
More details are in the issue, basically I am changing all the include
directory and link directory usage to CMake's `CUDA::*` targets
2024-02-27 11:26:48 -08:00
cloudhan
1e69b61238
Make version string detection more robust (#19615)
`/opt/rocm/.info/version-dev` is only available if the `rocm-dev`
metapackage is installed. This will bring a lot of unused packages which
are not needed by the users, they may opt for fine grained control.
Fallback to `rocm_version.h` in case `rocm-dev` is not installed.
2024-02-27 16:06:06 +08:00
Changming Sun
9ccdc4961a
Stop using apiset in OneCore build: use onecoreuap.lib instead of onecoreuap_apiset.lib (#19632)
### Description
Stop using apiset in OneCore build: use onecoreuap.lib instead of
onecoreuap_apiset.lib in onecore build.


### Motivation and Context
1. Now all Windows Editions come with Reverse Forwarders. We should just
use the normal onecore libs.
2. Many new Windows APIs are only available in [windows umbrella
libraries](https://learn.microsoft.com/en-us/windows/win32/apiindex/windows-umbrella-libraries).
So these libraries are not specific for Windows CoreOS or Onecore.
3. Going forward we should use "IsApiSetImplemented" to guard our API
usages:

https://learn.microsoft.com/en-us/windows/win32/apiindex/detect-api-set-availability
.

After this change, our built binaries can pass apivalidator's check.

```
C:\local\apivalidator>apivalidator.exe -BinaryPath:C:\src\onnxruntime\b\Debug\Debug\onnxruntime.dll -SupportedApiXmlFiles:onecoreuap_DDIs.xml
ApiValidation:
Summary:
        "C:\src\onnxruntime\b\Debug\Debug\onnxruntime.dll" is Universal


ApiValidation: All binaries are Universal
```
So it will give an easy way to test ONNX Runtime's compatibility to
Windows versions.
2024-02-23 22:31:57 -08:00
cao lei
f430600432
Enable streams for DML EP. This change is to revert PR 19481 since the bug 19480 is fixed by PR 19515 (#19609)
### Description
<!-- Describe your changes. -->
Enable streams for DML EP. This change is to revert PR 19481 since the
bug 19480 is fixed by PR 19515


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Enable streams for DML EP. This change is to revert PR 19481 since the
bug 19480 is fixed by PR 19515
2024-02-23 06:02:05 -08:00
pengwa
ae92d593c0
ONNX Gelu Op in Opset 20 (#19560)
### ONNX Gelu Op in Opset 20

Refactor code to support MSDomain Gelu and ONNX Gelu-opset20 Op

1. Move CPU-GELU implmentation from
`onnxruntime/contrib_ops/cpu/activations.h/cc` to
`onnxruntime/core/providers/cpu/tensor/gelu.h/cc`, as the implementation
for approximate attribute to be 'none'.
2. Dumplicate some logic from
`onnxruntime/contrib_ops/cpu/bert/bias_gelu.cc` to
`onnxruntime/core/providers/cpu/tensor/gelu.h/cc`, as the implementation
for approximate attribute to be 'tanh'.
3. Register ONNX domain Gelu CPU kernel from opset 20 in
`onnxruntime/core/providers/cpu/cpu_execution_provider.cc`.
4. Move `onnxruntime/contrib_ops/cuda/bert/fast_gelu_impl.h/cu` to
`onnxruntime/core/providers/cuda/tensor/gelu_impl.h` and
`onnxruntime/core/providers/cuda/tensor/gelu_approximate_impl.cu`
respectively, as the implementation for approximate attribute to be
'tanh'.
5. Implement the logic for approximate attribute to be 'none' in
`onnxruntime/core/providers/cuda/tensor/gelu_impl.cu`.
6. Register ONNX domain Gelu CUDA kernel from opset 20 in
`onnxruntime/core/providers/cuda/cuda_execution_provider.cc`.
7. ROCM ep related changes. 
8. Enrich the tests for ONNX domain Gelu in
`onnxruntime/test/providers/cpu/activation/activation_op_test.cc`.
2024-02-23 11:05:16 +08:00
PeixuanZuo
6226c5f62f
[ROCm] Add SkipGroupNorm for ROCm EP (#19303)
Add SkipGroupNorm for ROCm EP.

---------

Co-authored-by: Peixuan Zuo <peixuanzuo@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2024-02-21 11:08:48 +08:00
Jake Mathern
7a5860e490
Fix cmake function duplicate lib (#19547)
### Description
Fixes cmake function definition in winml.cmake to copy link flags.



### Motivation and Context
XFGCheck errors in WindowsAI because this function does not transfer
linker flags
2024-02-20 13:41:40 -08:00
pengwa
b55260d076
Minor fix for cmake (#19552)
### Minor fix for cmake

When build on Linux, get a warning saying "
CMake Warning at CMakeLists.txt:1603 (message):
  MPI and NCCL disabled on Win build.
"

This message is not correct. So have such a fix to avoid any
misunderstanding from users.


![image](https://github.com/microsoft/onnxruntime/assets/10530022/848c2d77-a538-4e31-8e0d-4b539233e515)




### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-02-19 10:21:19 +08:00
Scott McKay
4e5119760d
Add initial support for CoreML ML Program to the CoreML EP. (#19347)
### Description
<!-- Describe your changes. -->
Adds infrastructure to create an ML Package containing the Model using
ML Program. Updated coremltools files to v7.1 to bring in new protobuf
definitions along with the tools to write the weight.bin file and create
an ML Package correctly.

Enables building a CoreML Model on all platforms which means all the
operator builder code can be debugged anywhere. Execution of the
generated CoreML model is obviously limited to Apple platforms.

The Conv operator builder has been updated to be able to generate an ML
Program Operation.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
NeuralNetwork is no longer being developed and ML Program is the
replacement going forward.
2024-02-15 08:46:03 +10:00
George Wu
5e70c6b3a6
allow protobuf lite build for TRT EP (#19498)
allow protobuf-lite builds with TensorRT EP as long as it's built with
the trt built-in parser and not the oss-parser.
This is because trt built-in parser statically links protobuf so there
aren't any conflicts for protobuf-lite.
2024-02-12 22:53:04 -08:00
Patrice Vignola
1182b5509b
Disable streams for the DML EP (#19481)
There's currently a bug in the allocation planner when reusing buffers
and more than one streams are used that make it possible (although
rarely) to reach a reference count of 0 for a buffer that is still being
used. Since DML doesn't benefit from multiple streams, disabling it is
the safest option for now.

This is a high priority issue that we need to fix for 1.17.1 since it
breaks stable diffusion. Identifying the perfect fix and fixing the
underlying issue would be too risky for a patch release, especially
given the limited time that we have.

https://github.com/microsoft/onnxruntime/issues/19480
2024-02-10 00:34:34 -08:00
Changming Sun
1007d8f3d1
Revert "Revert NeuralSpeed code for x64 MatMulNBits (#19382)" (#19474)
This reverts commit 0d10c7f3c1.
2024-02-09 09:24:54 -08:00
luoyu-intel
0d10c7f3c1
Revert NeuralSpeed code for x64 MatMulNBits (#19382)
### Description
<!-- Describe your changes. -->
Revert PR#19016 https://github.com/microsoft/onnxruntime/pull/19016
Revert PR#17669 https://github.com/microsoft/onnxruntime/pull/17669
2024-02-07 13:04:37 -08:00
Maximilian Müller
91b2e660fe
[Build] fix: missing nvcc flags when compiling with unittests (#19308)
When configured using the following CMake ops Clion is not able to
configure due to checking with `nvcc ... --dryrun tmp.cu`:
```
cmake -G Ninja -Donnxruntime_USE_TENSORRT="ON" -Donnxruntime_USE_CUDA="ON" -Donnxruntime_USE_CUDA_NHWC_OPS="ON" -DCMAKE_CUDA_ARCHITECTURES="native" -Donnxruntime_NVCC_THREADS=1 -Donnxruntime_ENABLE_NVTX_PROFILE="ON" -Donnxruntime_USE_TENSORRT_BUILTIN_PARSER="ON" -DCMAKE_CUDA_COMPILER_LAUNCHER="ccache" -Donnxruntime_BUILD_UNIT_TESTS="ON" -Donnxruntime_USE_TRITON_KERNEL=OFF -Donnxruntime_USE_FLASH_ATTENTION=OFF
```
Without building the unittests everything works fine. I believe my
changes only follow the logic that is actually desired. If
`NVCC_HAS_STRICT_ALIASING` is set to false it should not be possible to
add this as a CUDA flag. Same is true for `HAS_NOERROR` as seen in
`adjust_global_compile_flags.cmake`
2024-02-06 17:01:26 -08:00
Ye Wang
aaf32fb1b1
phi2 conversion/optimization script (#19338)
### Description
<!-- Describe your changes. -->
This PR adds 
onnx conversion script for dynamo exported phi2,
optimization script,
and inference example script

A readme file is added as documentation.
https://github.com/microsoft/onnxruntime/tree/wangye/phi2_doc/onnxruntime/python/tools/transformers/models/phi2#readme


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2024-02-05 10:15:16 -08:00
Scott McKay
debd1cab10
Add coremltools 7.1 as a dependency (#19389)
### Description
<!-- Describe your changes. -->
Setup usage of coremltools via dependencies instead of copying files. 
Pull in some changes from
https://github.com/microsoft/onnxruntime/pull/19347 in preparation for
supporting ML Program and enabling building the ML Model on all
platforms to make development and testing of CoreML EP code easier.

- Update to coremltools 7.1 
- Add patch for changes required for cross platform build of ML Program
related code
- Generate coreml proto files on all platforms
- mainly to test these changes work everywhere, as the proto files will
be used on all platforms when #19347 is checked in
- rename onnxruntime_coreml_proto target to coreml_proto as it contains
purely coreml protobuf code with no ORT related chagnes

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Improve setup.
2024-02-03 09:42:21 +10:00
He Li
1bdd7d9499
Update oneDNN to v3.0.1 in order to support gcc 13 (#19344)
### Description

Update the dependency of `oneDNN` to v3.0.1, which fixes a minor bug
hindering gcc 13.

### Motivation and Context


Referring to
[oneDNN-1548](https://github.com/oneapi-src/oneDNN/issues/1548).

- When building with `--use_dnnl` using gcc 13.x, it will fail due to
this upstream issue.
- This is fixed in `v3.0.1`
[tag](https://github.com/oneapi-src/oneDNN/tree/v3.0.1) by [this
commit](1d7971ce48).
2024-02-01 15:39:03 -08:00
Yueqing Zhang
1d6f13fb92
[VitisAI] Refactor the VAIEP to use MSFT's standalone API (#19058)
### Description
<!-- Describe your changes. -->
Refactor the VAIEP to use MSFT's standalone API


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Vitis ONNX RT VAI should switch to using the standalone API for ONNX EPs
in order to decouple the EP from onnxruntime.dll and the providers.dll.
This will help to simplify customer deployment of applications and use
cases that need to share their onnxruntime.dll with other applications.

---------

Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>
Co-authored-by: zz002 <zhenze.wang@amd.com>
2024-01-31 21:08:26 -08:00
Yi-Hong Lyu
55b60d8fe0
Turn off Neural Speed to avoid slowdowns (#19265)
Disable Neural Speed to prevent the operation following MatMulNBits from
significantly slowing down.
2024-01-31 13:40:25 -08:00
Phoebe Chen
2b361c04d6
Fix Flatbuffer build issue. (#19296)
### Description

Building on g++ 13.2.0 results in -Wstringop-overread errors on Linux.
This commit addresses the flatbuffer build issue with the following
changes:
1. Remove the Werror flag in the flarbuffer patch.
2. Add a compilation option to suppress the 'stringop-overflow' error in
the Flatbuffers within the xnnpack provider.

### Motivation and Context
https://github.com/google/flatbuffers/issues/8119
https://github.com/microsoft/onnxruntime/pull/19239

Signed-off-by: Phoebe Chen <phoebe.chen@sifive.com>
2024-01-31 10:12:43 -08:00
Changming Sun
8dad9d92f4
Move einsum's test data to constexpr variables (#19320)
### Description
emscripten's C++ compiler has difficulty on compiling einsum_test.cc
because the file has too many local variables. So I moved them to
constexpr.
2024-01-30 15:59:37 -08:00
Changming Sun
a92802f940
Disable a few tests for wasm build (#19316) 2024-01-30 08:16:57 -08:00
Tianlei Wu
8b4517218b
Remove USE_CUTLASS flag (#19271)
### Description
Since Cutlass can be built with CUDA 11.4 (The minimum CUDA version for
onnxruntime CUDA build), there is no need to have a flag to disable
cutlass.

Changes:
(1) Reverted https://github.com/microsoft/onnxruntime/pull/18761
(2) remove the condition to build cutlass.
(3) Fix a few build errors or warnings during testing CUDA 11.4 build. 

Note that SM 89 and 90 (including fp8) requires CUDA 11.8 or later.
Flash attention and cutlass fused multihead attention will not be built
for CUDA < 11.6. It is recommended to use CUDA 11.8 or above to build if
you want to support latest GPUs.

It is better to include it in 1.17.0 (otherwise, the release branch
might encounter build failure with CUDA 11.4).

Tests:
(1) Build with flash attention and efficient attention off: **passed**
(2) Build with CUDA 11.4: **passed**

Example build command used in Ubuntu 20.04:
```
export CUDA_HOME=/usr/local/cuda-11.4
export CUDNN_HOME=/usr/lib/x86_64-linux-gnu/
export CUDACXX=/usr/local/cuda-11.4/bin/nvcc

sh build.sh --config Release  --build_shared_lib --parallel  --use_cuda --cuda_version 11.4 \
            --cuda_home $CUDA_HOME --cudnn_home $CUDNN_HOME --build_wheel --skip_tests \
            --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=80 \
            --disable_types float8
```

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-01-25 16:57:58 -08:00
PeixuanZuo
1c92e56dc0
[Cuda] Refactor GroupNorm (#19146)
Split GroupNorm implementation into multiple files, to make ROCm EP can
reuse cuda code.

Related PR: https://github.com/microsoft/onnxruntime/pull/19158

---------

Co-authored-by: Peixuan Zuo <peixuanzuo@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2024-01-25 22:28:47 +08:00
Phoebe Chen
4477f57ee3
Enable RISC-V 64-bit Cross-Compiling Support for ONNX Runtime on Linux (#19238)
### Description  
This pull request introduces the necessary changes to enable RISC-V
64-bit cross-compiling support for the ONNX Runtime on Linux. The RISC-V
architecture has gained popularity as an open standard instruction set
architecture, and this contribution aims to extend ONNX Runtime's
compatibility to include RISC-V, thereby broadening the reach of ONNX
models to a wider range of devices.

### Motivation and Context
RISC-V is a free and open-source instruction set architecture (ISA)
based on established RISC principles. It is provided under open licenses
without fees. Due to its extensibility and freedom in both software and
hardware, RISC-V is poised for widespread adoption in the future,
especially in applications related to AI, parallel computing, and data
centers.

### Example Build Command
```
./build.sh --parallel --config Debug --rv64 --riscv_toolchain_root=/path/to/toolchain/root --skip_tests
```

### Documentation Updates
Relevant sections of the documentation will be updated to reflect the
newly supported RISC-V 64-bit cross-compilation feature.
https://github.com/microsoft/onnxruntime/pull/19239

---------

Signed-off-by: Phoebe Chen <phoebe.chen@sifive.com>
2024-01-24 16:27:05 -08:00
Changming Sun
bc54ad3f03
Update abseil to a release tag and register neural_speed (#19255)
### Description
Update abseil to a release tag and register neural_speed to CG.


### Motivation and Context
Now we are using a non-relesed version of abseil. Using a tag is better.
2024-01-24 14:37:39 -08:00
Jeff Daily
b2aec41a83
[ROCm] enable hipGraph (#18382)
This ports the cudaGraph support from the CUDA EP to the ROCM EP's
hipGraph.
2024-01-23 11:17:04 +08:00
snadampal
77da2ef278
[aarch64] Add Sbgemm kernel to accelerate fp32 tensor matmul with bfloat16 (#17031)
### Description
This PR adds SbgemmKernel for aarch64. This includes Sbegmm kernel to
implement matrix multiplication with bfloat16 SIMD instructions (bfmmla)
and MatMul operator changes to invoke the Sbgemm kernel. To enable
Sbgemm kernel, set the following session option:
"kOrtSessionOptionsGemmFastMathMode"

The PR also adds new test cases for mlas and ort.

### Motivation and Context

This is to improve MatMul performance on aarch64 platform.
I have run the below benchmarking script (bert , roberta and gpt2 model
inference) on AWS Graviton3 based c7g.4xl instance and observed 1.2x
-1.76x performance improvement compared to sgemm (fp32) kernel
performance.

```
cd onnxruntime/python/tools/transformers
python3 benchmark.py
```
And the unit test precision results are matching to sgemm kernel
results.
`./build.sh --config RelWithDebInfo --build_shared_lib --parallel
--compile_no_warning_as_error --skip_submodule_sync `
2024-01-22 14:43:06 -08:00
Edward Chen
c8ce83967e
Download protoc for all Apple host builds, remove protoc build from iOS packaging pipeline. (#19209) 2024-01-19 15:30:09 -08:00
luoyu-intel
459c750b03
Update x64 template kernel library for 'sqnbitgemm' (#19016)
### Description
<!-- Describe your changes. -->
1. Make JBLAS codes an external module of ORT.
2. Move q4 gemm code to contrib_ops.
3. Update template kernel library to v0.1 release.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
We found that the current LLM model performance is far below our
expectations. Here is some performance data collected on Mistral-7B
model with Xeon-8480:
8 threads | prompt length=32 past_len=32 | prompt length=1   past_len=32
-- | -- | --
ORT-main | 1220ms | 263ms
Neural-speed | 564ms | 87ms
ORT-this PR|597ms|120ms

Although `Neural-speed` and `ORT-this PR` use the same int4 kernel code,
there is a 33ms(87ms vs. 120ms) latency gap between the two frameworks.
Through some statistics analysis, the summary latency of `MatMulNBits`
is 86.7ms
The summary latency of all int4 GEMMs in `Neural-speed` is 84.8ms. So
other OPs introduce an extra 30ms latency.

The performance of MatMulNBits in this PR meets our expectations.

### Remain Issues
1. For hybrid CPUs, like core 12900K, the ONNXRuntime thread pool uses
TaskGranularityFactor to scale its number of threads. This is not
expected in our code design. It may slow down the hybrid CPU performance
by 30~40%.
2. Prepack uses a single thread which is very slow to init a session.
3. MatMulNBits with zero points will fall through to COMP_FP32 even
accuracy_level=4. Our COMP_INT8 IGemmCore with zero points process is
not optimized for now. It will be updated in the future. So, for an int4
model with zero points, whether the accuracy_level is 0 or 4 will be no
difference.
2024-01-18 13:16:34 -08:00
Maximilian Müller
bc219ed553
[TensorRT EP] Enable a minimal CUDA EP compilation without kernels (#19052)
Adresses https://github.com/microsoft/onnxruntime/issues/18542.
I followed the advice given by @RyanUnderhill
[here](https://github.com/microsoft/onnxruntime/pull/18731#issuecomment-1848261925)
and went with a minimal CUDA EP for now.
2024-01-17 11:33:34 -08:00
Wanming Lin
07d3aed3aa
[WebNN EP] Fixed build issue with disable_rtti (#19173)
Previously building webnn ep with --disable_rtti will throw
unboundTypeError since unbound type names are illegal with RTTI disabled
in Embind API, we can fix it by adding a
-DEMSCRIPTEN_HAS_UNBOUND_TYPE_NAMES=0 flag.
2024-01-16 21:35:13 -08:00
Changming Sun
e2e488d6f8
Revert "iOS packaging pipeline stability" (#19135)
Reverts microsoft/onnxruntime#19097 because it broken Android CI
pipeline.
2024-01-16 09:18:35 -08:00
Jeff Bloomfield
8d4369b77e
Update DirectML nuget version to 1.13.1 (#19122)
### Description
Update DML version to 1.13.1



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-01-15 19:04:41 -08:00
Edward Chen
e1e45901e2
iOS packaging pipeline stability (#19097)
- Remove protoc build step which sometimes times out. Download protoc instead.
- Use macOS-12 image in the set variables stage. It seems more stable.
2024-01-13 19:27:44 -08:00
Edward Chen
150c4cb8fe
[MLAS AArch64] SQNBitGemm CompInt8 kernel (#18953)
Implement ARM NEON SQNBitGemm kernel that first block quantizes A to int8 and then does int8 multiplication.
2024-01-12 17:58:08 -08:00
Guenther Schmuelling
96dbac6e4b
update to emsdk-3.1.51 (#18844) 2024-01-12 16:04:33 -08:00
Preetha Veeramalai
c340bf08f6
Openvino EP code changes for 1.17 update (#19023)
### Description
Introduce AppendExecutionProvider_OpenVINO_V2 API and support for OV
2023.3.


### Context

- The API is added to facilitate customers in using published official
Microsoft onnxruntime libraries with OVEP libraries.
- Add support for OpenVINO 2023.3 official release.
- Extend operator coverage 
- GH fixes

---------

Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
2024-01-12 13:20:51 -08:00
Changming Sun
0e8d4c3d21
Enable Address Sanitizer in CI (#19073)
### Description
1. Add two build jobs for enabling Address Sanitizer in CI. One for
Windows CPU, One for Linux CPU.
2. Set default compiler flags/linker flags in build.py for normal
Windows/Linux/MacOS build. This can help control compiler flags in a
more centralized way.
3. All Windows binaries in our official packages will be built with
"/PROFILE" flag. Symbols of onnxruntime.dll can be found at [Microsoft
public symbol
server](https://learn.microsoft.com/en-us/windows-hardware/drivers/debugger/microsoft-public-symbols).

Limitations:
1. On Linux Address Sanitizer ignores RPATH settings in ELF binaries.
Therefore once Address Sanitizer is enabled, before running tests we
need to manually set LD_LIBRARY_PATH properly otherwise
libonnxruntime.so may not be able to find custom ops and shared EPs.
4. On Linux we also need to set LD_PRELOAD before running some tests(if
the main executable, like python, is not built with address sanitizer.
On Windows we do not need to.
5. On Windows before running python tests we should manually copy
address sanitizer DLL to the onnxruntime/capi directory, because python
3.8 and above has enabled "Safe DLL Search Mode" that wouldn't use the
information provided by PATH env.
6. On Linux Address Sanitizer found a lot of memory leaks from our
python binding code. Therefore right now we cannot enable Address
Sanitizer when building ONNX Runtime with python binding.
7. Address Sanitizer itself uses a lot of memory address space and
delays memory deallocations, which is easy to cause OOM issues in 32-bit
applications. We cannot run all the tests in onnxruntime_test_all in
32-bit mode with Address Sanitizer due to this reason. However, we still
can run individual tests in such a way. We just cannot run all of them
in one process.

### Motivation and Context
To catch memory issues.
2024-01-12 07:24:40 -08:00
PeixuanZuo
5f3113ecd6
[ROCm] Fix hipify error: fast_divmod.h: No such file or directory (#19060)
Fix error:
```
[ 48%] Built target onnxruntime_optimizer

In file included from /onnxruntime_src/onnxruntime/core/providers/rocm/rocm_stream_handle.cc:5:
/onnxruntime_src/onnxruntime/core/providers/rocm/rocm_common.h:11:10: fatal error: core/providers/rocm/shared_inc/fast_divmod.h: No such file or directory
   11 | #include "core/providers/rocm/shared_inc/fast_divmod.h"
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
```

This error is due to onnxruntime_optimizer missing dependencies on
hipify generated files.
2024-01-10 14:49:19 +08:00
Milos Puzovic
37ac9d391c
Enable Arm Compute Library 23.08 (#17672)
### Description

This PR enables onnxruntime to build with the most recent release of Arm
Compute Library

### Motivation and Context

The latest version of Arm Compute Library that onnxruntime can build is
20.02 which is more than 3 years old.
2024-01-09 14:10:25 -08:00
Changming Sun
68c29ece23
In a Linux or Android build check if the compiler support bfloat16 and float16 (#18813)
### Description
Restrict clang version because we have an upcoming change that requires
clang version >=16 , which will mainly affect Android build.
2024-01-08 19:46:33 -08:00
Jeff Bloomfield
55a669409a
Merge pull request #18983 from microsoft/WindowsAI
Merge WindowsAI to main
2024-01-04 17:21:19 -08:00
Wei-Sheng Chin
658e30eb33
Remove DORT since it's in PyTorch main now (#18996)
Main code are removed and tests are modified to use DORT directly from
PyTorch.
2024-01-04 12:59:47 -08:00
Xavier Dupré
889b1ef2d1
Fix schema type constraint for custom operators (#17497)
### Description
onnxruntime may raise an error "type inference failed" but when a custom
operator sets IsHomogeneous to false in its schema. This change make
sure that TypeInferenceFunction and schema type constraints are aligned
to prevent that from happening.

---------

Co-authored-by: Xavier Dupre <xadupre@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>
2024-01-04 20:27:46 +01:00
Yulong Wang
b18abaaa2c
[js/web] wait for threadpool initialization (#18952)
### Description

a replacement of #18683. try to resolve #18689.

By specifying "-s PTHREAD_POOL_SIZE" flag in emscripten, it forces the
threadpool to initialize before the webassembly instance is available.
2024-01-04 08:06:55 -08:00
Sheil Kumar
107d7492b9 [DirectML EP] Add DML EP registration for Col2Im (#17786)
### Description
[DirectML EP] Add DML EP registration for Col2Im operator

### Motivation and Context
Add Col2Im support for opset 18.
This operator is implemented as the DirectML Fold operator.

---------

Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
Co-authored-by: Dwayne Robinson <dwayner@microsoft.com>
2024-01-03 16:13:14 -08:00
Jeff Bloomfield
c3d96a7b35 Update DML version to 1.13.0 (#18978)
Update DML nuget version to 1.13.0
2024-01-03 16:09:55 -08:00
Sheil Kumar
dbb8680bdc
Delay load dxcore.dll in addition to ext-ms-win-dxcore-l1-1-0.dll (#18913)
Delay load dxcore.dll in addition to ext-ms-win-dxcore-l1-1-0.dll

Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
2023-12-26 12:33:42 -08:00
pengwa
37f743680a
Fix build when flash attention and memory efficient attention are disabled (#18761)
### Fix build when flash attention and memory efficient attention are
disabled

On a customer env with lower version of CUDA < 11.6. Both flash
attention and memory efficient attention is turned OFF according to
e8f33b54ba/cmake/CMakeLists.txt (L701).
So
e8f33b54ba/cmake/external/cutlass.cmake (L1)
condition check return false. No cutlass lib is built.

```
Turn off flash attention since CUDA compiler version < 11.6
```

While, the kernels in
https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/contrib_ops/cuda/moe/ft_moe
are depending on cutass for its build, so we get error like this:

```
[ 77%] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_fp16_fp16.cu.o
In file included from /tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_fp16_fp16.cu:17:
/tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_template.h:23:10: fatal error: cutlass/array.h: No such file or directory
   23 | #include "cutlass/array.h"
      |          ^~~~~~~~~~~~~~~~~
compilation terminated.
In file included from /tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_fp16_fp16.cu:17:
/tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_template.h:23:10: fatal error: cutlass/array.h: No such file or directory
   23 | #include "cutlass/array.h"
      |          ^~~~~~~~~~~~~~~~~
compilation terminated.
In file included from /tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_fp16_fp16.cu:17:
/tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_template.h:23:10: fatal error: cutlass/array.h: No such file or directory
   23 | #include "cutlass/array.h"
      |          ^~~~~~~~~~~~~~~~~
compilation terminated.
In file included from /tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_fp16_fp16.cu:17:
/tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_template.h:23:10: fatal error: cutlass/array.h: No such file or directory
   23 | #include "cutlass/array.h"
      |          ^~~~~~~~~~~~~~~~~
compilation terminated.
fatal   : Could not open input file /tmp/tmpxft_00044da3_00000000-11_moe_gemm_kernels_fp16_fp16.compute_60.cpp1.ii
make[2]: *** [CMakeFiles/onnxruntime_providers_cuda.dir/build.make:6290: CMakeFiles/onnxruntime_providers_cuda.dir/tmp/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_fp16_fp16.cu.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [CMakeFiles/Makefile2:2210: CMakeFiles/onnxruntime_providers_cuda.dir/all] Error 2
make: *** [Makefile:166: all] Error 2
Traceback (most recent call last):
  File "/tmp/onnxruntime/tools/ci_build/build.py", line 2746, in <module>
    sys.exit(main())
  File "/tmp/onnxruntime/tools/ci_build/build.py", line 2639, in main
    build_targets(args, cmake_path, build_dir, configs, num_parallel_jobs, args.target)
  File "/tmp/onnxruntime/tools/ci_build/build.py", line 1527, in build_targets
    run_subprocess(cmd_args, env=env)
  File "/tmp/onnxruntime/tools/ci_build/build.py", line 824, in run_subprocess
    return run(*args, cwd=cwd, capture_stdout=capture_stdout, shell=shell, env=my_env)
  File "/tmp/onnxruntime/tools/python/util/run.py", line 49, in run
    completed_process = subprocess.run(
  File "/opt/conda/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
```


### Motivation and Context

To summarize, there are two cases we will have build failure for Linux
CUDA build:
1. User use cuda version < 11.6
2. User disabled Flash attention and memory efficient attention
explictly with onnxruntime_USE_FLASH_ATTENTION and
onnxruntime_USE_MEMORY_EFFICIENT_ATTENTION
2023-12-26 08:57:58 +08:00
luoyu-intel
5f00bc9931
Integrate high-performance x64 gemm library to MLAS (#17669)
### Description
Improve MLAS to support high-performance x64 INT4 kernels



### Motivation and Context
1. improve LLM inference performance on Intel CPUs.
2. support more 4bit quantization types: nf4, fp4
3. support dynamic block size: block size aligned with kernel's tiling
size(e.g. 4 for VNNI kernel), per channel on N dimension
4. support most Intel ISAs: avx2, avx_vnni, avx512f, avx512_vnni,
amx_bf16, amx_int8, avx512_fp16
5. support MatMulNBits' data format

### Tasks
- [x] support block_size: 32, 128, -1(per channel)
- [x] get weight pack size without memory allocation
- [x] use ort's thread pool for parallelism
- [x] support ISAs: avx2, avx512f, avx_vnni, avx512_vnni, amx_int8

### Benchmark
Ubuntu 20.22 + Intel(R) Xeon(R) Platinum 8480+ 56 cores

Benchmark | Time | CPU | Iterations
-- | -- | -- | --
Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:4096/Threads:56/real_time | 47613
| 47401 | 12970
Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:4096/Threads:56/real_time |
6347792 | 6317562 | 109
Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:4096/Threads:56/real_time |
11814014 | 11757847 | 59
Q4GEMM_Jblas/Q4G128SymInt8/M:1/N:4096/K:4096/Threads:56/real_time |
50222 | 50031 | 13759
Q4GEMM_Jblas/Q4G128SymInt8/M:1024/N:4096/K:4096/Threads:56/real_time |
2038222 | 2028743 | 341
Q4GEMM_Jblas/Q4G128SymInt8/M:2048/N:4096/K:4096/Threads:56/real_time |
3792832 | 3774485 | 191
Q4GEMM_Jblas/Q4GPerNSymInt8/M:1/N:4096/K:4096/Threads:56/real_time |
58717 | 58501 | 11467
Q4GEMM_Jblas/Q4GPerNSymInt8/M:1024/N:4096/K:4096/Threads:56/real_time |
1360846 | 1354598 | 543
Q4GEMM_Jblas/Q4GPerNSymInt8/M:2048/N:4096/K:4096/Threads:56/real_time |
2564232 | 2551365 | 266
Q4GEMM_Jblas/Q4G32SymFp32/M:1/N:4096/K:4096/Threads:56/real_time | 57929
| 57694 | 12047
Q4GEMM_Jblas/Q4G32SymFp32/M:1024/N:4096/K:4096/Threads:56/real_time |
5495330 | 5465810 | 126
Q4GEMM_Jblas/Q4G32SymFp32/M:2048/N:4096/K:4096/Threads:56/real_time |
10676240 | 10617817 | 66
Q4GEMM_Jblas/Q4G128SymFp32/M:1/N:4096/K:4096/Threads:56/real_time |
68305 | 68047 | 10026
Q4GEMM_Jblas/Q4G128SymFp32/M:1024/N:4096/K:4096/Threads:56/real_time |
5504862 | 5476215 | 126
Q4GEMM_Jblas/Q4G128SymFp32/M:2048/N:4096/K:4096/Threads:56/real_time |
11758623 | 11697337 | 66
Q4GEMM_Jblas/Q4GPerNSymFp32/M:1/N:4096/K:4096/Threads:56/real_time |
67713 | 67451 | 10298
Q4GEMM_Jblas/Q4GPerNSymFp32/M:1024/N:4096/K:4096/Threads:56/real_time |
5508325 | 5480237 | 126
Q4GEMM_Jblas/Q4GPerNSymFp32/M:2048/N:4096/K:4096/Threads:56/real_time |
10738528 | 10681656 | 64
Q4GEMM_Jblas/Q4G32AsymFp32/M:1/N:4096/K:4096/Threads:56/real_time |
60708 | 60486 | 11321
Q4GEMM_Jblas/Q4G32AsymFp32/M:1024/N:4096/K:4096/Threads:56/real_time |
5523784 | 5495736 | 126
Q4GEMM_Jblas/Q4G32AsymFp32/M:2048/N:4096/K:4096/Threads:56/real_time |
10829633 | 10772161 | 67


Reference:

Benchmark | Time | CPU | Iterations
-- | -- | -- | --
Q4GEMM/Q4Sym/M:1/N:4096/K:4096/Threads:56/real_time | 53088 | 52911 |
13364
Q4GEMM/Q4Sym/M:1024/N:4096/K:4096/Threads:56/real_time | 6268981 |
6230335 | 110
Q4GEMM/Q4Sym/M:2048/N:4096/K:4096/Threads:56/real_time | 11701237 |
11632339 | 59

Win11+12900K 8 cores:
Benchmark | Time | CPU | Iterations
-- | -- | -- | --
Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:4096/Threads:8/real_time | 215976
| 211295 | 2884
Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:4096/Threads:8/real_time |
60960590 | 60937500 | 10
Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:4096/Threads:8/real_time |
1.18E+08 | 1.19E+08 | 5
Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:11008/K:4096/Threads:8/real_time |
470377 | 453059 | 1414
Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:11008/K:4096/Threads:8/real_time |
1.54E+08 | 1.53E+08 | 5
Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:11008/K:4096/Threads:8/real_time |
3.18E+08 | 3.13E+08 | 2
Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:11008/Threads:8/real_time |
569072 | 559398 | 1229
Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:11008/Threads:8/real_time |
1.54E+08 | 1.52E+08 | 4
Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:11008/Threads:8/real_time |
3.22E+08 | 3.28E+08 | 2
Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:11008/K:11008/Threads:8/real_time |
1486055 | 1473325 | 403
Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:11008/K:11008/Threads:8/real_time |
4.14E+08 | 4.14E+08 | 2
Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:11008/K:11008/Threads:8/real_time |
8.88E+08 | 8.59E+08 | 1

---------

Signed-off-by: Mengni Wang <mengni.wang@intel.com>
Co-authored-by: Mengni Wang <mengni.wang@intel.com>
2023-12-19 09:36:31 -08:00
Changming Sun
f52668cc68
Disable mlas unit test in ARM64EC build (#18747)
### Description
Disable mlas unit test in ARM64EC build because the program has some
link errors. We will fix the errors later.
This PR only impacts Windows ARM64EC build. It has no impact on the
existing build pipelines.
2023-12-15 09:17:47 -08:00
Changming Sun
d795fc636c
FIX: Our cmake script didn't check googletest's hash (#18826) 2023-12-15 08:48:15 -08:00
Changming Sun
cbad4fe49b
Update absl and googletest (#18827)
### Description
Update absl and googletest to their latest version to include some cmake
changes:
1. A googletest's cmake change that will allow using external absl and
re2.
2. Nullability enhancements that will allow our clang-based static
analysis detecting many kinds of null pointer errors.



### Motivation and Context
To fix a C4744 link warning in our Windows pipelines.
```
LINK : warning C4744: 'static char const absl::lts_20230802::base_internal::FastTypeTag<bool>::dummy_var' has different type in 'd:\a\_work\_temp\abseil_cpp\abseil-cpp-20230802.0\absl\flags\parse.cc' and 'd:\a\_work\1\b\relwithdebinfo\_deps\googletest-src\googletest\src\gtest-all.cc': 'signed char' and 'unsigned char' [D:\a\_work\1\b\RelWithDebInfo\onnxruntime_mlas_test.vcxproj]
LINK : warning C4744: 'static char const absl::lts_20230802::base_internal::FastTypeTag<class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> > >::dummy_var' has different type in 'd:\a\_work\_temp\abseil_cpp\abseil-cpp-20230802.0\absl\flags\parse.cc' and 'd:\a\_work\1\b\relwithdebinfo\_deps\googletest-src\googletest\src\gtest-all.cc': 'signed char' and 'unsigned char' [D:\a\_work\1\b\RelWithDebInfo\onnxruntime_mlas_test.vcxproj]
LINK : warning C4744: 'static char const absl::lts_20230802::base_internal::FastTypeTag<class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> > >::dummy_var' has different type in 'd:\a\_work\_temp\abseil_cpp\abseil-cpp-20230802.0\absl\flags\internal\usage.cc' and 'd:\a\_work\1\b\relwithdebinfo\_deps\googletest-src\googletest\src\gtest-all.cc': 'signed char' and 'unsigned char' [D:\a\_work\1\b\RelWithDebInfo\onnxruntime_mlas_test.vcxproj]
LINK : warning C4744: 'static char const absl::lts_20230802::base_internal::FastTypeTag<bool>::dummy_var' has different type in 'd:\a\_work\_temp\abseil_cpp\abseil-cpp-20230802.0\absl\flags\internal\flag.cc' and 'd:\a\_work\1\b\relwithdebinfo\_deps\googletest-src\googletest\src\gtest-all.cc': 'signed char' and 'unsigned char' [D:\a\_work\1\b\RelWithDebInfo\onnxruntime_mlas_test.vcxproj]
LINK : warning C4744: 'static char const absl::lts_20230802::base_internal::FastTypeTag<class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> > >::dummy_var' has different type in 'd:\a\_work\_temp\abseil_cpp\abseil-cpp-20230802.0\absl\flags\internal\flag.cc' and 'd:\a\_work\1\b\relwithdebinfo\_deps\googletest-src\googletest\src\gtest-all.cc': 'signed char' and 'unsigned char' [D:\a\_work\1\b\RelWithDebInfo\onnxruntime_mlas_test.vcxproj]
LINK : warning C4744: 'static char const absl::lts_20230802::base_internal::FastTypeTag<int>::dummy_var' has different type in 'd:\a\_work\_temp\abseil_cpp\abseil-cpp-20230802.0\absl\flags\internal\flag.cc' and 'd:\a\_work\1\b\relwithdebinfo\_deps\googletest-src\googletest\src\gtest-all.cc': 'signed char' and 'unsigned char' [D:\a\_work\1\b\RelWithDebInfo\onnxruntime_mlas_test.vcxproj]
```
2023-12-14 16:15:07 -08:00
Yueqing Zhang
b42d4b8ea6
[VitisAI] 1. api compatbile 2. dynamic load onnx (#18470)
### Description
<!-- Describe your changes. -->

1. Add a backward-compatible API for compiling model.
2. Run-time load vitisai-ep.dll


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: Yueqing Zhang <yueqingz@amd.com>
Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>
2023-12-14 14:43:41 -08:00
Suryaprakash Shanmugam
0723dcb8b5
OpenVINO Execution Provider with 2023.2 support (#18596)
- Add support for OpenVINO 2023.2
- num_of_threads provider option is mapped to the CPU device property
inference_num_threads of the CPU plugin, so users can control the
#threads used for inference by the CPU
- Logging in Debug mode now includes the runtime properties set for
devices
- Fix issue in using external weights through OpenVINO

---------

Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
2023-12-13 15:56:43 -08:00
Adrian Lizarraga
81796a3081
[QNN EP Quantization] Add fusion preprocessing to QNN quantization (#18719)
### Description
- Adds graph fusions to preprocessing step that can be called before
creating a QDQ model for QNN EP.
- Fuse Erf sequence to Gelu (adapted from
[optimizer.py](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/fusion_gelu.py)).
Required by QNN EP.
- Fuse ReduceMean sequence to LayerNormaliation (adapted from
[optimizer.py](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/fusion_layernorm.py)).
Not required by QNN EP.
- Fuse ReduceL2 sequence to LpNormalization (new, specific to QNN EP).
Required by QNN EP.

Example use:
```python3
from quantization.execution_providers.qnn import get_qnn_qdq_config, qnn_preprocess_model

# Added by this PR:
model_updated = qnn_preprocess_model("model.fp32.onnx", "model.fp32.preprocessed.onnx", fuse_layernorm=True)
model_to_quantize = "model.fp32.preprocessed.onnx" if model_updated else "model.fp32.onnx"

# Quantize model ...
qnn_config = get_qnn_qdq_config(model_to_quantize, data_reader, activation_type=QuantType.QUInt16)
quantize(model_to_quantize, "model.qdq.onnx", qnn_config)
```
### Motivation and Context
Allow more models to be quantized for use with QNN EP

---------

Signed-off-by: adrianlizarraga <adlizarraga@microsoft.com>
2023-12-12 08:43:04 -08:00
Changming Sun
c7799d7058
Build fixes for Windows ARM32 desktop build (#18752)
### Description
Fix a link error:

```
onnxruntime_common.lib(cpuid_info.obj) : error LNK2019: unresolved external symbol __imp_RegGetValueA referenced in function "privat
e: void __cdecl onnxruntime::CPUIDInfo::ArmWindowsInit(void)" (?ArmWindowsInit@CPUIDInfo@onnxruntime@@AAAXXZ) [C:\Users\snnn\src\on
nxruntime\build\ARM32\RelWithDebInfo\onnx_test_runner.vcxproj]
onnxruntime_common.lib(telemetry.cc.obj) : error LNK2019: unresolved external symbol __imp_EventRegister referenced in function "pub
lic: __cdecl onnxruntime::WindowsTelemetry::WindowsTelemetry(void)" (??0WindowsTelemetry@onnxruntime@@QAA@XZ) [C:\Users\snnn\src\on
nxruntime\build\ARM32\RelWithDebInfo\onnx_test_runner.vcxproj]
onnxruntime_common.lib(telemetry.cc.obj) : error LNK2019: unresolved external symbol __imp_EventUnregister referenced in function "p
ublic: virtual __cdecl onnxruntime::WindowsTelemetry::~WindowsTelemetry(void)" (??1WindowsTelemetry@onnxruntime@@UAA@XZ) [C:\Users\y
ilyu\src\onnxruntime\build\ARM32\RelWithDebInfo\onnx_test_runner.vcxproj]
onnxruntime_common.lib(telemetry.cc.obj) : error LNK2019: unresolved external symbol __imp_EventSetInformation referenced in functio
n "public: __cdecl onnxruntime::WindowsTelemetry::WindowsTelemetry(void)" (??0WindowsTelemetry@onnxruntime@@QAA@XZ) [C:\Users\snnn\
src\onnxruntime\build\ARM32\RelWithDebInfo\onnx_test_runner.vcxproj]
onnxruntime_common.lib(telemetry.cc.obj) : error LNK2019: unresolved external symbol __imp_EventWriteTransfer referenced in function
_tlgWriteTransfer_EventWriteTransfer [C:\Users\snnn\src\onnxruntime\build\ARM32\RelWithDebInfo\onnx_test_runner.vcxproj]
C:\Users\snnn\src\onnxruntime\build\ARM32\RelWithDebInfo\RelWithDebInfo\onnx_test_runner.exe : fatal error LNK1120: 5 unresolved ex
ternals [C:\Users\snnn\src\onnxruntime\build\ARM32\RelWithDebInfo\onnx_test_runner.vcxproj]

```
2023-12-08 12:45:06 -08:00
Changming Sun
bf33919afb
Update absl and gtest to fix an ARM64EC build error (#18735)
### Description
Update absl and gtest to fix an ARM64EC build error


### Motivation and Context
We need to get an important fix into ORT.
The fix is:

8028a87c96
2023-12-07 15:55:17 -08:00
junchao-loongson
4abec9749e
[mlas] add loongarch lsx and lasx optimize code (#17937)
### Description
Hello we(@lixing-star) are the developers of loongson team.

We add 128 (lsx), 256 (lasx) vector optimization code for the loongarch
architecture


[100% tests passed, 0 tests failed out of
7](https://cloud.a-boat.cn:2021/api/public/dl/6831z1Bi?inline=true)

### Development Environments1
```
CPU: 
    Loongson-3C5000L
uname -a:  
    Linux localhost.localdomain 4.19.190-6.4.lns8.loongarch64 #1 SMP Thu Jul 14 12:08:04 CST 2022 loongarch64 loongarch64 loongarch64 GNU/Linux

```
### LonngArch Documents
- [LoongArch Reference Manual - Volume 1: Basic Architecture: This
manual describes the basic part of the LoongArch
architecture.](https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-EN.html)
- [LoongArch ELF psABI: This manual describes the LoongArch ELF
psABI.](https://loongson.github.io/LoongArch-Documentation/LoongArch-ELF-ABI-EN.html)
-
[more](https://loongson.github.io/LoongArch-Documentation/README-EN.html)
2023-12-07 11:15:59 -08:00
moyo1997
9479ba525b
Build onnxruntime.dll as arm64x (#18633)
Build onnxruntime.dll as arm64x

Added a .cmake file to generate a link repro of the onnxruntime.dll
during arm64 build. This provides us a directory containing all the
arm64 objs, def file and libs to link to when it is time to building
arm64x onnxruntime.dll during the arm64ec build by passing the
/machine:arm64x flag to the linker along with the arm64 artifacts.

If other dlls wanted to be built as x, setting the ARM64X_TARGETS
variable in the toplevel cmakelists.txt to include these other targets
is all that will be needed.

Added build_arm64x.bat as a wrapper for the multiple (rm64, then
arm64ec) cmake calls needed to build as arm64x.

AB#22533
2023-12-06 16:49:00 -08:00
Ye Wang
c012e41f93
MoE with Expert Slicing (#18565)
### Description
<!-- Describe your changes. -->

Registered Sharded MoE op under contrib_op/cuda/collective with expert
slicing. The broadcast process happens just before adding second bias(if
has) and permutation undoing. Tensor slicing is planned but not included
in this PR.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-12-05 16:56:38 -08:00
Adrian Lizarraga
e066fca777
[Quantization] Tensor quant overrides and QNN EP quantization configuration (#18465)
### Description
#### 1. Adds `TensorQuantOverrides` extra option
Allows specifying a dictionary of tensor-level quantization overrides:
```
TensorQuantOverrides = dictionary :
    Default is {}. Set tensor quantization overrides. The key is a tensor name and the value is a
    list of dictionaries. For per-tensor quantization, the list contains a single dictionary. For
    per-channel quantization, the list contains a dictionary for each channel in the tensor.
    Each dictionary contains optional overrides with the following keys and values.
          'quant_type' = QuantType : The tensor's quantization data type.
          'scale' =  Float         : The scale value to use. Must also specify `zero_point` if set.
          'zero_point' = Int       : The zero-point value to use. Must also specify `scale` is set.
          'symmetric' = Bool       : If the tensor should use symmetric quantization. Invalid if also
                                     set `scale` or `zero_point`.
          'reduce_range' = Bool    : If the quantization range should be reduced. Invalid if also
                                     set `scale` or `zero_point`.
          'rmax' = Float           : Override the maximum real tensor value in calibration data.
                                     Invalid if also set `scale` or `zero_point`.
          'rmin' = Float           : Override the minimum real tensor value in calibration data.
                                     Invalid if also set `scale` or `zero_point`.
```

- All of the options are optional.
- Some combinations are invalid.
- Ex: `rmax` and `rmin` are unnecessary if the `zero_point` and `scale`
are also specified.

Example for per-tensor quantization overrides:
```Python3
extra_options = {
    "TensorQuantOverrides": {
        "SIG_OUT": [{"scale": 1.0, "zero_point": 127}],
        "WGT": [{"quant_type": quantization.QuantType.QInt8, "symmetric": True, "reduce_range": True}],
        "BIAS": [{"quant_type": quantization.QuantType.QInt8, "symmetric": True, "reduce_range": True}],
    },
}
```

Example for per-channel quantization overrides (Conv weight and bias):
```Python3
extra_options = {
    "TensorQuantOverrides": {
        "WGT": [
            {
                "quant_type": quantization.QuantType.QUInt8,
                "rmin": 0.0,
                "rmax": 2.5,
                "reduce_range": True,
            },
            {
                "quant_type": quantization.QuantType.QUInt8,
                "rmin": 0.2,
                "rmax": 2.55,
                "reduce_range": False,
            },
        ],
        "BIAS": [
            {"zero_point": 0, "scale": 0.000621},
            {"zero_point": 0, "scale": 0.23},
        ],
    },
}
```

#### 2. Adds utilities to get the default QDQ configs for QNN EP
Added a `quantization.execution_providers.qnn.get_qnn_qdq_config` method
that inspects the model and returns suitable quantization
configurations.

Example usage:
```python3
from quantization import quantize, QuantType
from quantization.execution_providers.qnn import get_qnn_qdq_config

qnn_config = get_qnn_qdq_config(input_model_path,
                                data_reader,
                                activation_type=QuantType.QUInt16,
                                weight_type=QuantType.QUInt8)
                                
quantize(input_model_path,
         output_model_path,
         qnn_config)
```

### Motivation and Context
Make it possible to create more QDQ models that run on QNN EP.

---------

Signed-off-by: adrianlizarraga <adlizarraga@microsoft.com>
2023-12-04 17:54:58 -08:00
snadampal
05a9c95764
[DNNL] add Arm Compute Library (ACL) backend for dnnl execution provider (#15847)
Add ACL as the DNNL runtime option for aarch64 platforms. Update
makefile and the python wheel build script.

### Description
<!-- Describe your changes. -->
Add ACL as the DNNL runtime option for aarch64 platforms. Update
makefile and the python wheel build script.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This is to enable the optimized ACL gemm kernels for dnnl execution
provider on aarch64 platform.
2023-12-01 09:16:44 -08:00
George Wu
5c67a00d8e
Revert "remove full protobuf requirement for tensorrt ep" (#18626)
Reverts microsoft/onnxruntime#18413

there's a timing issue here. we eventually want to get this change
merged in but we need to update OSS onnx-tensorrt first.
2023-11-29 22:27:51 -08:00
Edward Chen
14a343441d
Fix Objective-C static analysis build (#18606)
- Patch abseil to fix a compile error about not finding `cxxabi.h`.
- Fix some static analysis warnings.
2023-11-28 17:14:20 -08:00
Rachel Guo
288b80d363
Add MacOS build to ORT C Pod (#18550)
### Description
<!-- Describe your changes. -->

As title.

1. Add macos build as an optionally enabled arch for pod and changes to
exsiting build_ios_framework/assemble_c_pod scripts.
2. Enable macos build arch in ios packaging pipeline (currently for
variants other than Mobile) and check the output artifacts are correct.
3. Write MacOS Test Target scheme in the test app and integrate into ios
packaging CI testing pipeline.
Currently the changes only apply to onnxruntime-c pod. as the original
request was from ORT SPM which consumes the onnxruntime-c pod only as
the binary target. TODO: could look into adding macos platform to objc
pod as well.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Enable macos platform support in cocoapods. and also potentially produce
binary target for enabling macos platform in SPM as well.

Replace https://github.com/microsoft/onnxruntime/pull/18334

---------

Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-11-28 10:11:53 -08:00
Chen Fu
05046e5452
Adding unit test for sm80 prepack (#18514)
### Description
Prepacking code for block q4 x fp16 GEMM cuda kernel, for SM80 hardware


### Motivation and Context
Preparing for addition of Q4 x FP16 GEMM kernel on Nvidia Ampere GPUs.
This kernel requires sophisticated quantized weight rearrangement to
speedup loading data to tensor-core. To facilitate the addition, this
change includes the following:

1. matrix_layout.h A new layout lib that facilitate iterating matrix
elements and tiles that balance memory safety and performance.
2. prepack_sm80.h Code for rearranging quantized weight, scales and
offsets (aka. prepacking)
3. blkq4_fp16_sm80_prepack_test.cc Unit tests that explicitly test the
memory safety and correctness of the prepacking code.

Currently the prepacking code runs on CPU with single threaded code. We
run this on CPU in order to minimize GPU memory fragmentation. On the
other hand, hopefully we get around to parallelize this part of the
code. Should be straight forward with the unit tests in place.
2023-11-28 10:01:09 -08:00
Sheil Kumar
0b7048e7d6
Update winml to use #cores - #soc cores by Default as the number of intraopthreads (#18384)
Update winml to use #cores - #soc cores by Default as the number of
intraopthreads

---------

Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
2023-11-28 09:26:48 -08:00
cloudhan
6f3c1f9dc9
[ROCm] Update ck for GemmFloat8 (#18487) 2023-11-23 12:06:19 +08:00
pengwa
43a5147e01
Memory optimization refactor and refinement (#17481)
### Memory optimization refactor and refinement

Currently memory optimizer runs graph transformations and print
recompute opportunities in INFO level, while ORT backend has many many
INFO level logs making users hard to find those information. So we are
looking for a Python binding API to retrieve the memory optimization
opportunities instead of depending on the MemoryOptimizer's default
logging.
Then we can print ORTModule feature statistics using this information. 
Also, with such an API, we can create an ORT session created, where
allocation plan is done, the analysis will consider buffer reuse as
well. This can void giving some recomputation subgraphs that are reusing
other subgraphs' output buffers.

Check
https://github.com/microsoft/onnxruntime/blob/pengwa/add_devinfo_level/docs/Memory_Optimizer.md
for the new flow using `MemoryOptimizer`.

This pull requests made following refactoring:
1. Print the log in ORTModule Python script, along with ORTModule
feature enabling stats. This is implemented by exposing an API
`get_serialized_ortmodule_memory_stat` to retrieve the memory
optimization opportunities.
2. We are analyzing memory optimization opportunities considering ORT
memory planning. This is done by firstly creating the execution graph
without enabling MemoryOptimizer, then we call
`execution_agent.get_serialized_ortmodule_memory_stat` which internally
will consider the session memory allocation planner when analyzing
memory optimization opportunity. As a direct result, the memory
optimization opportunities can show those stashed activations that are
reusing other buffers.
3. Move recompute analysis logic from memory_optimizer.h/cc to
recompute_analysis.h/cc.
4. Abstract optimization strategies for their own implementation. This
will make introducing new strategies (for example compression and
decompression ) easier.

New logging matrix (INFO Level), in WARNING level, the details will NOT
show.
```
2023-09-13 13:25:09,249 orttraining.rank-0 [WARNING] -
***** ONNX Runtime Training (ORTModule) is accelerating your model *****

ORTModule is enabled with following features ON/OFF for [training] mode:

  ATen Executor         :   ON    :   Dispatch ATen operators to ORT's ATen executor
  Cast Propagation      :   ON    :   Level 1 enabled
  Custom Function       :   ON    :   Support custom torch.autograd.Function export and execution
  Memory Optimizer      :   ON    :   RecomputeConfig: Reshape+Where+BiasSoftmax+:1:-1,Cast+:1:-1, ProbeLevel: 1, available configs:
                                      Config                                                      Freq    Saving(B)       Saving Symbolic(Bytes)
   - Plan 1             :   ON    :   Reshape+Where+BiasSoftmax+:1:-1                             5       671,088,640     640.0*inputs_input_ids_dim0*inputs_input_ids_dim1**2
   - Plan 2             :   ON    :   Cast+:1:-1                                                  6       402,587,648     inputs_input_ids_dim0*inputs_input_ids_dim1*(384.0*inputs_input_ids_dim1 - 64.0)
   - Plan 3             :   OFF   :   Reshape+Where+:1:-1                                         1       134,217,728     128.0*inputs_input_ids_dim0*inputs_input_ids_dim1**2
   - Plan 4             :   OFF   :   BiasSoftmax+:1:-1                                           1       134,086,656     128.0*inputs_input_ids_dim0*inputs_input_ids_dim1*(inputs_input_ids_dim1 - 1)
   - Plan 5             :   OFF   :   BiasGelu+:1:-1                                              6       125,808,640     inputs_input_ids_dim0*(122880.0*inputs_input_ids_dim1 - 20480.0)
   - Plan 6             :   OFF   :   FusedMatMul+:1:-1                                           6       125,808,640     inputs_input_ids_dim0*(122880.0*inputs_input_ids_dim1 - 20480.0)
   - Plan 7             :   OFF   :   FusedMatMul+Add+FusedMatMul+Add+Add+Add+:1:-1               5       26,214,400      25600.0*inputs_input_ids_dim0*inputs_input_ids_dim1
   - Plan 8             :   OFF   :   Add+:1:-1                                                   1       5,237,760       5120.0*inputs_input_ids_dim0*(inputs_input_ids_dim1 - 1)
   - Plan 9             :   OFF   :   Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+:1:-1         1       4,096           4.0*inputs_input_ids_dim0*inputs_input_ids_dim1
   - Plan 10            :   OFF   :   Cast+:2:-1                                                  1       2,048           2.0*inputs_input_ids_dim0*inputs_input_ids_dim1
  Compute Optimizer     :   ON    :   Enable/Disable with env ORTMODULE_ENABLE_COMPUTE_OPTIMIZER=1/0
   - FLOPReduction      :   ON    :   Reduce FLOPs by upstreaming shrinking-sized ops
  Auto Fallback         :   ON    :   Fallback to PyTorch when encountering unsupported ops
  TritonOp Enabled      :   OFF   :   ORT will switch to Triton for executing some ops to further accelerate training.
  ZeRO Stage3 Support   :   OFF   :   Enable/Disable with env ORTMODULE_ENABLE_ZERO_STAGE3=1/0

Total ORT initialization overhead is 10.73s where export takes 8.39s.
Other overhead details:  graph builder init takes 0.06s, runtime detection takes 0.01s, graph building takes 0.31s, session creation takes 1.96s

Versions: ONNX Runtime - 1.16.0+cu118, ONNX - 1.11.0

Note 1: use comma to enable multiple plans at the same time.
  export ORTMODULE_MEMORY_OPT_CONFIG=<plan1 config>,<plan2 config>,...
Note 2: saving is calculated based on the 1st batch symbolic dim values:
  inputs_input_ids_dim0=1,
  inputs_input_ids_dim1=1024,
  inputs_attention_mask_dim0=1,
  inputs_attention_mask_dim1=1024,
  inputs_labels_dim0=1,
  inputs_labels_dim1=1024,

************************************************************************
```

If DEVINFO level is enabled, then more details about the memory
optimizations are printed.
```

MemoryInsight Summary - User config: BiasGelu+:1:-1,Cast+:2:-1
==========================================================================================================================================
|Freq   | Memory Optimization Opportunities (Clustered by node-level activation patterns)                                                |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|3      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph FusedMatMul+Add+Reshape+                                                                    |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+Add+Reshape+:1:-1                         |
|       |  Stashed Activations:                                                                                                          |
|       |   - ReuseFreq :  Output 0(3),                                                                                                  |
|       |   - Output 0  : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 32 x 240 x ], byte/elem: 2, 100% saved                        |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Reshape+                                                                                    |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+:1:-1                                         |
|       |  Stashed Activations:                                                                                                          |
|       |   - ReuseFreq :  Output 0(2),                                                                                                  |
|       |   - Output 0  : [ x 2560 x ], byte/elem: 2, 100% saved                                                                         |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph FusedMatMul+                                                                                |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+:1:-1                                     |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 10240 x ], byte/elem: 2, 100% saved                           |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Cast+                                                                                       |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Cast+:1:-1                                            |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 2, 100% saved      |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Reshape+Where+BiasSoftmax+                                                                  |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Where+BiasSoftmax+:1:-1                       |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved      |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph BiasGelu+                                                                                   |
|       |  Status       : Enabled, requested count=-1, actual applied count=2                                                            |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 10240 x ], byte/elem: 2, 100% saved                           |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph FusedMatMul+Add+FusedMatMul+Add+Add+Add+                                                    |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+Add+FusedMatMul+Add+Add+Add+:1:-1         |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 2560 x ], byte/elem: 2, 100% saved                            |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Reshape+Where+                                                                              |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Where+:1:-1                                   |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved      |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph FusedMatMul+                                                                                |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+:1:-1                                     |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0*(inputs_input_ids_dim1 - 1) x 10240 x ], byte/elem: 2, 100% saved                       |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Cast+                                                                                       |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Cast+:1:-1                                            |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 - 1 x inputs_input_ids_dim1 x ], byte/elem: 2, 100% saved  |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+                                              |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+:1:-1   |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 1 x 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved                           |
|       |                                                                                                                                |
|       |>>Option 2     : RecomputeWithCompromise subgraph Cast+                                                                         |
|       |  Status       : Enabled, requested count=-1, actual applied count=1                                                            |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 1 x 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 50% saved                            |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph BiasSoftmax+                                                                                |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=BiasSoftmax+:1:-1                                     |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 - 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved  |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph BiasGelu+                                                                                   |
|       |  Status       : Enabled, requested count=-1, actual applied count=1                                                            |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0*(inputs_input_ids_dim1 - 1) x 10240 x ], byte/elem: 2, 100% saved                       |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Add+                                                                                        |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Add+:1:-1                                             |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0*(inputs_input_ids_dim1 - 1) x 2560 x ], byte/elem: 2, 100% saved                        |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
==========================================================================================================================================
Note: use comma as a separator for enabling more than one subgraphs.

************************************************************************

```


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-11-23 11:39:00 +08:00
Dmitri Smirnov
cc542024ce
Create edges with arg positons correctly accounting for non-existing args (#18462)
### Description
Truncate traling non-existing arguments.
  Make sure we do not skip on the non-existing arguments in the middle,
  because shape inferece relies on their proper position.
This also affects the argument position in the Edges that must be
properly rebuilt
  each time If node branch is inlined.
Make sure that when we rename Defs in subgraphs, new renamed defs are
created in those subgraphs
  instead of pointing to outer scope defs.
  Add unit test.

### Motivation and Context
This is a follow up for
https://github.com/microsoft/onnxruntime/pull/18105
Currently, the non-trailing arguments are simply ignored and the edges
are created
with potentially incorrect positions.
2023-11-20 14:49:09 -08:00
Akshay Sonawane
97cc40d75a
Add fusion patterns for conformer-transducer model (#18461)
### Description
Add conformer-transducer model type to optimizer. This PR adds pattern
matches for attention shown below:
Unfused attention:

![ct_unfused](https://github.com/microsoft/onnxruntime/assets/111780983/46c71ed8-67e0-4607-85b1-bcadba5a2956)

Fused attention:

![ct_fused](https://github.com/microsoft/onnxruntime/assets/111780983/fbb91c96-0d4b-4f0b-8674-1ae3b9b9a92e)
2023-11-18 23:39:04 -08:00
Ashwini Khade
02333293de
Removed all the deprecated python training code and related tests and utils (#18333)
### Description
Motivation for this PR is code cleanup.

1. Remove all deprecated python code related to orttrainer, old
checkpoint, related tests and utils
2. Cleanup orttraining_pybind_state.cc to remove all deprecated
bindings.
2023-11-17 18:19:21 -08:00
George Wu
d73073d491
remove full protobuf requirement for tensorrt ep (#18413)
tensorrt can work with protobuf lite.
2023-11-16 20:44:27 -08:00
Yulong Wang
6f9f653ada
[wasm] increase test max memory from 2G to 4G (#18459)
### Description
increase max memory from 2G to 4G for onnxruntime_test_all in
WebAssembly build.
2023-11-15 17:51:04 -08:00
Edward Chen
0a4d76d98b
MLAS AArch64 quantized int4 Gemm kernel (#18031)
- Implement MLAS function for quantized 4-bit int Gemm (Gemm with float A and quantized 4-bit int B) for ARM NEON. This is an initial implementation. Only the M=1 path (with M being number of rows of A and C) has any optimization attempted so far. More optimization to come in future PRs.

- Connect MatMulNBits contrib op to MLAS function.
2023-11-15 09:31:54 -08:00
Ye Wang
f9af94009b
onboard MoE (#18279)
### Description
<!-- Describe your changes. -->
1. Introduce MoE CUDA op to ORT based on FT implementation.
2. Upgrade cutlass to 3.1.0 to avoid some build failures on Windows.
Remove patch file for cutlass 3.0.0.
3. Sharded MoE implementation will come with another PR

limitation: __CUDA_ARCH__ >= 700


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-11-14 16:48:51 -08:00
PeixuanZuo
a62a500ae1
[ROCm] Update CK version (#17628)
update ck version
2023-11-13 15:43:38 -08:00
Scott McKay
8d298f6f78
Fix xnnpack compile error on arm32 (#18291)
### Description
<!-- Describe your changes. -->
Use different march flag to workaround what appears to be a clang issue.

See https://github.com/tensorflow/tensorflow/issues/59970 for links to
various relevant pieces of info/discussions.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-11-12 08:59:20 +10:00
Scott McKay
64c91d790b
Fix ability to use patch on Windows CI machines (#18356)
### Description
<!-- Describe your changes. -->
Add 32-bit patch binary and infra to fallback to it. The Azure devops
Windows CIs are missing patch.exe from their git install for some reason
so the default `find_package(Patch)` fails as that is where it expects
to find it.

Remove Eigen patch. Underlying issue was fixed in source 3 years ago by
c6c84ed961
and the patch command is invalid (args are for git apply not patch).

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Make usage of patch consistent across all CIs
Fix https://github.com/microsoft/onnxruntime/issues/15248
2023-11-11 07:32:14 +10:00
Bart Verhagen
87744e55fa
fix reference to Microsoft.GSL::GSL in CMake build scripts when enabling cuda (#17843)
### Description
Some CMake scripts reference Microsoft.GSL::GSL. Most of the time, the
GSL package that is found on the system is used. However, when cuda is
enabled, it is downloaded and patched. Most CMake scripts rely on the
first case and forget about the second. This patch makes the second case
behave like the first case.

### Motivation and Context

This is an issue that occurs 'in the wild'. For example, I had to patch
this to be able to enable the CUDA provider for the onnxruntime conan
package (see https://github.com/conan-io/conan-center-index/pull/20392).
2023-11-10 10:46:45 -08:00
Changming Sun
812532592e
Add a build validation for Linux ARM64 cross-compile (#18200)
### Description
1. Add a build validation for Linux ARM64/ARM32 cross-compile to catch
issues listed in #18195 .
2. Revert eigen's commit id back to what we had before. 


### Motivation and Context
To catch cross-compile issues.
Added a TODO item for fixing the compile warnings in Linux ARM32 build: AB#21639
2023-11-08 13:03:18 -08:00
Dmitri Smirnov
a37e6a503b
Update Abseil raw_flat_hash visualization (#18329)
### Description
<!-- Describe your changes. -->
Fix the broken pieces due to the latest   Abseil update.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
Make the debugging bearable.
2023-11-08 11:19:45 -08:00
Wei-Sheng Chin
fb6737e893
Distributed Squeeze and Distributed Unsqueeze (#18269)
Implementat DistributedSqueeze & DistributedUnsqueeze for llama 2.
2023-11-06 20:11:35 -08:00
Yi Zhang
b7b8b5b2ce
Fix Eigen-3.4.0 URL and hash (#18290)
### Description
Add CI changes for #18287

Install onnx explicitly to pass windows GPU+dml stage.


### Motivation and Context
'eigen-3.4' was refering to a branch, not to a tag. There is now an
Eigen 3.4.1 on that branch, and thus the hash has changed.
See
https://github.com/microsoft/onnxruntime/issues/18286#issuecomment-1793683416
2023-11-06 09:19:51 -08:00
Chi Lo
dfafcb58aa
[TensorRT EP] Properly set CUDA_INCLUDE_DIR for onnx-tensorrt (#18274)
https://github.com/microsoft/onnxruntime/pull/17468
The above PR didn't fully fix the issue for some environments.
This PR fixes this.
2023-11-03 20:04:10 -07:00
Scott McKay
4f2096be38
Update XNNPACK to latest version (#18038)
### Description
<!-- Describe your changes. -->
Update XNNPACK to latest version
- adds fp16 kernels and various other improvements
- requires pthreadpool update as well

Most code updates in the XNNPACK EP are to adjust to the new XNNPACK API
- 'setup' is split into 'reshape' and 'setup'
-  some ops use a workspace buffer
   -  copied workspace allocation from XNNPACK unit test code
- some suffixes changed 

Added wrapper for XNNPACK caches to base XNNPACK EP kernel
- simplifies usage
- XNNPACK split out the code and weights caches, but the code cache
isn't currently usable via the public API
- we could use the internal types if we think it's required for
performance reasons. non-trivial though as we'd need to propagate ifdef
values from the XNNPACK build up to the ORT build.
- using XNNPACK internals would also mean we would not be able to
support using a pre-build XNNPACK package
    - not an issue currently
  
Fixed opset registration for internal NHWC domain
- was not being tied to the ONNX version, so nodes inserted by layout
transformation had the incorrect opset
- a number of other places needed updating once this issue was fixed

Remove support for NCHW Resize from XNNPACK EP so it's NHWC only
- we only supported NCHW for fp32,
- doing so adds complexity in multiple places (XNNPACK EP kernel
implementation, layout transformation and transpose optimization)
- unclear if that complexity provides any benefit. can add back if
required by production scenario

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
We're looking at enabling fp16 support for CoreML and NNAPI. If we do
that we need a good fallback story if the CPU EP will be used. The
XNNPACK fp16 kernels will hopefully provide that.

NOTE: This PR doesn't add fp16 support to the XNNPACK EP kernels. That
can be done as required in separate EPs and should be relatively simple
to do.
2023-11-03 09:04:28 -07:00
Scott McKay
016b75260b
Pre-link when creating static library for apple framework (#18241)
### Description
<!-- Describe your changes. -->
Pre-link with `ld -r` to apply symbol visibility when the static library
is created to replicate XCode's Single Object Pre-link.

Current builds set the visibility flags but that doesn't get applied
until the static library is linked into something else, which can be too
late. Pre-linking fixes this.

The pre-link uses the .o files from the ORT static libraries and the .a
files from external libraries. This combination limits the symbols
included from the .a files to things required by the ORT .o files.

In order to minimize changes elsewhere in the build we extract the .o
files from the ORT static libraries using `ar -x`.

Re-ordered the pieces use to build the Apple framework to make it a
little more readable.
Fixed a couple of misc issues with missing symbols from the minimal
build that show up when pre-linking is applied.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Will hopefully address #17722
2023-11-03 23:38:29 +10:00
aciddelgado
178f7caaeb
GQA Memory Efficient Kernel (#17920)
Implement Cutlass Memory Efficient Attention Kernel into Group Query
Attention Operator.

### Motivation and Context
Before this change, Group Query Attention Operator was supported only by
Flash-Attention. While this is the most efficient kernel for the
operation, it only supports sm >= 80. Cutlass Memory Efficient Attention
Kernel supports sm >= 53, allowing us to support a broader range of GPU
hardware.
2023-11-01 20:04:22 -07:00
Wei-Sheng Chin
9e8ad39847
Distributed Reduction (#18206)
This PR implements distributed reduciton for llama 2. This version
doesn't consider any cases requring re-sharding because we haven't seen
any use cases.

Intutive examples:
- [supported] [2,4,6]-tensor with spec=RRS[0] and device_mesh=[0,1] ->
Reduce(axes=[0]) -> [1,4,6]-tensor with spec=RRS[0] and
device_mesh=[0,1]
- [supported] [2,4,6]-tensor with spec=RRS[0] and device_mesh=[0,1] ->
Reduce(axes=[1]) -> [2,1,6]-tensor with spec=RRS[0] and
device_mesh=[0,1]
- [not supported] [2,4,6]-tensor with spec=RRS[0] and device_mesh=[0,1]
-> Reduce(axes=[2]) -> [2,4,1]-tensor with spec=RRS[0] and
device_mesh=[0,1]

Algorithm:
When the reduced axes are not sharded, each device can call reduction
directly. The output sharding spec will be identical to input sharding
spec. We currently throw when input and output sharding specs are
different.

Review guideline:
- Check 97b8d2f for new op's schema and how new op is registered.
- Read tests in 2450f93 to get faimilar with the behavior of these ops.
- Check the implementation details in 753d9af.
2023-11-01 08:49:33 -07:00
Preetha Veeramalai
d87216bcb1
Openvino ep ort 23.1 (#17911)
### Description
Integration to OpenVINO 2023.1


### Motivation and Context

- Alignment with latest OpenVINO Version. 
- Device name change from VPUX to NPU and Remove from supported list
until official public support is available.

---------

Co-authored-by: Sahar Fatima <sfatima.3001@gmail.com>
Co-authored-by: Saurabh Kale <saurabh1.kale@intel.com>
Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel.com>
2023-11-01 08:39:39 -07:00
liqun Fu
20f2dd8b6b
use onnx rel-1.15.0, update cgman, cmake/external and requirement hash (#18177) 2023-10-31 14:58:21 -07:00
Wei-Sheng Chin
24f9c1afe3
Distributed Expand (#18126)
This PR implements DistributedExpand for llama 2.

Representative Examples of DistributedExpand:
- [shard on non-expanded axis] `input tensor (shape=[8, 1], spec=S[0]R,
device_mesh=[0,1]) -> Expand(target_shape=[8, 2] -> output tensor
(shape=[8, 2], spec=S[0]R, device_mesh=[0,1])`
- [sharding expanded axis is invalid since it must have dim=1 and axis
with dim=1 cannot be sharded] `input tensor (shape=[1, 8], spec=S[0]R,
device_mesh=[0,1]) -> Expand(target_shape=[2, 8] -> output tensor
(shape=[2, 8], spec=S[0]R, device_mesh=[0,1])`

From those examples, we observe a few important behaviors.

- The output sharding spec is always the same to the input sharding
spec.
- Expanding always happen on axis with dimension=1. Otherwise, it will
violate the broadcasting rule.
- No communication is needed since all computation can happen locally.
Let's consider the first example again. If you put the first half tensor
(shape: [4, 1]) on device 0 and the second half (shape: [4, 1]) on
device 1, then `Expand` it with target shape [4, 2] , these two local
tensors (shape: [4, 2]) are exactly the same as the one described by
output sharding spec.

Algorithm:
- Compute logical (i.e., unsharded) shapes of input and output.
- Compute sharded output shape from logical output.
- Call Expand to broadcast local input to sharded output shape.

How to review?
- Start with [changes in
onnxruntime_test_distributed.py](ea33392f37).
Those tests are good examples for using this op.
- [Read
expand.h/expand.cc](e4c49987f5).
Theose changes are for exposing functionalities in Expand to
DistributedExpand.
- Read distributed_expand.h/distributed_expand.cc. It follows the
algorithm described above. The commit
68ac301bba
first sketches the definition of DistributedExpand. The next commit
0eb9330c3b
adds real implementation.
2023-10-28 00:44:02 -07:00
Xavier Dupré
b5f242e978
GemmFloat8 as a contrib ops (#16051)
### Description
Add support for Gemm with float 8 as a contrib op.

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>
Co-authored-by: Xavier Dupre <xadupre@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2023-10-27 14:33:55 +02:00
Wei-Sheng Chin
9c32310673
Distributed Reshape Implementation (#18068)
This DistributedReshape aims at supporting all sharding patterns
encountered in llama 2. All patterns found are tested in
`TestDistributedReshape` in `onnxruntime_test_distributed.py`. This PR
implements algorithms to compute the categories below.
- All inputs and outputs are replica, so it's computed like a normal
Reshape.
- Two-axis fusion (if any of the inputs and outputs are sharded). This
category convers, e.g., `[batch, seq, hidden] -> [batch x seq, hidden]`.
- Two-axis decomposition (if any of the inputs and outputs are sharded).
This category convers, e.g., `[batch x seq, hidden] -> [batch, seq,
hidden]`.

Review guideline:
- Ignore the changes in sharding_spec.h and sharding_spec.cc since they
come from another PR #18025.
- First, read onnxruntime_test_distributed.py to get familiar with the
input/output of DistributedReshape.
- Second, check the new APIs in reshape.h/reshape.cc to expose CUDA
Reshape kernel to DistributedReshape.
- For DistributedReshape, check its `ComputeInternal` for the 3
categories mentioned above.
2023-10-26 22:33:42 -07:00
Vincent Wang
b7408f7389
[ORTModule] ATen Efficient Attention and Triton Flash Attention (#17959)
This PR is to support efficient attention and flash attention in
ORTModule, including:
- Use ATen to call efficient attention, which requires PyTorch 2.2.0 dev
or newer. ORTMODULE_USE_EFFICIENT_ATTENTION=1 to enable.
- Integrate Triton Flash attention, which requires
triton==2.0.0.dev20221202. Need A100 or H100.
ORTMODULE_USE_FLASH_ATTENTION=1 to enable.
- A python transformer tool to match sub-graph by config and write
transformer quickly.

Current transformers supports attention mask for both efficient attn and
flash attn, and dropout for efficient attn only. To support more
training scenarios (such as causal mask in GPT2), more transformers need
to be added.

The feature is guarded by system environment variables, it won't effect
any current behavior if not enabled. Since it requires specific
PyTorch/Triton versions, related tests is not added for now.
2023-10-27 10:29:27 +08:00
Chi Lo
455a9ce614
[TensorRT EP] Use latest onnx-tensorrt parser (#18067)
Use latest onnx-tensorrt to fix compile error.

Please see the issue
https://github.com/microsoft/onnxruntime/issues/18029
2023-10-26 13:55:12 -07:00
Jambay Kinley
d30d4d372a
Add MatMul FP4 and NF4 Support (#18066)
### Description
Add a contrib op MatMulBnb4 (FP4 and NF4) and related toolchain to
support quantization on weight.

This PR adds:
- schema for contrib op MatMulBnb4 which can support FP4 (4-bit floating
point) and NF4 (4-bit NormalFloat) quantization on weight.
- a naive implementation for MatMulBnb4 on CPU and GPU, i.e.,
implemented like MatMul(A, Dequantize(B)).
- a special implementation for GemV for MatMulBnb4 and related benchmark
tool.
- tool to quantize model to FP4 or NF4.
2023-10-25 15:34:58 -07:00
snadampal
d88d52eead
[aarch64] Remove mmla kernel support from apple (#18082)
### Description
<!-- Describe your changes. -->
The mmla kernels require additional ISA flags
and are currently supported only on Linux


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
more context is in https://github.com/microsoft/onnxruntime/pull/15270

cc: @skottmckay , @chenfucn , @snnn
2023-10-25 11:34:57 -07:00
snadampal
780ee186d7
[aarch64] Implement QGEMM kernels with UMMLA/SMMLA instructions (#17160)
### Description
<!-- Describe your changes. -->
This PR adds UMMLA and SMMLA based QGEMM kernels for aarch64. This
covers
(i) symmetric quantization (zero point is Zero)
(ii) asymmetric quantization (zero point is non zero)
(iii) per channel as well as per tensor quantization
(iv) Signed weights (U8S8 Gemm)
(v) Unsigned weights (U8U8 Gemm) and 
(vi) Signed activations and weights (S8S8 Gemm) scenarios

I've enabled the ummla/smmla kernels based on cpuinfo check for `I8MM`
support
MMLA QGEMM kernels are enabled for all the devices that support I8MM
instructions.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This is to improve INT8 quantized MatMul performance on aarch64
platform.
I have run the below benchmarking script (bert , roberta and gpt2 model
inference) on AWS Graviton3 based c7g.4xl instance and observed up to
1.33x performance improvement compared to the optimized UDOT qgemm
kernel performance.

```
cd onnxruntime/python/tools/transformers
python3 benchmark.py
```
I have also run the unit tests, and made sure all are passing

```
./build.sh --config RelWithDebInfo --build_shared_lib --parallel --compile_no_warning_as_error --skip_submodule_sync 

```
2023-10-24 07:49:04 +10:00
liqun Fu
020824ed50
Update ONNX to 1.15.0rc1 (#17914) 2023-10-20 15:08:25 -07:00
Hariharan Seshadri
9356986730
Fix AMD builds and enable testing NHWC CUDA ops in one GPU CI (#17972)
### Description
This PR:

(1) Fixes AMD builds after #17200 broke them (Need to remember to run
AMD builds while trying to merge external CUDA PRs next time)

(2) Turn on the NHWC CUDA feature in the Linux GPU CI. The extra time
spent in building a few more files and running a few more tests will not
be much.

Test Linux GPU CI run :
https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1170770

### Motivation and Context
Keep the NHWC CUDA ops tested
(https://github.com/microsoft/onnxruntime/pull/17200) and guard against
regressions
2023-10-17 09:23:52 -07:00
Maximilian Müller
7c17e33c07
Make CUDA a NHWC EP (#17200)
### Description

CUDA inference speed heavily relies on Tensor Cores. To have tensor
cores achieve the optimal throughput they require the data layout to be
NHWC rather than NCHW.

### Motivation and Context


Especially for convolutional networks this is very important. I will
illustrate this using a very simple network:
```
import torch
import torch.nn as nn

class Net1(nn.Module):

    def __init__(self):
        super(Net1, self).__init__()
        # 1 input image channel, 6 output channels, 5x5 square convolution
        # kernel
        self.m = nn.ModuleList([
            nn.Conv2d(in_channels=8, out_channels=32, kernel_size=5, stride=1),
            nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, stride=1),
            nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, stride=1),
            nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, stride=1, bias=False),
            nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, stride=1, bias=False),
        ])
    def forward(self, x):
        for module in self.m:
            x = module(x)
        return x


if __name__ == "__main__":
    dtype = torch.half
    device = "cuda"

    dummy_input = torch.randn(8, 8, 512, 512, dtype=dtype, device=device)
    model = Net1().to(dtype=dtype, device=device)
    input_names = ["input1"]
    output_names = ["output1"]
    torch.onnx.export(model, dummy_input, "test.onnx",
                      input_names=input_names, output_names=output_names)
```

I profiled the launch of `./build/RelWithDebInfo/onnxruntime_perf_test
-e cuda -I -q -t 5 test.onnx` using sys and nvtx ranges.
Current master launches below kernels: 

![image](https://github.com/microsoft/onnxruntime/assets/44298237/81655fce-0f8e-4f78-9335-b858a8c8977b)

If I add the introduced `-l` flag we see below kernels:

![image](https://github.com/microsoft/onnxruntime/assets/44298237/fceb5d6f-c12d-442b-b15a-948797630008)

Notice the missing NCHW<>NHWC kernels per operation. The layout
optimizer introduced a transpose op as first and last op of the whole
network. The `op_generic_tensor_kernel` shows the bias used which should
also be optimized out next.

Measured across some very basic models:
| CUDA EP | **NCHW** [ms] | **NHWC** [ms] | Speedup |

|:------------------------|--------------------------------------:|-----------------------------------------:|------------------:|
|                         |  -e cuda -t 5 -q |   -e cuda -t 5 -q -l | |
| resnet101-v2-7_bs8_fp16 | 18.33 | 13.07 | 1.4 |
| resnet101-v2-7_bs8 | 21.8 | 12.06 | 1.81 |
| test | 102.07 | 73.62 | 1.39 |
Average speedup: 1.53

## Outlook

Next the mission will be to first write a templated unit test to check
for correctness of NHWC vs NCHW ops. After that we have to transition
more ops to measure perf improvements on a broader range of models.
Currently this is not easily possible as we can do not support all ops
in the NHWC domain.

---------

Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
2023-10-16 10:16:37 -07:00
Chi Lo
8abaa7b753
[TensorRT EP] Fix cmake install (#17923)
We removed tensorrt_provider_factory.h in the
[PR](https://github.com/microsoft/onnxruntime/pull/17617).
Need to remove the copy of this file when cmake install.
2023-10-16 09:16:24 -07:00
Yufeng Li
11af34440a
Add MatMul 4bits support on GPU (#17890)
### Description
<!-- Describe your changes. -->
Add a contrib op MatMulNBits and related toolchain to support
quantization on weight. This PR only adds support for 4bits. It:

- add schema for contrib op MatMulNBits which can support 1-7 bits
quantization on weight.
- a naive implementation for 4bits MatMulNBits on CPU and GPU, i.e.,
implemented like MatMul(A, Dequantize(B)).
- a special implementation for GemV for 4bits MatMulNBits and related
benchmark tool
- tool to quantization model with 4bits. 

Next:
- add general and more efficient kernels for 4bits MatMulNBits on CPU
and GPU
2023-10-13 16:55:30 -07:00
Jeff Daily
07317316cc
CUDA EP vs ROCM EP hipify audit (#17776)
Migrate most CUDA EP improvements and changes to ROCM EP. The process
involves using hipify against all CUDA EP files (i.e. do not exclude any
files from onnxruntime_rocm_hipify.cmake) then vimdiff compare them
against the ROCM EP files that are under source control and pull in most
changes. These changes include functional as well as formatting and
makes comparing CUDA EP and ROCM EP easier, though it makes the PR diff
somewhat less obvious due to formatting changes.

- hipify audit of onnxruntime/core/providers/rocm, enable ops
  - Loop
  - Scan
- hipify audit of onnxruntime/contrib_ops/rocm
- fix contrib ops search implementation
- enable more contrib ops
  - Affine
  - ComplexMul
  - ConvTransposeWithDynamicPads
  - Crop
  - DynamicSlice
  - FFT [Rfft, Irfft]
  - GreedySearch
  - ImageScaler
  - ParametricSoftplus
  - ScaledTanh
  - ThresholdRelu

---------

Co-authored-by: cloudhan <cloudhan@outlook.com>
2023-10-13 10:13:53 +08:00
Tang, Cheng
ca8cab29cd
distributed slice (#17761)
### Description
Support DistributedSlice kernel in Cuda EP.

mainly support following cases:
1. input data is sharded or replica for all axes (including slice axes)
2. slice axes is sharded across different devices.

starts / ends / steps sharded across different devices are not supported
yet.

---------

Co-authored-by: Wei-Sheng Chin <wschin@outlook.com>
Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Cheng Tang <chenta@microsoft.com>
2023-10-12 14:28:00 -07:00
Maximilian Müller
74a8acf405
Set default value for NVCC threads (#17866)
Without doing this CMake gives a miscellaneous error on windows when
checking if NVCC is functional. It will be missing a number after
`--threads`.

Currently it is only possible to configure through the python build scripts and not CMake
only configure - which is what I am usually doing through CLion.
2023-10-11 22:46:40 -07:00
Numfor Tiapo
b8f373b0ae
Add API for NPU Device Selection in the DML EP (#17612)
Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
2023-10-11 14:53:00 -07:00
pengwa
0e2782438a
Support inplace update for PythonOp/Grad (#17687)
### Support inplace update for PythonOp/Grad

This PR is based on another PR
https://github.com/microsoft/onnxruntime/pull/17685's branch, to make it
easier to review.

With PR: PR https://github.com/microsoft/onnxruntime/pull/17685, By
default all PythonOp inputs/outputs are assumed to not be inplaced, if
during run, we found some inplace update happens (by checking output
data address with all inputs data address), we add clone before set it
as PythonOp/Grad's outputs. In this case, results are correct, but
implicit copies overheads are introduced.

This PR allow users to define output input reuse map, to let ORT know
how to do the reuse map, avoid such unnecessary copies.
2023-10-10 21:36:45 -07:00
Changming Sun
05ac9f6f2a
Split onnxruntime_providers.cmake to multiple (#17853)
### Description
Split onnxruntime_providers.cmake to multiple files, for easier editing.
No other change was made in this PR.
2023-10-09 20:33:44 -07:00
Baiju Meswani
9c716f4557
Add noexcep_operators to onnxruntime internal libraries (#17850) 2023-10-09 16:29:41 -07:00
cloudhan
c2bd5b70b2
Fix enable_training and use_migraphx (#17827) 2023-10-08 11:43:27 +08:00
MistEO
faf9a0f6c7
Fix runtime installation error (#17828) 2023-10-07 11:50:02 -07:00
JiCheng
3878011ce2
Remove MPI dependency (#17624)
### Description
<!-- Describe your changes. -->

Support launch multi-GPU without MPI


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-10-06 15:33:18 +08:00
George Wu
b306b02a86
[QNN EP] fixed input for InstanceNormU8 unit test and update copy lib paths (#17806)
-update InstanceNormU8 with fixed input. With this input, it fails
consistently using QNN 2.15.1
-update QNN lib paths (target is deprecated) and additionally copy V73
skel file
2023-10-05 22:17:15 -07:00
Justin Chu
be7541ef4a
[Linter] Bump ruff and remove pylint (#17797)
Bump ruff version and remove pylint from the linter list. Fix any new
error detected by ruff.

### Motivation and Context

Ruff covers many of the pylint rules. Since pylint is not enabled in
this repo and runs slow, we remove it from the linters
2023-10-05 21:07:33 -07:00
Wei-Sheng Chin
faef9c32fa
ONNX-Native Tensor Parallel: Using Distributed MatMul as Example (#17695)
This PR introduces
- New data structure to represent kernel-level (aka node-level or
op-level) tensor sharding informaiton. I consider it as the
fundamentaion of ONNX distribtued inference.
- Building blocks for distribtued kernels implementation especially
stateless implementation for communication ops.
- Implementation of DistributedMatMul and its tests.

Code structure:
- sharding.h/.cc: Function to shard and reshard tensors (calling into
NCCL).
- sharding_spec.h/.cc: Representation of how a tensor is sharded.
- distributed_matmul.h/.cc: Implementation of tensor parallel MatMul.
Inputs and outputs are sharded across devices.
- onnxruntime_test_distributed.py: distributed operator tests.

Example of specifying sharding information
```python
        @onnxscript.script()
        def matmul_rs_sr_rr(tensor_x: FLOAT, tensor_w: FLOAT) -> FLOAT:
            # Run MatMul by sharding x along column axis and w along row axis on
            # 2 GPUs.
            return MICROSOFT_OPSET.DistributedMatMul(
                tensor_x,
                tensor_w,
                device_mesh_shape=[2],
                device_mesh_elements=[0, 1],
                input_shard_specs=["RS[0]", "S[0]R"],
                output_shard_specs=["RR"],
            )
        onnx_model = matmul_rs_sr_rr.to_model_proto(
            input_types=[FLOAT[2, "s"], FLOAT["s", 2]],
            output_types=[FLOAT[2, 2]],
        )
```

In this example, the device mesh can be visualized as 1-D tensor, `[0,
1]`. The 2nd axis of `tensor_x` is sharded across `[0, 1]` (i.e., the
0-axis of the device mesh). Similarly, the 1st axis of `tensor_w` is
sharded across `[0, 1]` as well.

C++ classes to represent tensor sharding (copied from sharding_spec.h):
```cpp
class DeviceMesh {
 public:
  // [Device Mesh and Tensor Sharding for Tensor Parallel]
  // Device mesh is a tensor of device indices.
  // A tensor can then be partitioned along specific mesh axes.
  //
  // Assume we have 4 GPUs indexed by 0, 1, 2, and 3.
  // Let's consider some examples.
  //  1. 1D device mesh [0, 1, 2, 3]. In this case,
  //     device_mesh_shape is [4] and device_mesh_elements
  //     is [0, 1, 2, 3].
  //     If we want to shard a 2-D tensor along its axis 1, the
  //     corresponding sharding spec is a string "RS[0]".
  //  2. 2D device mesh [[0, 1], [2, 3]]. In this case,
  //     device_mesh_shape is [2, 2] and device_mesh_elements
  //     is [0, 1, 2, 3].
  //     If we want to shard a 2-D tensor's
  //     rows along mesh axis 1 and
  //     columns along mesh axis 0, the
  //     corresponding sharding spec is a string "S[1]S[0]".
  //     If that 2-D tensor's value is np.array([[5, 6], [7, 8]]),
  //     GPU 0/1/2/3 owns 5/7/6/8.  Below is a visualization the sharding
  //     proccess.
  //     - Start with a 2-D device mesh [[0, 1], [2, 3]] and
  //       a 2-D tensor [[5, 6], [7, 8]]
  //       - GPU: [[0, 1], [2, 3]], Tensor: [[5, 6], [7, 8]]
  //     - Split GPU mesh along axis 1 and tensor along
  //       axis 0 for "S[1]" in "S[1]S[0]"
  //       - GPU: [[0], [2]], Tensor: [[5, 6]]
  //         GPU: [[1], [3]], Tensor: [[7, 8]]
  //     - Split GPU mesh along axis 0 and tensor along
  //       axis 1 for "S[0]" in "S[1]S[0]"
  //       - GPU: [[0]], Tensor: [[5]]
  //       - GPU: [[2]], Tensor: [[6]]
  //       - GPU: [[1]], Tensor: [[7]]
  //       - GPU: [[3]], Tensor: [[8]]

  // Actual shape of device mesh represented by `device_mesh_elements`.
  std::vector<int64_t> device_mesh_shape;

  // Flattened device mesh.
  std::vector<int64_t> device_mesh_elements;
};

class AxisPartitionSpec {
  // [Device Mesh and Tensor Sharding for Tensor Parallel]
  // This class is the in-memory representation of
  //  1. if a tensor is sharded or not (aka replica), and
  //  2. which tensor axis is shard by which device mesh axis.
  // Let's consider sharding 2-D tensor along column axis on
  // device mesh [0, 1] as an example.
  // The required sharding spec RS[0] can be represented by
  // - AxisPartitionSpec(Condition::Replica, -1)
  // - AxisPartitionSpec(Condition::Shard, 0)
 public:
  // Status of a tensor axis.
  // A tensor axis can be either sharded or replicated
  // along a device mesh axis.
  enum class Condition { Replica,
                         Shard };

  // This field tells if a tensor axis is sharded or not.
  Condition cond;

  // If a tensor axis is sharded, this field tells which device
  // mesh axis to distribute the shards along.
  // If a tensor axis is not sharded, this field is ignored.
  int device_mesh_axis;

  // A helper to construct a replica spec for a tensor axis.
  static AxisPartitionSpec CreateReplica() {
    return AxisPartitionSpec(Condition::Replica, -1);
  }

  // A helper to construct a sharding spec for a tensor axis.
  // This tensor axis is sharded along `device_mesh_axis` in device mesh.
  static AxisPartitionSpec CreateShard(int device_mesh_axis) {
    return AxisPartitionSpec(Condition::Shard, device_mesh_axis);
  }
};

class TensorPartitionSpec {
  // [Device Mesh and Tensor Sharding for Tensor Parallel]
  // TensorPartitionSpec holds a collection of AxisPartitionSpec and an
  // associated DeviceMesh. It is responsible for determining how a tensor
  // should be partitioned across a device mesh.
  //
  // Example 1: RS[0]
  // In this scenario, `axis_specs` would contain two `AxisPartitionSpec` objects.
  // - The first object is a Replica, denoting that the first axis of the tensor is
  //   not sharded but is instead replicated.
  // - The second object is a Shard along the 0-th axis of the device mesh. It denotes
  //   that the second axis of the tensor is sharded along the first axis of the
  //   device mesh.
  //
  // Example 2: S[0]RR
  // In this scenario, `axis_specs` would contain three `AxisPartitionSpec` objects.
  // - The first object is a Shard along the 0-th axis of the device mesh, indicating
  //   that the first axis of the tensor is sharded along the first axis of the
  //   device mesh.
  // - The second and third objects are Replicas, indicating that the second and third
  //   axes of the tensor are not sharded but are instead replicated.
 public:
  // axis_specs[i]: AxisPartitionSpec for tensor axis i. For a 2-D tensor,
  //                axis_specs[0] is for row axis and axis_specs[1] is for
  //                column axis. axis_specs[i].device_mesh_axis = j means that
  //                tensor axis i is sharded along device mesh axis j.
  std::vector<AxisPartitionSpec> axis_specs;

  // device_mesh: DeviceMesh for sharding the associated tensor.
  // Read [Device Mesh and Tensor Sharding for Tensor Parallel] in DeviceMesh's comment.
  DeviceMesh device_mesh;
};
```
2023-10-05 14:22:25 -07:00
Edward Chen
1bc115719c
Unify handling of public headers in onnxruntime.cmake. (#17779)
The changes in PR #8919 overwrote the PUBLIC_HEADER property value of the `onnxruntime` target with a list that did not include EP-specific headers. We should probably be using a consistent set of header files across packages anyway.
2023-10-04 08:55:08 -07:00
Changming Sun
14d349e290
Enable backtrace in unit tests (#17655)
### Description
Google test can be built either with absl/re2 or not. This PR enables
the build option so that google test framework can print out a nice
stacktrace when something went wrong. It helps locate test errors in CI
build pipelines.

Also, Google test will remove the build option and make it always ON. So
sooner or later we must make this change.
2023-09-29 12:32:56 -07:00
MistEO
870b0bc305
Fix typo of cmake (#17715)
This caused a cmake configuration error.
2023-09-27 11:48:46 -07:00
Mustafa Ateş Uzun
13b0f8a6ce
fix: supported typo (#17216) 2023-09-27 10:45:27 -07:00
liqun Fu
2be4dc6d04
ONNX 1.15 integration (#17125)
### Description
this is for ORT 1.17.0 - make ORT to use ONNX release 1.15.0 branch. Eventually will update to the release tag once ONNX 1.15.0 is released


### Motivation and Context
Prepare for ORT 1.17.0 release. People can start work on new and updated ONNX ops in ORT.
---------

Signed-off-by: Liqun Fu <liqfu@microsoft.com>
2023-09-26 14:44:48 -07:00
Jian Chen
0141e27ca1
Enabling c++ 20 in MacOS build (#16187)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-09-26 11:27:02 -07:00
Tianlei Wu
730fab3050
Refactor Attention cuda kernel (#17578)
* Break QkvToContext into small functions. Each fused and unfused kernel
will have separated function.
* Move DecoderAttention kernel to separated file
* Move KV cache related kernel to attention_kv_cache.cu

### Motivation and Context
To make the code easier to maintain.
2023-09-19 09:49:21 -07:00
Tianlei Wu
adb0be45d3
Refactoring of attention cuda kernel: move prepare qkv and concat_past_to_present (#17559)
To avoid a huge cu file and make code more readable:
 - Move PrepareQKV to separate cu file (attention_prepare_qkv.cu)
 - Move ConcatPastToPresent to attention_concat.cu
 - Add default value for AttentionData
- Add a data structure QkvData to track Q, K and V pointers and track
QKV format.
2023-09-15 10:57:29 -07:00
Changming Sun
5af6279440
Fix Android build (#17540)
### Description
The new cpuinfo library doesn't use clog on Android. Newer XNNPack
versions have removed the dependency on clog, but the one we use still
has it. So I cherry-pick the XNNPack to our patch file.
2023-09-14 07:36:01 -07:00
Changming Sun
24a3c740c0
Revert "[ROCm][MIGraphX] for googletest dep, set OVERRIDE_FIND_PACKAGE (#16715)" (#17523)
This reverts commit bb136f86c8, then
re-implement it in a different way.
I reverted the original change, then added a version constraint to the
find_package args.

If you still found it picks up wrong gtest version after this change,
you may disable `find_package` by setting
'FETCHCONTENT_TRY_FIND_PACKAGE_MODE' to NEVER. For example, the latest
gtest version is 1.14.0. If at a later time Google releases a new
version of gtest and that one is incompatible with the ONNX Runtime
source code you get today and your dev environment already installed the
new version and you do not want to create a new clean build environment
that is without the package, you can add `--cmake_extra_defines
FETCHCONTENT_TRY_FIND_PACKAGE_MODE=NEVER` to your build command to solve
the problem.
2023-09-12 22:39:31 -07:00
Chi Lo
b827ab0efc
[TRT EP] Fix build error for building oss onnx-tensorrt parser (#17468)
If building ORT TRT with `--use_tensorrt_oss_parse` (meaning ORT wil
include [oss onnx-tensorrt
parser](https://github.com/onnx/onnx-tensorrt/blob/main/CMakeLists.txt#L82)
and build it from source) ,the cmake CUDA_INCLUDE_DIR variable is
needed.

if not, you will encounter following [ build
error](https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1133937&view=logs&j=7536d2cd-87d4-54fe-4891-bfbbf2741d83&t=39e3f98f-7fe5-578c-20bd-5ae5a4590bda):

CMake Error: The following variables are used in this project, but they
are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the
CMake files:
    /build/Release/_deps/onnx_tensorrt-src/CUDA_INCLUDE_DIR

Note: Not quite sure why in the past when CI still tested with oss
parser won't hit this issue. probably the CUDA_INCLUDE_DIR was defined
somewhere back then.
2023-09-08 20:34:57 -07:00
Caroline Zhu
dcc93909b4
Add training WASM generation to Web CI pipeline (#17319)
### Description
[Successful pipeline
run](https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1123141&view=results)

Added flag to build the training artifacts & updated the
pull-wasm-artifacts script to pull the training artifacts as well.

Bundled into this PR are minor formatting fixes + naming fixes.

### Motivation and Context
[This PR](https://github.com/microsoft/onnxruntime/pull/16521) extended
the WASM API wrapper to build training WASM artifacts as well.
The ORT training WASM artifacts are required to support ORT training web
bindings.
2023-09-08 15:49:47 -07:00
Changming Sun
bc84f52633
Update C/C++ dependencies: abseil, date, nsync, googletest, wil, mp11, cpuinfo and safeint (#15470)
### Description
Update C/C++ dependencies abseil, date, nsync, googletest, wil, mp11,
cpuinfo and safeint to newer versions per request of @
mayeut. He created the following PRs to update the deps:
https://github.com/microsoft/onnxruntime/pull/15432
https://github.com/microsoft/onnxruntime/pull/15434
https://github.com/microsoft/onnxruntime/pull/15435
https://github.com/microsoft/onnxruntime/pull/15436
https://github.com/microsoft/onnxruntime/pull/15437

However, our build system needs to fetch the dependencies from an
internal mirror that only Microsoft employees have write access to. So I
closed his PRs and created this one.

This PR also updates abseil to a newer version. This is to prepare for
upgrading re2.
2023-09-08 13:35:04 -07:00
Yulong Wang
110a2d0b73
[build][wasm] add js_internal_api.js to link dependency (#17407)
### Description
add js_internal_api.js to link dependency. Now changes to
js_internal_api.js will correctly trigger re-link of ort-wasm.wasm
2023-09-05 20:40:40 -07:00
Changming Sun
c6b0d185b4
Update cmake to 3.27 and upgrade Linux CUDA docker files from CentOS7 to UBI8 (#16856)
### Description
1. Update docker files and their build instructions.
ARM64 and x86_64 can use the same docker file.

2. Upgrade Linux CUDA pipeline's base docker image from CentOS7 to UBI8
AB#18990
2023-09-05 18:12:10 -07:00
Lennart Hannink
e3bb2a0cdd
Fix git working dir for ORT_BUILD_INFO (fixes #17197) (#17198)
### Description
Git commands producing `git-commid-id` and `git-branch` are always run
in `CMAKE_CURRENT_SOURCE_DIR` (i.e. `onnxruntime/cmake`)


### Motivation and Context
Please refer to corresponding issue
[#17197](https://github.com/microsoft/onnxruntime/issues/17197).
2023-09-05 09:20:49 -07:00
cloudhan
6ea3908db4
Add ck's streamk and splitk gemm impl (#17280) 2023-09-04 11:49:07 +08:00
aciddelgado
44101e8771
Flash Attention v2 MHA (#17227)
### Description
Integrate Flash Attention V2 to PackedMultiHeadAttention,
MultiHeadAttention and Attention operators.

Flash Attention v2 source code is from
https://github.com/Dao-AILab/flash-attention/tree/main/csrc/flash_attn/src.
We did some change to remove dependency on Torch, then removed backward
and bfloat16 related code.

Add benchmark script (see benchmark_mha.sh) to compare different
attention kernels for MultiHeadAttention operator.

Current limitations for Flash Attention in PackedMultiHeadAttention,
MultiHeadAttention and Attention operators:
* Relative Position Bias is not supported
* Different hidden size for Q and V is not supported
* Only float16 is supported
* Padding/attention mask is not supported
* For MultiHeadAttention, when there is past or present input, bias
shall be provided to activate flash attention
* For Attention, past or present inputs will deactivate flash attention
* Causal is not supported

Some limitations (like attention mask and causal) might be removed
later.

Currently, Flash Attention v2 only works in Linux. For Windows, we will
enable later with Cutlass 3.2.

Two environment variables can be used for testing purpose:
(1) `ORT_DISABLE_FLASH_ATTENTION` to disable flash attention. Default
value is 0 (enable). Set it to "1" to disable it.
(2) `ORT_MIN_SEQ_LEN_FLASH_ATTENTION_PACKED_QKV`. Default value is
"513", which means that we only enable flash attention when sequence
length is larger than 512 for packed QKV format. Set it to "0" if you
want to use flash attention v2 whenever possible.

### Speedup

The following result is from Standard_ND96amsr_A100_v4 VM
(A100-SXM4-80GB GPU) using benchmark_mha.sh. The metric is TFLOPs per
second for MultiHeadAttention operator.

There are 3 input formats:
* `Q,K,V` means separated inputs query, key and value of BxSxNH
* `Q,KV` means packed KV, where key is 5D: BxSxNx2xH
* `QKV` means packed QKV, where query is 5D: BxSxNx3xH

Note that flash attention cannot use packed QKV format, so extra
Transpose is needed. We found that TensorRT kernel is faster for
sequence length <= 512 for packed QKV. The reason might be no transpose
is needed for TensorRT kernel in this format.

We also notice that, TensorRT kernel is faster for stable diffusion
512x512 image (see seq_len=4096, heads=8, head_dim=40 below), while
flash attention v2 is faster for 1024x1024 image (see seq_len=16384,
heads=8, head_dim=40 below).

input format | batch size | sequence length | heads | head dim |
flash_v2 (TFLOPs/s) | TensorRT (TFLOPs/s) | Memory Efficient Attention
(TFLOPs/s)
-- | -- | -- | -- | -- | -- | -- | --
Q,K,V | 32 | 512 | 64 | 32 | 78.1 | 60.0 | 39.3
Q,K,V | 32 | 512 | 128 | 16 | 46.8 | 44.1 | 21.7
Q,K,V | 16 | 1024 | 64 | 32 | 99.0 | 72.8 | 44.3
Q,K,V | 16 | 1024 | 128 | 16 | 54.7 | 49.2 | 23.4
Q,K,V | 8 | 2048 | 64 | 32 | 113.8 | 81.2 | 47.8
Q,K,V | 8 | 2048 | 128 | 16 | 59.7 | 51.9 | 24.7
Q,K,V | 4 | 4096 | 64 | 32 | 122.5 | 85.6 | 49.7
Q,K,V | 4 | 4096 | 128 | 16 | 62.5 | 53.3 | 25.3
Q,K,V | 2 | 8192 | 64 | 32 | 127.4 | 87.5 | 50.7
Q,K,V | 2 | 8192 | 128 | 16 | 64.0 | 54.2 | 25.6
Q,K,V | 1 | 16384 | 64 | 32 | 129.5 | 91.0 | 51.2
Q,K,V | 1 | 16384 | 128 | 16 | 64.7 | 54.5 | 25.8
Q,K,V | 1 | 4096 | 8 | 40 | 51.0 | 43.6 | 36.8
Q,K,V | 1 | 4096 | 8 | 80 | 97.7 | 77.0 | 55.5
Q,K,V | 1 | 4096 | 8 | 160 | 120.0 | 39.7 | 57.8
Q,K,V | 4 | 4096 | 8 | 40 | 89.0 | 84.4 | 49.2
Q,K,V | 4 | 4096 | 8 | 80 | 133.0 | 92.2 | 63.2
Q,K,V | 4 | 4096 | 8 | 160 | 164.8 | 42.7 | 63.8
Q,K,V | 1 | 16384 | 8 | 40 | 96.9 | 91.3 | 52.1
Q,K,V | 1 | 16384 | 8 | 80 | 142.9 | 101.5 | 65.6
Q,K,V | 1 | 16384 | 8 | 160 | 177.4 | 44.2 | 65.7
Q,K,V | 128 | 128 | 12 | 64 | 29.0 | 26.9 | 25.7
Q,K,V | 64 | 128 | 12 | 64 | 23.1 | 10.8 | 21.3
Q,K,V | 128 | 384 | 12 | 64 | 83.5 | 60.8 | 55.7
Q,K,V | 64 | 384 | 12 | 64 | 72.6 | 40.5 | 52.8
Q,K,V | 128 | 512 | 12 | 64 | 98.9 | 77.9 | 62.1
Q,K,V | 64 | 512 | 12 | 64 | 94.7 | 75.6 | 60.4
Q,KV | 32 | 512 | 64 | 32 | 85.9 | 41.1 | 41.1
Q,KV | 32 | 512 | 128 | 16 | 47.1 | 21.6 | 21.6
Q,KV | 16 | 1024 | 64 | 32 | 104.4 | 45.8 | 45.8
Q,KV | 16 | 1024 | 128 | 16 | 54.7 | 23.6 | 23.6
Q,KV | 8 | 2048 | 64 | 32 | 116.8 | 48.5 | 48.5
Q,KV | 8 | 2048 | 128 | 16 | 59.8 | 24.7 | 24.7
Q,KV | 4 | 4096 | 64 | 32 | 124.2 | 50.1 | 50.1
Q,KV | 4 | 4096 | 128 | 16 | 62.6 | 25.3 | 25.3
Q,KV | 2 | 8192 | 64 | 32 | 128.5 | 50.8 | 50.9
Q,KV | 2 | 8192 | 128 | 16 | 64.1 | 25.6 | 25.6
Q,KV | 1 | 16384 | 64 | 32 | 129.4 | 51.2 | 51.2
Q,KV | 1 | 16384 | 128 | 16 | 64.8 | 25.8 | 25.8
Q,KV | 1 | 4096 | 8 | 40 | 67.5 | 37.7 | 37.5
Q,KV | 1 | 4096 | 8 | 80 | 101.3 | 56.7 | 56.6
Q,KV | 1 | 4096 | 8 | 160 | 124.0 | 58.6 | 58.6
Q,KV | 4 | 4096 | 8 | 40 | 90.8 | 49.8 | 49.8
Q,KV | 4 | 4096 | 8 | 80 | 135.6 | 63.8 | 63.8
Q,KV | 4 | 4096 | 8 | 160 | 166.3 | 64.5 | 64.5
Q,KV | 1 | 16384 | 8 | 40 | 97.5 | 52.3 | 52.3
Q,KV | 1 | 16384 | 8 | 80 | 143.5 | 65.9 | 65.8
Q,KV | 1 | 16384 | 8 | 160 | 178.4 | 65.9 | 65.8
Q,KV | 128 | 128 | 12 | 64 | 26.8 | 48.1 | 30.9
Q,KV | 64 | 128 | 12 | 64 | 28.0 | 38.9 | 25.0
Q,KV | 128 | 384 | 12 | 64 | 97.7 | 61.1 | 61.0
Q,KV | 64 | 384 | 12 | 64 | 89.5 | 57.8 | 57.9
Q,KV | 128 | 512 | 12 | 64 | 111.9 | 66.7 | 66.9
Q,KV | 64 | 512 | 12 | 64 | 107.2 | 64.9 | 64.8
QKV | 32 | 512 | 64 | 32 | 77.2 | 84.7 | 39.3
QKV | 32 | 512 | 128 | 16 | 43.4 | 53.1 | 20.9
QKV | 16 | 1024 | 64 | 32 | 98.8 | 87.4 | 44.6
QKV | 16 | 1024 | 128 | 16 | 52.0 | 54.1 | 23.2
QKV | 8 | 2048 | 64 | 32 | 113.1 | 89.0 | 47.9
QKV | 8 | 2048 | 128 | 16 | 58.2 | 54.6 | 24.5
QKV | 4 | 4096 | 64 | 32 | 120.6 | 89.7 | 49.7
QKV | 4 | 4096 | 128 | 16 | 61.7 | 54.6 | 25.2
QKV | 2 | 8192 | 64 | 32 | 125.9 | 89.5 | 50.7
QKV | 2 | 8192 | 128 | 16 | 63.6 | 54.8 | 25.5
QKV | 1 | 16384 | 64 | 32 | 128.5 | 92.0 | 51.2
QKV | 1 | 16384 | 128 | 16 | 64.6 | 54.8 | 25.7
QKV | 1 | 4096 | 8 | 40 | 60.2 | **69.8** | 38.1
QKV | 1 | 4096 | 8 | 80 | 101.6 | 75.2 | 56.7
QKV | 1 | 4096 | 8 | 160 | 130.2 | 41.2 | 58.4
QKV | 4 | 4096 | 8 | 40 | 90.6 | **91.0** | 49.5
QKV | 4 | 4096 | 8 | 80 | 133.6 | 98.1 | 62.8
QKV | 4 | 4096 | 8 | 160 | 165.3 | 43.7 | 63.9
QKV | 1 | 16384 | 8 | 40 | 97.2 | 92.8 | 52.1
QKV | 1 | 16384 | 8 | 80 | 143.0 | 103.1 | 65.6
QKV | 1 | 16384 | 8 | 160 | 177.6 | 44.5 | 65.7
QKV | 128 | 128 | 12 | 64 | 31.1 | 65.9 | 27.6
QKV | 64 | 128 | 12 | 64 | 26.1 | 49.8 | 23.5
QKV | 128 | 384 | 12 | 64 | 84.6 | 88.5 | 56.1
QKV | 64 | 384 | 12 | 64 | 79.1 | 80.3 | 53.5
QKV | 128 | 512 | 12 | 64 | 97.3 | 114.2 | 62.2
QKV | 64 | 512 | 12 | 64 | 95.9 | 110.7 | 60.6
QKV | 4 | 2048 | 32 | 128 | 125.26 | 44.72 | 78.15
QKV | 4 | 4096 | 32 | 128 | 141.62 | 46.29 | 85.84
QKV | 8 | 2048 | 32 | 128 | 127.40 | 45.49 | 78.75
QKV | 8 | 4096 | 32 | 128 | 144.24 | 46.60 | 86.95

### Known Issues

NVCC uses huge memory while compiling flash attention CUDA kernel. Linux
build with CUDA might fail when machine has limited memory while number
of CPUs is large. Walkaround is to use a build machine with larger
memory, or use argument like `--nvcc_threads 1` to limit nvcc threads in
build.

### Motivation and Context
Increases speed and efficiency of MHA or Packed MHA.

---------

Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: tlwu@microsoft.com <tlwu@a100.crj0ad2y1kku1j4yxl4sj10o4e.gx.internal.cloudapp.net>
2023-08-31 13:52:21 -07:00
Wanming Lin
3a53836836
[WebNN EP] Fix compilation with newer flatbuffers (#17367) 2023-08-31 10:22:15 -07:00
Artem Shilkin
6e60dba726
Fix compilation with newer flatbuffers (#17164)
In flatbuffers@v23.5.9 was broken forward declaration for
FlatBufferBuilder. Trying to compile onnxruntime falls with the
following error:
```
flatbuffers/include/flatbuffers/flatbuffer_builder.h:1420:38: error: typedef redefinition with different types ('FlatBufferBuilderImpl<false>' vs 'flatbuffers::FlatBufferBuilder')
typedef FlatBufferBuilderImpl<false> FlatBufferBuilder;
                                     ^
onnx_runtime/include/onnxruntime/core/graph/graph.h:47:11: note: previous definition is here
    class FlatBufferBuilder;
```
This PR removes these declarations and puts includes instead
2023-08-29 10:28:26 -07:00
Caroline
228db24317
Add training API functions to WASM API (#16521)
### Description
* Created `wasm/training_api` source and header files & modified
WebAssembly CMake to include training flags
* The `wasm/training_api` files use an `OrtTrainingManager` handle which
is a struct of an OrtCheckpointState and an OrtTrainingSession, rather
than creating a CheckpointState handle & a separate TrainingSession
handle.
* This is so that the TypeScript side only has to manage one handle that
will be passed between TrainingSession & CheckpointState
representations, rather than the TypeScript side managing separate
CheckpointStateHandle and TrainingSessionHandle.


### Motivation and Context
WASM API needs to be updated with ORT training API function calls so
that ORT training web bindings can be added for on-device training.

---------

Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
Co-authored-by: carzh <carolinezhu@microsoft.com>
Co-authored-by: Ashwini Khade <askhade@microsoft.com>
2023-08-28 11:05:02 -07:00
Arthur Islamov
c262879214
Added DML and CUDA provider support in onnxruntime-node (#16050)
### Description
I've added changes to support CUDA and DML (only on Windows, on other
platforms it will throw an error)



### Motivation and Context
It fixes this feature request
https://github.com/microsoft/onnxruntime/issues/14127 which is tracked
here https://github.com/microsoft/onnxruntime/issues/14529

I was working on StableDiffusion implementation for node.js and it is
very slow on CPU, so GPU support is essential.

Here is a working demo with a patched and precompiled version
https://github.com/dakenf/stable-diffusion-nodejs

---------
2023-08-25 16:57:06 -07:00
Yulong Wang
79c4ed9a45
[js/webgpu] support error pop and kernel name (#17260)
### Description
This PR contains changes to support error pop and kernel name.

- Add a function `JsepGetNodeName` to allow reading kernel name from JS
to C++
- When in debug mode ( `env.debug = true;` ) or in profiling mode (
`env.webgpu.profilingMode = 'default';` ), kernel name will be read from
ORT; otherwise use the kernel pointer ( a number ) as kernel name to
save calls from JS to C++.
- When in debug mode, WebGPU validation errors will be recorded and if
any error occurs, `inferenceSession.run()` will fail (Promise get
rejected). Behavior when not in debug mode is not changed. This is
because recording errors are not zero-overhead, and GPU validation
errors should occur consistently in and not in debug mode.
- Add `jsepOnRunStart()` and `jsepOnRunEnd()` hook to:
   - allow implementation of the features mentioned above.
   - pass session ID to backend.
2023-08-25 08:08:15 -07:00
mindest
735cc8e6c8
[ROCm] enable If op for ROCm EP. (#17279)
### Description
Enable If op for ROCm EP.
2023-08-25 17:49:49 +08:00
Baiju Meswani
fca81cc5d5
ConvTransposeGrad CUDA Kernel (#17201) 2023-08-24 09:08:06 -07:00
cloudhan
87bef1f3f2
Move composable_kernel to deps.txt (#17245) 2023-08-23 17:39:16 -07:00
kunal-vaishnavi
edac3ef150
Add LLaMA scripts (#17020)
### Description
This PR adds the following scripts for LLaMA:
- LLaMA conversion (support for TorchScript and Dynamo exporters)
- LLaMA parity
- LLaMA benchmark
- LLaMA quantization
- LLaMA integration with [Hugging Face
Optimum](https://github.com/huggingface/optimum)



### Motivation and Context
This PR adds scripts for using LLaMA. There is a [follow-up
PR](https://github.com/microsoft/onnxruntime/pull/17043) for adding
scripts for Whisper.
2023-08-22 18:05:11 -07:00
Edward Chen
bd8a488f4b
Enable verbose logging in unit test program with environment variable. (#17133)
Enable verbose logging in unit test program with environment variable.
E.g., `ORT_UNIT_TEST_MAIN_LOG_LEVEL=0 ./onnxruntime_test_all --gtest_filter="<test that I want to see more logs for>"`.
2023-08-22 12:13:52 -07:00
cloudhan
4e6cec4d09
Update ck and enable test (#16383)
Apply the fix in https://github.com/ROCmSoftwarePlatform/composable_kernel/issues/728
Introduce more kernel instances and allow the introduction of streamk and splitk.
2023-08-22 11:08:55 +08:00
Sheil Kumar
cbaa008391
Bump DirectML version from 1.12.0 to 1.12.1 (#17225)
Bump DirectML version from 1.12.0 to 1.12.1

Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
2023-08-20 09:55:38 -07:00
Changming Sun
3cec88bd12
FIX: memory leak checker is incompatible with std::stacktrace (#17209)
### Description
When I worked on PR #17173, I didn't notice that
onnxruntime\core\platform\windows\debug_alloc.cc also needs to call
dbghelp functions like SymInitialize. So, if we use vc runtime's
stacktrace functionality, vc runtime will initialize/uninitialize the
dbghelp library independently and vc runtime's stacktrace helper DLLs
get unloaded before our memory leak checker starts get work. Then we
call SymSetOptions, it crashes.

More details:
In VC runtime the C++23 stacktrace functions are implemented on top of
dbgeng.dll. In C:\Program Files\Microsoft Visual
Studio\2022\Enterprise\VC\Tools\MSVC\14.37.32822\crt\src\stl\stacktrace.cpp,
you can see it has:
```
                dbgeng = LoadLibraryExW(L"dbgeng.dll", nullptr, LOAD_LIBRARY_SEARCH_SYSTEM32);
```
The dbgeng.dll is a wrapper around dbghelp.dll. It calls SymInitialize
and SymCleanup. dbgeng.dll gets unloaded before our memory leak check
starts to run. In theory we should be able to call SymInitialize again
if the previous user who called SymInitialize has also called
SymCleanup. However, users can use
SymRegisterCallback/SymRegisterCallback64/SymRegisterCallbackW64 to
register callback functions to dbghelp.dll. These callback functions
need to be alive when SymSetOptions(and some other dbghelp APIs) get
called.

### Motivation and Context
2023-08-18 17:10:33 -07:00
Changming Sun
ee09a5ff35
Add DISABLE_CUSPARSE_DEPRECATED flag to CUDA build (#17207)
This is to suppress a warning and make Windows CUDA 12.2 build work.
2023-08-18 10:25:49 -07:00
Chi Lo
2fb148dd88
Temporarily enforce "Debug build" TRT EP with trt oss parser on Windows (#17059)
This PR handles two changes:

1. There is an issue when running "Debug build" TRT EP with "Release
build" TRT builtin parser on Windows. Enforce use oss parser for Debug
build.
Note: args.config in build.py is an array, for example ["Debug",
"Release"...]. The code will be much mess if we made the change there.
2. Update to use latest commit of oss parser.

Please see the https://github.com/microsoft/onnxruntime/issues/16273
2023-08-17 12:17:25 -07:00
Changming Sun
5249b7ab7c
Re-implement stacktrace (#17173)
### Description
Re-implement stacktrace. The new implementation doesn't directly use
Windows API, hence can avoid problems regarding to
initialize/uninitialize the dbghelp library.

### Motivation and Context
2023-08-16 16:07:49 -07:00
Dmitri Smirnov
f45eef399e
Fix visualization issues with Attribute/Tensor protos (#17188)
### Description
Protobuf Natvis
2023-08-16 13:56:51 -07:00
RandySheriffH
3dd2c1b4d7
EP context for custom op (#16454)
Implement infrastructures to allow EP resources surfaced to custom ops.

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-08-16 13:03:40 -07:00
Maximilian Müller
7b9d1f18c7
NVTX windows include and link fixes (#16831)
### Description

For windows headers are not duplicated to the normal cuda include. For
linux they are:
```
(base) maximilianm@maximilianm-dt-linux:~$ ls /usr/local/cuda/include/nvtx3 | grep nvTool
nvToolsExt.h
nvToolsExtCuda.h
nvToolsExtCudaRt.h
nvToolsExtOpenCL.h
nvToolsExtSync.h
(base) maximilianm@maximilianm-dt-linux:~$ ls /usr/local/cuda/include | grep nvTool
nvToolsExt.h
nvToolsExtCuda.h
nvToolsExtCudaRt.h
nvToolsExtOpenCL.h
nvToolsExtSync.h
```
Is the preference via those added defines or should the include just be
changed to be `nvtx3/` ?

Also there is no library linking needed on Windows and the library is
not even present.
2023-08-16 11:53:58 -07:00
Changming Sun
8e203efc69
Cleanup cmake file (#17154)
### Description
1. Clean up cmake files. Remove some unused code
2. Remove the "Semmle" task from
tools/ci_build/github/azure-pipelines/templates/win-ci.yml. Semmle is
deprecated and replaced by CodeQL.
2023-08-15 10:51:33 -07:00
Matthieu Darbois
5e971bc51a
Rework WIL dependency retrieval/usage (#17130)
### Description
1. `onnxruntime_fetchcontent_makeavailable` works around unconditional
install commands so that can be used instead of `FetchContent_Populate`
2. This dependency is Windows specific, mark it as such.

### Motivation and Context
1. This simplifies `cmake/external/wil.cmake` not to do anything
specific wether WIL was fetched or found
2. Given it's specific to Windows, it might not be available on other OS
in specific air-gapped environment such as
[conan-center-index](https://github.com/conan-io/conan-center-index).
This allows downstream builds not to require specific patches for
something not required by the build in the first place.
2023-08-15 09:11:46 -07:00
Wenbing Li
d052c8a45c
Remove the extensions submodule (#17097)
### Description
Remove the onnxruntime-extensions submodule since it now was used via
cmake FetchContent


### Motivation and Context
The submodule relies on an outdated version of the extensions, and the
build instructions should be updated to eliminate any confusion.
2023-08-14 10:16:33 -07:00
Yulong Wang
5704e71b89
update onnx.patch to apply wasm build break fix (#17104)
### Description
This PR fixes build break for WebAssembly introduced in
6986981482
(435ad2b1d8).

This change updates onnx.patch in onnxruntime repo. the corresponding PR
in onnx repo is: https://github.com/onnx/onnx/pull/5495.

It may takes a while for the next onnx version bump.
2023-08-11 15:00:39 -07:00
Changming Sun
4728f20f9a
Fix CI build (#17118)
### Description
Some pipelines are failing. It is because PR #16325 set ONNX version to
`rel-1.14.1` . It is a branch name, not a commit or tag name. It means
whenever the branch got a new commit, we will auto pick it and use it.
2023-08-11 10:56:38 -07:00
Yulong Wang
9cd4e5af68
[wasm] upgrade emsdk to 3.1.44 (#17069)
### Description
This change upgrade emsdk to 3.1.44.

Because backend is upgraded to LLVM 16, so need to fix a lot of build
failures caused by "-Wshorten-64-to-32".

most of the build failures comes from generated `onnx.pb.h`, and this
can be fixed by including "core/graph/onnx_protobuf.h", which detects
and ignore shorten-64-to-32 warnings.
2023-08-10 16:08:36 -07:00
Bowen Bao
6986981482
Bump ONNX version (#16325)
### Description
Bump ONNX version to https://github.com/onnx/onnx/tree/rel-1.14.1 to
include a fix for segfault when shape inferencing nested onnx functions.



### Motivation and Context
Resolves #16170
2023-08-10 11:27:28 -07:00
Jeff Daily
dbbfc249f7
[ROCm] update header and binary search paths used by cmake (#17083)
This is in preparation for planned ROCm 6.0 changes that are not
backward compatible. However, the adjustments made by this PR to the
current onnxruntime cmake files will work with ROCm 5.x and 6.x.
2023-08-10 11:05:21 +08:00