onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-02 03:55:34 +00:00

Author	SHA1	Message	Date
aamajumder	589aeb7036	[DML EP] Register DFT-20 (#20341) ### Description This PR registers DFT-20 to the DML EP.	2024-05-02 11:08:39 -07:00
Tianlei Wu	9f0fae29e8	[CUDA] Add SparseAttention operator for Phi-3-small (#20216 ) ### Description Add CUDA implementation for block sparse attention for Phi-3-small. Block sparse attention was proposed in [Sparse Transformers](https://arxiv.org/pdf/1904.10509) by OpenAI, and also adopted in [BigBird](https://arxiv.org/pdf/2007.14062) with different sparse layout. In Phi-3-small, the sparse layout is static, and works with unidirectional (causal) attention. Compared to dense attention, the benefit of block sparse is to speed up both training and inference. It could save memory thus support longer context length. - [x] Add operator spec and shape inference - [x] Symbolic shape inference - [x] Refactor GroupQueryAttention to expose common kernels for kv cache concatenation, q/k/v transpose etc. - [x] Add cuda kernel to convert block mask to CSR format - [x] Add cuda kernel to generate position ids - [x] Add compile script and template files to convert triton kernel to cubin and dispatcher. - [x] Add triton kernel v1 for prompt - [x] Add triton kernel v2 for token generation and support padding - [x] Update IO Binding Helper to allow buffer sharing. - [x] Test relevance - [x] Test performance ### Performance Test in A100-SXM4-80GB with `batch_size=4, num_heads=32, max_seq_len=8192, head_size=128, sparse_block_size=64, local_blocks=16, vert_stride=8, num_layout=8` We compare sparse attention to corresponding GQA with local attention windows size 1024, or GQA with dense causal. 
Average latency in milliseconds (for fused attention kernel used in prompt prefilling):

seq_len | GQA-Dense | GQA-Local | SparseAttention
-- | -- | -- | --
64 | 0.0465 | 0.0722 | 0.0641
128 | 0.0618 | 0.0787 | 0.0672
256 | 0.1086 | 0.1076 | 0.0943
512 | 0.2535 | 0.2487 | 0.1676
1024 | 0.7042 | 0.7050 | 0.3800
2048 | 2.4125 | 1.9316 | 0.8966
4096 | 8.9346 | 4.5699 | 2.1129
8192 | 40.5401 | 10.3508 | 5.1748

Average latency in milliseconds (for fused attention kernel used in token generation):

past_seq_len | GQA-Dense | GQA-Local | SparseAttention
-- | -- | -- | --
64 | 0.0186 | 0.0186 | 0.0870
128 | 0.0408 | 0.0466 | 0.1165
256 | 0.0530 | 0.0592 | 0.0988
512 | 0.0445 | 0.0447 | 0.1150
1024 | 0.0634 | 0.0640 | 0.1454
2048 | 0.1027 | 0.0637 | 0.1589
4096 | 0.1789 | 0.0631 | 0.1806
8192 | 0.3288 | 0.0655 | 0.2146

We can see that the kernel for token generation still has room to improve. #### Limitations Only right-side padding and unidirectional attention are supported. The following are not supported in the first version: (1) Packed mode like PackedMultiHeadAttention, where padding has been removed from the input. (2) Paged attention. (3) Bidirectional attention. (4) GPU compute capability other than 8.0, 8.6 and 8.9. (5) Left-side padding. Some of these limitations will be removed in the future (maybe in a new operator).	2024-04-30 09:06:29 -07:00
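The "convert block mask to CSR format" step from the SparseAttention commit above can be illustrated with a small NumPy sketch. The layout rule used here (a causal local window of `local_blocks` blocks plus every `vert_stride`-th block column) is an assumption made for illustration and is not taken from the kernel source; `build_block_mask` and `block_mask_to_csr` are hypothetical helper names, not ONNX Runtime APIs.

```python
import numpy as np

def build_block_mask(num_blocks, local_blocks, vert_stride):
    """Build a causal block mask (assumed layout, for illustration only)."""
    mask = np.zeros((num_blocks, num_blocks), dtype=np.int32)
    for q in range(num_blocks):
        for k in range(q + 1):  # causal: a query block only sees earlier or equal key blocks
            is_local = (q - k) < local_blocks
            is_vertical = (k + 1) % vert_stride == 0
            if is_local or is_vertical:
                mask[q, k] = 1
    return mask

def block_mask_to_csr(mask):
    """Convert a dense 0/1 block mask to CSR row offsets and column indices."""
    row_indices = [0]
    col_indices = []
    for row in mask:
        cols = np.nonzero(row)[0]
        col_indices.extend(int(c) for c in cols)
        row_indices.append(len(col_indices))
    return (np.array(row_indices, dtype=np.int32),
            np.array(col_indices, dtype=np.int32))

mask = build_block_mask(num_blocks=8, local_blocks=2, vert_stride=4)
row_indices, col_indices = block_mask_to_csr(mask)
# Key blocks attended by the last query block:
last_row_cols = col_indices[row_indices[7]:row_indices[8]]
```

With CSR, a kernel only needs to walk `col_indices[row_indices[q]:row_indices[q + 1]]` for query block `q` instead of scanning a full dense row.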
Yi-Hong Lyu	b2481e3602	Bump up version in main from 1.18.0 to 1.19.0 (#20489 ) Bump up version in main from 1.18.0 to 1.19.0 since the release branch has been cut. --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2024-04-29 20:21:41 -07:00
liqun Fu	cc26b2dac2	Mlas Gemm 4bit avx2, avx512, and avx512vnni kernels (#20163) ### Description
```
Avx2:
Int8        NS(Prompt)  MLAS(Prompt)  MLAS(Prompt)Gain/Loss  NS(TokenGen)  MLAS(TokenGen)  MLAS(TokenGen)Gain/Loss
Blklen16:   90.96       25.15         -72%                   7.65          11.71           53%
Blklen32:   90.73       48.55         -46%                   7.86          14.28           81%
Blklen64:   89.49       68.84         -23%                   8.30          15.78           90%
Blklen128:  87.38       78.37         -10%                   7.90          16.05           103%
Blklen256:  89.45       82.36         -7%                    8.30          16.56           99%

Fp32        NS(Prompt)  MLAS(Prompt)  MLAS(Prompt)Gain/Loss  NS(TokenGen)  MLAS(TokenGen)  MLAS(TokenGen)Gain/Loss
Blklen16:   91.36       105.18        15%                    7.57          9.52            25%
Blklen32:   89.30       105.99        18%                    7.65          9.68            26%
Blklen64:   89.53       101.41        13%                    7.97          9.84            23%
Blklen128:  85.23       99.71         16%                    7.86          10.39           32%
Blklen256:  88.46       97.94         10%                    8.32          10.23           22%

Avx512vnni:
Int8        NS(Prompt)  MLAS(Prompt)  MLAS(Prompt)Gain/Loss  NS(TokenGen)  MLAS(TokenGen)  MLAS(TokenGen)Gain/Loss
Blklen16:   132.18      21.56         -83%                   10.34         11.48           11%
Blklen32:   168.28      43.69         -74%                   11.85         14.73           24%
Blklen64:   201.81      60.29         -70%                   12.36         15.47           25%
Blklen128:  194.92      57.04         -71%                   13.03         14.67           12%
Blklen256:  218.76      70.20         -68%                   13.33         16.31           22%

Fp32        NS(Prompt)  MLAS(Prompt)  MLAS(Prompt)Gain/Loss  NS(TokenGen)  MLAS(TokenGen)  MLAS(TokenGen)Gain/Loss
Blklen16:   102.81      92.74         -9%                    8.41          9.18            9%
Blklen32:   109.49      97.08         -11%                   8.83          11.51           30%
Blklen64:   104.13      101.57        -2%                    9.32          12.00           28%
Blklen128:  108.45      103.69        -4%                    9.58          12.45           29%
Blklen256:  109.43      106.43        -2%                    9.19          12.2            32%
```
--------- Signed-off-by: Liqun Fu <liqfu@microsoft.com> Signed-off-by: liqunfu <liqun.fu@microsoft.com> Co-authored-by: edgchen1 <18449977+edgchen1@users.noreply.github.com>	2024-04-25 21:30:50 -07:00
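The Gain/Loss columns in the MLAS commit above look like the relative change of the MLAS figure against the NS figure. The sketch below checks that reading on the Avx2 Int8 Blklen16 row. The formula is an assumption inferred from the numbers (the commit message states neither the formula nor the unit of the raw values), and `gain_loss_percent` is a hypothetical helper name.

```python
def gain_loss_percent(ns, mlas):
    """Relative change of the MLAS value versus the NS value, in percent (assumed formula)."""
    return (mlas - ns) / ns * 100.0

# Avx2, Int8, Blklen16 row from the commit message.
prompt = gain_loss_percent(90.96, 25.15)
tokengen = gain_loss_percent(7.65, 11.71)
print(f"prompt: {prompt:.0f}%  tokengen: {tokengen:.0f}%")
```

Under this assumption the row's listed -72% and 53% are reproduced. A few other rows differ by one percentage point, which would be consistent with the displayed inputs having been rounded.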
Frank Dong	227c4419fc	add bf16 support for few ops (#20385) ### Description Add bf16 support for the following ops: ConstantOfShape, Exp, Erf, convolution, PythonOp ### Motivation and Context The phimm model runs in bf16, so ORT needs bf16 support for the ops above to work with it.	2024-04-25 11:28:34 -07:00
Xavier Dupré	80213a9e66	Add implementation for ScatterND (#19540) ### Description onnxruntime switches to CPU for ScatterND after opset 13. This extends the implementation to higher opsets.	2024-04-24 14:08:50 +02:00
aciddelgado	94c69f55d4	GQA 4 CPU (#20299 ) ### Description Support GQA operator on CPU with FP32. ### Motivation and Context Right now, models generated for CPU and GPU must be different. GQA CPU allows these models to be the same.	2024-04-22 19:57:05 -07:00
aamajumder	d0e33d2078	[DML EP] Register opset 20 operators (#20092) ### Description This PR registers the following opset 20 operators to the DML EP: - IsNaN-20 - IsInf-20 - ReduceMax-20	2024-04-22 12:01:59 -07:00
Patrice Vignola	8fbb8a149f	[DML EP] Add MatMulNBits (#20308 )	2024-04-19 15:05:37 -07:00
Patrice Vignola	4d98f06f93	[DML EP] Add GroupQueryAttention (#20327 )	2024-04-19 10:25:29 -07:00
Patrice Vignola	b8c90beef2	[DML EP] Add SimplifiedLayerNorm and SkipSimplifiedLayerNorm (#20326 )	2024-04-18 22:17:31 -07:00
Adam Louly	ee74fb6908	Introducing ORTPipelineModule - DeepSpeed Parallel Pipeline Support. (#20287 ) ### Description Introducing a new class ORTPipelineModule to handle wrapping layers in DeepSpeed pipeline parallel. ### Motivation and Context To support pipeline parallelism on ORTModule. This PR will include an initial support of deepspeed Pipeline parallelism. - [x] Support Pipeline parallel where layers are nn Modules in Sequential. - [ ] Support LayerSpec and TiedLayerSpec - [ ] Enable partitioning to accept List - [ ] Full-GPU Graph Consolidation - [ ] Subgraph Merging for Inference	2024-04-18 11:30:15 -07:00
jingyanwangms	c11941289b	Add Gemma Rotary Embedding (#20267) ### Description Add GemmaRotaryEmbedding kernel which includes sin and cos in GemmaRotaryEmbedding forward and apply_rotary_pos_emb. See gemma_rotary_emb_impl.cu for subgraph details.	2024-04-16 15:31:56 -07:00
liqun Fu	cd7112f800	Integration with ONNX 1.16.0 (#19745 ) ### Description update with ONNX 1.16.0 branch according to https://github.com/microsoft/onnxruntime/blob/main/docs/How_To_Update_ONNX_Dev_Notes.md ONNX 1.16.0 release notes: https://github.com/onnx/onnx/releases/tag/v1.16.0 #### Updated ops for CPU EP: - DequantizeLinear(21) - Added int16 and uint16 support + various optimizer tests - Missing int4 and uint4 support - Missing block dequantization support - QuantizeLinear(21) - Added int16 and uint16 support + various optimizer tests - Missing int4 and uint4 support - Missing block quantization support - Cast(21) - Missing int4 and uint4 support - CastLike(21) - Missing int4 and uint4 support - ConstantOfShape(21) - Missing int4 and uint4 support - Identity(21) - Missing int4 and uint4 support - If(21) - Missing int4 and uint4 support - Loop(21) - Missing int4 and uint4 support - Reshape(21) - Missing int4 and uint4 support - Scan(21) - Missing int4 and uint4 support - Shape(21) - Missing int4 and uint4 support - Size(21) - Missing int4 and uint4 support - Flatten(21) - Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4 support - Pad(21) - Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4 support - Squeeze(21) - Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4 support - Transpose(21) - Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4 support - Unsqueeze(21) - Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4 support #### Unimplemented opset 21 features/ops - int4 and uint4 data type - QLinearMatMul(21) - GroupNormalization(21) - ai.onnx.ml.TreeEnsemble(5) ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. 
--> ### Disabled tests #### ORT Training orttraining/orttraining/test/python/orttraining_test_ort_apis_py_bindings.py - test_ort_custom_ops: Potential shape inference bug for custom ops #### Python quantization unit tests test/onnx/python/quantization (shape inference bug) - test_op_conv_transpose.py: test_quantize_conv_transpose_u8u8_fp16 - test_op_conv_transpose.py: test_quantize_conv_transpose_s8s8_fp16 - test_op_gemm.py: test_quantize_qop_gemm_s8s8 - test_op_gemm.py: test_quantize_qop_gemm_e4m3fn_same - test_op_gemm.py: test_quantize_qop_gemm_e4m3fn_p3 - test_op_matmul.py: test_quantize_matmul_u8u8_f16 - test_op_matmul.py: test_quantize_matmul_s8s8_f16 - test_op_matmul.py: test_quantize_matmul_s8s8_f16_entropy - test_op_matmul.py: test_quantize_matmul_s8s8_f16_percentile - test_op_matmul.py: test_quantize_matmul_s8s8_f16_distribution - test_op_relu.py: test_quantize_qop_relu_s8s8 #### ONNX tests - test_maxpool_2d_ceil_output_size_reduce_by_one: ONNX 1.16.0 fixed a maxpool output size bug and added this test. Enable this test when [ORT PR](https://github.com/microsoft/onnxruntime/pull/18377) is merged. Refer to original [ONNX PR](https://github.com/onnx/onnx/pull/5741). 
- test_ai_onnx_ml_tree_ensemble_set_membership_cpu: new unimplemented op ai.onnx.ml.TreeEnsemble - test_ai_onnx_ml_tree_ensemble_single_tree_cpu: same - test_ai_onnx_ml_tree_ensemble_set_membership_cuda: same - test_ai_onnx_ml_tree_ensemble_single_tree_cuda: same - test_cast_INT4_to_FLOAT_cpu: ORT Cast(21) impl doesn't support int4 yet - test_cast_INT4_to_INT8_cpu: same - test_cast_UINT4_to_FLOAT_cpu: same - test_cast_UINT4_to_UINT8_cpu: same - test_cast_INT4_to_FLOAT_cuda - test_cast_INT4_to_INT8_cuda - test_cast_UINT4_to_FLOAT_cuda - test_cast_UINT4_to_UINT8_cuda - test_constantofshape_float_ones_cuda: ConstantOfShape(21) not implemented for cuda - test_constantofshape_int_shape_zero_cuda: same - test_constantofshape_int_zeros_cuda: same - test_flatten_axis0_cuda: Flatten(21) not implemented for cuda - test_flatten_axis1_cuda: same - test_flatten_axis2_cuda: same - test_flatten_axis3_cuda: same - test_flatten_default_axis_cuda: same - test_flatten_negative_axis1_cuda: same - test_flatten_negative_axis2_cuda: same - test_flatten_negative_axis3_cuda: same - test_flatten_negative_axis4_cuda: same - test_qlinearmatmul_2D_int8_float16_cpu: QLinearMatMul(21) for onnx not implemented in ORT yet - test_qlinearmatmul_2D_int8_float32_cpu: same - test_qlinearmatmul_2D_uint8_float16_cpu: same - test_qlinearmatmul_2D_uint8_float32_cpu: same - test_qlinearmatmul_3D_int8_float16_cpu: same - test_qlinearmatmul_3D_int8_float32_cpu: same - test_qlinearmatmul_3D_uint8_float16_cpu: same - test_qlinearmatmul_3D_uint8_float32_cpu: same - test_qlinearmatmul_2D_int8_float16_cuda: same - test_qlinearmatmul_2D_int8_float32_cuda: same - test_qlinearmatmul_2D_uint8_float16_cuda: same - test_qlinearmatmul_2D_uint8_float32_cuda: same - test_qlinearmatmul_3D_int8_float16_cuda: same - test_qlinearmatmul_3D_int8_float32_cuda: same - test_qlinearmatmul_3D_uint8_float16_cuda: same - test_qlinearmatmul_3D_uint8_float32_cuda: same - test_size_cuda: Size(21) not implemented for cuda - 
test_size_example_cuda: same - test_dequantizelinear_blocked: Missing implementation for block dequant for DequantizeLinear(21) - test_quantizelinear_blocked_asymmetric: Missing implementation for block quant for QuantizeLinear(21) - test_quantizelinear_blocked_symmetric: Missing implementation for block quant for QuantizeLinear(21) --------- Signed-off-by: liqunfu <liqun.fu@microsoft.com> Signed-off-by: Ganesan Ramalingam <grama@microsoft.com> Co-authored-by: Ganesan Ramalingam <grama@microsoft.com> Co-authored-by: George Wu <jywu@microsoft.com> Co-authored-by: adrianlizarraga <adlizarraga@microsoft.com>	2024-04-12 09:46:49 -07:00
Patrice Vignola	12042a9387	[DML] Add FastGelu (#20066 ) Although DML doesn't have a "fast" gelu approximation operator, its standard GELU operator is still faster than having to combine all the separate elementwise operators from different ops.	2024-04-11 14:40:28 -07:00
pengwa	280b2634c5	Prompt layer-wise recompute when applicable (#20126) ### Prompt layer-wise recompute when applicable Give users an explicit prompt on export failure to enable layer-wise memory optimization if we find that the checkpoint function is used. - Using the checkpoint function is a strong indicator that the model is too large to fit in GPU memory. - If we don't override the checkpoint function here, ONNX export will most likely fail. 1. For old PyTorch versions, we just throw an exception when handling the gradient checkpoint feature. 2. For new PyTorch versions, an export failure happens. - Neither failure told users explicitly how to mitigate it. This PR does that. ![image](https://github.com/microsoft/onnxruntime/assets/10530022/c0476748-5818-4cc8-b2d6-88c7580fe4da)	2024-04-10 11:50:28 +08:00
Ye Wang	17919717b5	add QMoE (#20108) ### Description 1. Introduce the latest cutlass extension from TRTLLM, which gives us the opportunity to upgrade cutlass (to 3.4) from the MoE side. 2. Fix Windows build issue. 3. Add Int4 MoE op and unit test.	2024-03-29 10:24:19 -07:00
Ye Wang	6ff31e06d5	[MoE] Add TP and Mixtral MoE (#19945) ### Description 1. Support Tensor Parallelism in ShardedMoE. 2. Make necessary code changes to support Mixtral MoE. 3. Fix a bug related to using IOBinding in test script. 4. Fix the input size limitation.	2024-03-19 21:28:15 -07:00
Justin Chu	faea42af95	Bump ruff to 0.3.2 and black to 24 (#19878) ### Motivation and Context Routine updates	2024-03-13 10:00:32 -07:00
pengwa	3e954da3e6	Fix and enable few ORTModule Unit Tests (#19847) ### Fix and enable few ORTModule Unit Tests Fix 'test_bert_inputs_with_dynamic_shape' and 'test_bert_result_with_layerwise_recompute', which generate NaN loss in the ORT run. The root cause is that the logic generating attention mask test data is incorrect: only 0 or 1 is allowed in the dataset, but we see many other numbers. (We don't hit this with older versions of transformers, for example v4.4.2 or 4.16.2, because they don't contain `d3cb28886a`, which increases the scaling to a bigger number, causing an overflow to inf.) Another improvement made during the investigation using convergence tools: don't dump activations during the model export phase; otherwise, the dumped data might contain some PyTorch run results, which is confusing when comparing with stock PyTorch run results.	2024-03-12 10:49:19 +08:00
raoanag	89aa4697b1	[DML] QAttention (#19766) ### Description DML Implementation for [com.microsoft.QAttention](https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md#com.microsoft.QAttention) --------- Co-authored-by: Xiang Zhang <xianz@microsoft.com>	2024-03-11 10:44:34 -07:00
raoanag	fa73d7cbf9	[DML] DynamicQuantizeMatMul (#19763 ) ### Description DML Implementation for [com.microsoft.DynamicQuantizeMatMul ](https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md#com.microsoft.DynamicQuantizeMatMul) ``` .\onnxruntime_test_all.exe --gtest_filter="DynamicQuantizeMatMul." Note: Google Test filter = DynamicQuantizeMatMul. [==========] Running 10 tests from 1 test suite. [----------] Global test environment set-up. [----------] 10 tests from DynamicQuantizeMatMul [ RUN ] DynamicQuantizeMatMul.HasZeroPoint_NoBias_test_S8 [ OK ] DynamicQuantizeMatMul.HasZeroPoint_NoBias_test_S8 (635 ms) [ RUN ] DynamicQuantizeMatMul.HasZeroPoint_NoBias_test_U8 [ OK ] DynamicQuantizeMatMul.HasZeroPoint_NoBias_test_U8 (514 ms) [ RUN ] DynamicQuantizeMatMul.NoZeroPoint_HasBias_test_S8 [ OK ] DynamicQuantizeMatMul.NoZeroPoint_HasBias_test_S8 (512 ms) [ RUN ] DynamicQuantizeMatMul.NoZeroPoint_HasBias_test_U8 [ OK ] DynamicQuantizeMatMul.NoZeroPoint_HasBias_test_U8 (505 ms) [ RUN ] DynamicQuantizeMatMul.NoZeroPoint_NoBias_test_S8 [ OK ] DynamicQuantizeMatMul.NoZeroPoint_NoBias_test_S8 (526 ms) [ RUN ] DynamicQuantizeMatMul.NoZeroPoint_NoBias_test_U8 [ OK ] DynamicQuantizeMatMul.NoZeroPoint_NoBias_test_U8 (504 ms) [ RUN ] DynamicQuantizeMatMul.HasZeroPoint_HasBias_test_S8 [ OK ] DynamicQuantizeMatMul.HasZeroPoint_HasBias_test_S8 (512 ms) [ RUN ] DynamicQuantizeMatMul.HasZeroPoint_HasBias_test_U8 [ OK ] DynamicQuantizeMatMul.HasZeroPoint_HasBias_test_U8 (512 ms) [ RUN ] DynamicQuantizeMatMul.UInt8_test_with_empty_input [ OK ] DynamicQuantizeMatMul.UInt8_test_with_empty_input (112 ms) [ RUN ] DynamicQuantizeMatMul.B_PerColumn_ND [ OK ] DynamicQuantizeMatMul.B_PerColumn_ND (348 ms) [----------] 10 tests from DynamicQuantizeMatMul (4685 ms total) [----------] Global test environment tear-down [==========] 10 tests from 1 test suite ran. (4686 ms total) [ PASSED ] 10 tests. 
memleakdbg: ----- No memory leaks detected ----- ``` ### Motivation and Context - CalculateDynamicQuantizeMatMul to replace CPU EP run reference - Added more FP32 testcases to isolate all input datatype combinations --------- Co-authored-by: Xiang Zhang <xianz@microsoft.com>	2024-03-08 15:35:10 -08:00
Dmitri Smirnov	2964352641	Implement IsNaN-9,13,20 for CUDA along with tests (#19807) ### Motivation and Context Some models require IsNaN on CUDA along with training	2024-03-07 15:46:11 -08:00
pengwa	d102569755	Fix seed for recomputed Dropout (#19715) ### Fix seed for recomputed Dropout If a Dropout node is recomputed in the backward pass, we should make sure its execution is the same as the run in the forward pass. If we don't set the seed attribute, this cannot be guaranteed. Add `export ORTMODULE_MEMORY_OPT_LEVEL=2` to enable per-layer recompute with compromised recomputable subgraphs.	2024-03-06 10:06:25 +08:00
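Why a fixed seed matters for a recomputed Dropout can be shown with a small NumPy sketch: if the node is executed a second time during the backward pass, only a pinned seed guarantees that it drops the same elements as in the forward pass. This is an illustration of the idea, not ONNX Runtime's Dropout kernel; the `dropout` helper and its signature are hypothetical.

```python
import numpy as np

def dropout(x, ratio, seed):
    """Inverted dropout driven by an explicit seed (illustrative helper)."""
    rng = np.random.default_rng(seed)
    keep = rng.random(x.shape) >= ratio          # True where the element survives
    return np.where(keep, x / (1.0 - ratio), 0.0), keep

x = np.ones((4, 8), dtype=np.float32)

# Forward pass and later recomputation share the same seed,
# so the recomputed activation matches the original one exactly.
y_forward, mask_forward = dropout(x, ratio=0.5, seed=1234)
y_recomputed, mask_recomputed = dropout(x, ratio=0.5, seed=1234)
```

Without a pinned seed, the second call would generally draw a different mask, and gradients computed from the recomputed activation would not correspond to the forward result.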
Dmitri Smirnov	1e78bcea60	Implement CUDA IsInf-10,20 (#19772) ### Description Implement IsInf-10,20 for CUDA. Also add FP16 types on CPU. ### Motivation and Context Certain models lag in performance because IsInf is not available on CUDA.	2024-03-05 13:33:01 -08:00
guyang3532	cd56ea4a74	enable embedding sparse optimization by default (#19714 )	2024-03-05 13:15:30 +08:00
wejoncy	7e613ee821	[quant] supports act_order inputs in Matmulnbits and new quantization algorithm "hqq" (#19106) ### Description 1. Support quantized GPTQ weight in huggingface like [TheBloke/Llama-2-7B-Chat-GPTQ](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GPTQ) 2. Support Act_order for GPTQ 3. Support [HQQ](https://mobiusml.github.io/hqq_blog/) algorithm to quantize matmul weight and add quant script	2024-03-05 11:45:45 +08:00
raoanag	27b1dc91ab	[DML] MatrixMultiplyIntegerToFloat (#19608 ) ### Description DML Implementation for [com.microsoft.MatMulIntegerToFloat](https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md#com.microsoft.MatMulIntegerToFloat) ``` .\onnxruntime_test_all.exe --gtest_filter="MatMulIntegerToFloat." Note: Google Test filter = MatMulIntegerToFloat. [==========] Running 22 tests from 1 test suite. [----------] Global test environment set-up. [----------] 22 tests from MatMulIntegerToFloat [ RUN ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_S8S8 [ OK ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_S8S8 (620 ms) [ RUN ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_S8S8 [ OK ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_S8S8 (497 ms) [ RUN ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_S8S8 [ OK ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_S8S8 (488 ms) [ RUN ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_S8S8 [ OK ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_S8S8 (503 ms) [ RUN ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_U8U8 [ OK ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_U8U8 (495 ms) [ RUN ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_U8U8 [ OK ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_U8U8 (488 ms) [ RUN ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_U8U8 [ OK ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_U8U8 (492 ms) [ RUN ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_U8X8 [ OK ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_U8X8 (502 ms) [ RUN ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_S8U8 [ OK ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_S8U8 (452 ms) [ RUN ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_S8U8 [ OK ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_S8U8 (454 ms) [ RUN ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_S8U8 [ OK ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_S8U8 (446 ms) [ RUN ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_S8U8 [ OK ] 
MatMulIntegerToFloat.HasZeroPoint_HasBias_test_S8U8 (508 ms) [ RUN ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_U8S8 [ OK ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_U8S8 (456 ms) [ RUN ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_U8S8 [ OK ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_U8S8 (455 ms) [ RUN ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_U8S8 [ OK ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_U8S8 (447 ms) [ RUN ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_U8S8 [ OK ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_U8S8 (465 ms) [ RUN ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_U8U8 [ OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_U8U8 (111 ms) [ RUN ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_U8S8 [ OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_U8S8 (115 ms) [ RUN ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_S8S8 [ OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_S8S8 (114 ms) [ RUN ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_S8U8 [ OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_S8U8 (110 ms) [ RUN ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16 [ OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16 (112 ms) [ RUN ] MatMulIntegerToFloat.MatMulInteger_With_ZeroPoint [ OK ] MatMulIntegerToFloat.MatMulInteger_With_ZeroPoint (337 ms) [----------] 22 tests from MatMulIntegerToFloat (8679 ms total) [----------] Global test environment tear-down [==========] 22 tests from 1 test suite ran. (8680 ms total) [ PASSED ] 22 tests. memleakdbg: ----- No memory leaks detected ----- ``` ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> * `CalculateMatMulIntegerToFloat` to replace CPU EP run reference * Added more FP32 testcases to isolate all input datatype combinations * Added fixed input to `MatMulIntegerToFloat_FP16` test cases as for FP16 test cases. 
`onnxruntime/test/testdata/matmul_integer_to_float.py` is capable of generating FP16 models, but we do not produce any for now	2024-03-04 11:55:35 -08:00
pengwa	acbfc29f27	Follow up fix for Gelu impl (#19693) ### Follow up fix for Gelu impl There are two minor comments in https://github.com/microsoft/onnxruntime/pull/19560. Fix them in this pull request.	2024-03-01 10:57:14 +08:00
Dmitri Smirnov	5ee62a6bcc	CUDA Resize-18 implementation (#19595 ) ### Description Implement Resize-18 on CUDA. ### Motivation and Context Performance	2024-02-29 14:46:42 -08:00
Markus Tavenrath	5e432a3ae6	Add support for NHWC GridSample in the CUDA EP and enable grid_sample_test for all EPs (#19562) I've added NHWC GridSample support to the CUDA EP to reduce the number of layout transforms. Also I've enabled the full set of GridSampleTests for all EPs. I've also added the GridSample OpSet 16 to the registered kernels. ### Motivation and Context This is the first PR in a series of enhancements of the CUDA EP improving NHWC support to avoid costly layout transforms between NHWC and NCHW nodes which are layout sensitive. Also testing was quite rudimentary for the CUDA EP while it was great for the CPU path. I've regenerated grid_sample_test.cc enabling tests for other platforms as well. Those tests resurfaced #10607 again which is fixed as well.	2024-02-22 19:47:15 -08:00
pengwa	ae92d593c0	ONNX Gelu Op in Opset 20 (#19560) ### ONNX Gelu Op in Opset 20 Refactor code to support MSDomain Gelu and ONNX Gelu-opset20 Op 1. Move the CPU GELU implementation from `onnxruntime/contrib_ops/cpu/activations.h/cc` to `onnxruntime/core/providers/cpu/tensor/gelu.h/cc`, as the implementation for the approximate attribute being 'none'. 2. Duplicate some logic from `onnxruntime/contrib_ops/cpu/bert/bias_gelu.cc` to `onnxruntime/core/providers/cpu/tensor/gelu.h/cc`, as the implementation for the approximate attribute being 'tanh'. 3. Register ONNX domain Gelu CPU kernel from opset 20 in `onnxruntime/core/providers/cpu/cpu_execution_provider.cc`. 4. Move `onnxruntime/contrib_ops/cuda/bert/fast_gelu_impl.h/cu` to `onnxruntime/core/providers/cuda/tensor/gelu_impl.h` and `onnxruntime/core/providers/cuda/tensor/gelu_approximate_impl.cu` respectively, as the implementation for the approximate attribute being 'tanh'. 5. Implement the logic for the approximate attribute being 'none' in `onnxruntime/core/providers/cuda/tensor/gelu_impl.cu`. 6. Register ONNX domain Gelu CUDA kernel from opset 20 in `onnxruntime/core/providers/cuda/cuda_execution_provider.cc`. 7. ROCM ep related changes. 8. Enrich the tests for ONNX domain Gelu in `onnxruntime/test/providers/cpu/activation/activation_op_test.cc`.	2024-02-23 11:05:16 +08:00
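The two values of the `approximate` attribute referenced in the Gelu commit above correspond to two formulas in the ONNX Gelu-20 operator: the exact erf-based form for 'none' and a tanh-based approximation for 'tanh'. The scalar sketch below is a plain-Python illustration of those formulas, not the ONNX Runtime kernels; the function names are hypothetical.

```python
import math

def gelu_none(x):
    """Exact Gelu: 0.5 * x * (1 + erf(x / sqrt(2)))."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

exact = gelu_none(1.0)
approx = gelu_tanh(1.0)
difference = abs(exact - approx)
```

The two variants agree closely for typical activation values, so the choice between them is mostly a trade-off between exactness and the cost of evaluating `erf` versus `tanh`.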
kunal-vaishnavi	44d8ad93b2	Whisper Timestamps and Temperature (#19509 ) ### Description This PR updates exporting and running the Whisper model with beam search by adding the following. - Adds temperature as a graph input to the exported model - Fixes the token ids by adding them as attributes to `WhisperBeamSearch` - Fixes the timestamps test cases so they pass now - Fixes a bug with invoking `torch.onnx.export` - Cleans up the Whisper scripts and groups the arguments in `convert_to_onnx.py` - Adds a `requirements.txt` file to specify package dependencies - Adds `whisper-large-v3` to list of pretrained models - Fixes a bug with missing cross-attention KV cache inputs in the decoder subgraph ### Motivation and Context - This is a follow-up to [this PR](https://github.com/microsoft/onnxruntime/pull/19188). - The incorrect token ids in the timestamps processor were first noticed during [this PR review](https://github.com/microsoft/onnxruntime/pull/17500#discussion_r1333520007). When they were originally added in [this PR](https://github.com/microsoft/onnxruntime/pull/15853), the offsets were previously constant across the Whisper model sizes. When comparing the new `whisper-large-v3` variant, the English-only variants (e.g. `whisper-tiny.en`), and the original variants (e.g. `whisper-tiny`), both the values and the offsets differ. Therefore, it is easier to set the token ids as attributes to `WhisperBeamSearch` when exporting to ensure the right values are used in the timestamps processor. - The Hugging Face API for returning timestamps and the expected outputs from the PyTorch model have both changed. - The fix for `torch.onnx.export` is a follow-up to [this PR review](https://github.com/microsoft/onnxruntime/pull/17179#issuecomment-1683001470). - The argument grouping is a follow-up to [this PR review](https://github.com/microsoft/onnxruntime/pull/17500#discussion_r1333521721). 
- Specific package versions are needed to run the Whisper scripts and the `requirements.txt` file ensures that these versions are installed. - The `whisper-large-v3` variant is released and should be in the list of official pretrained models. - After the changes from [this PR](https://github.com/microsoft/onnxruntime/pull/17316), the exported model is not loading in an ORT inference session because the cross-attention KV cache inputs are missing in the decoder subgraph.	2024-02-16 15:21:43 -08:00
jingyanwangms	775c774f4b	Add BF16 to Sqrt (#19363) ### Description Sqrt does not have BF16 support yet. Adding that with this PR.	2024-02-14 18:07:51 -08:00
Prathik Rao	544407038d	SimplifiedLayerNormalization Fusion BFloat16 support for Llama-v2 on A100 (#18898) ### Description Adds bfloat16 as a supported dtype for SimplifiedLayerNormFusion which will provide speedup for Llama-v2 on A100 using bfloat16 numerical format. _layernorm_optimized_training.onnx exported in bfloat16 vs. float16:_ ![image](https://github.com/microsoft/onnxruntime/assets/31260940/8c0a5f0f-5fcb-4637-bcd9-f34272ec0284) ### Repro Instructions
```python
from torch import nn
from onnxruntime.training.ortmodule import ORTModule, DebugOptions, LogLevel
import torch

dtype = torch.bfloat16
# dtype = torch.float16


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(784, 10, dtype=dtype)
        self.layernorm = nn.LayerNorm([784], dtype=dtype)

    def forward(self, x):
        x = x.view(x.shape[0], -1)
        x = self.layernorm(x)
        x = self.fc(x)
        return x


model = Net()
model = ORTModule(model, DebugOptions(save_onnx=True, onnx_prefix='layernorm', log_level=LogLevel.INFO))
model.to("cuda")
images = torch.randn((8, 28, 28), dtype=dtype).to("cuda")
output = model(images)
```
### Motivation and Context ONNX Runtime integration with Llama-v2 family of LLMs. --------- Co-authored-by: Prathik Rao <prathikrao@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>	2024-02-14 10:05:16 -08:00
Justin Chu	3d2ddf96e3	Bump ruff linter to 0.2.1 (#19471 ) ### Motivation and Context Include new lint rules	2024-02-08 16:08:27 -08:00
Wei-Sheng Chin	ffc3431a66	Update ScatterElements to Support Opset 13, 15, 18 (#19198 ) `ScatterElements` in opset 18 has been around for a while. However, the highest opset supporting `ScatterElements` in ORT is 13. This PR implements this op in the CUDA EP by replacing `assignment` in the current CUDA kernel with `atomic reduction` (e.g., atomic add, atomic max). A series of fundamental atomic functions (e.g., atomic max for int8_t and half) are implemented in `common.cuh`; the implementation is general enough to cover old CUDA and new CUDA versions. - The core changes are in `cuda/atomic/common.cuh` with very detailed documentation including `bit-wise operation's visualization`. They are also copied to `rocm/atomic/common.cuh` to support AMD GPU. - `/cuda/tensor/gather_elements_impl.cu` contains small changes to call the new atomic functions to support new `reduction` behavior in new `ScatterElements`. - New `ScatterElements` are defined in `rocm_execution_provider.cc` and `cuda_execution_provider.cc`.	2024-01-30 09:18:50 -08:00
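The `reduction` behaviour referred to in the ScatterElements commit above can be illustrated with a small CPU reference. This is a minimal sketch for 2-D nested lists, written from the ONNX operator semantics rather than from the CUDA kernel in the PR; the function name `scatter_elements_2d` is hypothetical, and negative indices are not handled. When several updates target the same element, a reduction has to combine them, which is why a parallel kernel needs atomic operations instead of plain assignment.

```python
import copy


def scatter_elements_2d(data, indices, updates, axis=0, reduction="none"):
    """Reference ScatterElements for 2-D nested lists (illustrative only)."""
    reducers = {
        "none": lambda old, new: new,  # plain assignment (opset <= 13 behaviour)
        "add": lambda old, new: old + new,
        "mul": lambda old, new: old * new,
        "max": max,
        "min": min,
    }
    combine = reducers[reduction]
    out = copy.deepcopy(data)
    for i, row in enumerate(indices):
        for j, idx in enumerate(row):
            # Along `axis` the target coordinate comes from `indices`;
            # along the other axis it is the position of the update itself.
            r, c = (idx, j) if axis == 0 else (i, idx)
            out[r][c] = combine(out[r][c], updates[i][j])
    return out


data = [[1, 2, 3, 4, 5]]
indices = [[1, 1]]       # both updates hit column 1
updates = [[10, 20]]

added = scatter_elements_2d(data, indices, updates, axis=1, reduction="add")
maxed = scatter_elements_2d(data, indices, updates, axis=1, reduction="max")
```

With `reduction="none"` and duplicate indices, this sequential loop lets the last write win; on a GPU the order is not defined, so the result for duplicates should not be relied upon.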
Baiju Meswani	465540d29b	Update training api python documentation (#19287 )	2024-01-29 14:14:15 -08:00
Dmitri Smirnov	7dd1f4b8e2	Pad-18 Cuda implementation (#19211 ) ### Description Implement Pad-18 for Cuda. ### Motivation and Context Latest models converted by Dynamo fall back on CPU for Pad with performance degradation. This contributes to https://github.com/microsoft/onnx-rewriter/issues/126	2024-01-24 18:12:04 -08:00
aciddelgado	cbb29d80ff	GQA Rotary and Packed QKV with Flash (#18906 ) ### Description These changes add rotary embedding and packed qkv input to gqa. As of now, the changes are only supported with Flash-Attention (SM >= 80) but should soon be supported with Memory Efficient Attention as well. ### Motivation and Context With the fusion of rotary embedding into this Attention op, we hope to observe some perf gain. The packed QKV should also provide some perf gain in the context of certain models, like Llama2, that would benefit from running ops on the fused QKV matrix, rather than the separate Q, K, and V. --------- Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>	2024-01-23 16:34:26 -08:00
petermcaughan	f53068446e	Add Temperature to WhisperBeamSearch input (#19188 ) ### Description <!-- Describe your changes. --> Add `temperature` as an input to WhisperBeamSearch op and initialize correctly in parameter setup. ### Motivation and Context Currently, temperature is included as an attribute to the BeamSearch op, which doesn't let the model act dynamically in a single inference session. By including this variable as an input, the temperature value can be altered in any inference call (important for 1P teams) --------- Co-authored-by: Peter McAughan <petermca@microsoft.com> Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com> Co-authored-by: Kunal Vaishnavi <kvaishnavi@microsoft.com>	2024-01-23 13:44:34 -08:00
Linnea May	24b74aebcb	[DML] Register DML operators for opset 19 (#16939 ) ### Description <!-- Describe your changes. --> Register DML operators for opset 19. - Cast19 - Castlike19 - Constant19 - Equal19 - Identity19 - QuantizeLinear19 - DequantizeLinear19 - Reshape19 - Shape19 - Size ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> --------- Co-authored-by: linnealovespie <linneamay@microsoft.com>	2024-01-22 15:37:09 -08:00
Ye Wang	21034a2c37	phi2 contrib ops changes (#19112 ) ### Description <!-- Describe your changes. --> 1. support causal mask in MHA cpu 2. support custom rotary_dim in rotary_emb 3. add bf16 for rotary_emb 4. fix a bug in attention rotary ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-01-22 10:17:11 -08:00
Xavier Dupré	eaf047c820	Increment year to 2024 in conf.py (python documentation) (#19107 ) ### Description Update copyright in python documentation.	2024-01-19 19:36:19 +01:00
Rachel Guo	bd9d8fb2a5	[ORT 1.17.0 release] Bump up version to 1.18.0 (#19170 ) ### Description <!-- Describe your changes. --> Bump up version to 1.18.0 since the release branch has been cut. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>	2024-01-17 11:18:32 -08:00
pengwa	1150b1f81e	ORTModule memory improvement (#18924 ) ## Dependency https://github.com/microsoft/onnxruntime/pull/19007 ## ORTModule memory efficient gradient management I previously tried to solve the coarse-grained gradient accumulation/update problem in ORTModule with https://github.com/microsoft/onnxruntime/pull/8979, but that solution was not fully validated with DDP or when user hooks are registered on the gradient accumulation of a torch parameter. This PR addresses the problem with an approach similar to PR 8979, i.e. it triggers gradient accumulation as soon as ORT has computed the grad, but instead of using an AccumulateGrad op it uses an ONNX operator, PythonOp, which internally calls param.backward(grad) and therefore handles all related hooks correctly. ## Design Check the details from https://microsoftapc-my.sharepoint.com/:p:/g/personal/pengwa_microsoft_com/EaaBq4EzsFhOmsDEXCG7Ba4Bb9bwd0O2sFV_JXJ4jBLYLA?e=7Sz2g8&nav=eyJzSWQiOjI3MSwiY0lkIjozMjE4NzI1NDIzfQ ## Convergence Validation: ![image](https://github.com/microsoft/onnxruntime/assets/10530022/ccf3a213-e815-4b23-b759-165033b2d9fe) Differences are mostly on the order of 0.000x, sometimes 0.00x, which may come from the different order in which gradients are applied before and after this change (on DeepSpeed ZeRO stage 2). ## TODO Consolidate the logic with Stage3's similar logic.	2024-01-16 08:57:37 +08:00
Aditya Goel	dcd6d4cad6	Label encoder opset4 (#17977 ) ### Description <!-- Describe your changes. --> Implements LabelEncoder as per `ai.onnx.ml` opset 4 for the upcoming ONNX 1.15 release. ~~This currently depends on a new ONNX release candidate and so is marked as draft in the meantime.~~ ### Motivation and Context Closes https://github.com/microsoft/onnxruntime/issues/17602	2024-01-12 12:43:44 -08:00
Aditya Goel	c23410a182	StringSplit operator (#18016 ) ### Description <!-- Describe your changes. --> ### Motivation and Context Closes https://github.com/microsoft/onnxruntime/issues/17596	2024-01-12 09:46:23 -08:00
Chi Lo	46dd0d3f52	[TensorRT EP] Load precompiled TRT engine file directly (#18217 ) When the TRT engine cache (precompiled engine) is present, it doesn't make sense to go through model verification, model optimization, TRT EP's GetCapability(), TRT EP's model proto reconstruction, calling the TRT parser and engine compilation. This PR makes TRT EP skip those steps and directly load the engine to perform inference. The feature request: https://github.com/microsoft/onnxruntime/issues/18072 Features: - Replace the original model with a TRT engine wrapped ONNX model. This can save a lot of time, as mentioned above. - How to get the TRT engine wrapped ONNX model? 1. Set the `trt_dump_ep_context_model` provider option to "true" and run the inference. You will find the "xxx_wrapper.onnx" at the engine cache path. (The same logic as generating the engine cache) 2. Use gen_trt_engine_wrapper_onnx_model.py - Three provider options are added: `trt_dump_ep_context_model`: Enable dumping the wrapped ONNX model by TRT EP. `trt_ep_context_embed_mode`: Add embed_mode as attribute. 0 means engine cache path, 1 means engine binary data. `trt_ep_context_compute_capability_enable`: Add hardware_arch as attribute. When running the model, TRT EP will check consistency between the model's hardware_arch and the GPU's compute capability. - When the engine cache path is given in the wrapped model, TRT EP will first search for the engine file using the path relative to the model path; if it can't find it, it will use the path as it is (depends on the user, could be relative to the working dir or an absolute path). Note: 1. This PR includes the change of https://github.com/microsoft/onnxruntime/pull/17751 Constraints: 1. The whole model should be fully supported by TRT. 2. Users need to make sure the engine is built with min/max/opt optimization profiles that are large enough to cover the range of all inputs. TRT EP will simply fail and won't rebuild the engine if the input shape is out of range during runtime.	2024-01-11 22:20:54 -08:00
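The three provider options named in the TensorRT EP commit above could be passed from Python roughly as follows. This is a sketch under stated assumptions: the option names are taken from the commit message, the helper `build_trt_providers` is hypothetical, and passing the values as strings is an assumption about how the options are accepted. The block only builds the `providers` argument and does not import `onnxruntime`.

```python
def build_trt_providers(dump_ep_context_model=True, embed_mode=0,
                        check_compute_capability=True):
    """Build a `providers` list for the TensorRT EP (hypothetical helper)."""
    if embed_mode not in (0, 1):
        # Per the commit message: 0 = engine cache path, 1 = engine binary data.
        raise ValueError("embed_mode must be 0 or 1")
    trt_options = {
        "trt_dump_ep_context_model": "true" if dump_ep_context_model else "false",
        "trt_ep_context_embed_mode": str(embed_mode),
        "trt_ep_context_compute_capability_enable":
            "true" if check_compute_capability else "false",
    }
    return [("TensorrtExecutionProvider", trt_options)]


providers = build_trt_providers(embed_mode=1)

# Usage (requires an onnxruntime build with the TensorRT EP; the model path
# is a placeholder):
#   import onnxruntime as ort
#   session = ort.InferenceSession("model.onnx", providers=providers)
```

Later ONNX Runtime releases may have renamed or removed some of these options, so the names should be checked against the TensorRT EP documentation for the version in use.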
Ye Wang	b6d82834d4	add bfp16 to gqa (#19095 ) ### Description <!-- Describe your changes. --> ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2024-01-11 20:53:31 -08:00
Aditya Goel	d8962d67f4	RegexFullMatch operator (#18002 ) ### Description <!-- Describe your changes. --> ### Motivation and Context Closes https://github.com/microsoft/onnxruntime/issues/17594.	2024-01-11 15:50:07 -08:00
Aditya Goel	4694edcd41	String concat operator (#17994 ) ### Description <!-- Describe your changes. --> ### Motivation and Context Closes https://github.com/microsoft/onnxruntime/issues/17595. --------- Signed-off-by: Aditya Goel <agoel4512@gmail.com>	2024-01-11 10:01:43 -08:00
liqun Fu	e10a8ae31f	reduce max/min 20 (#17805 ) ### Description reducemax/min have been updated in onnx(20). implement it in ort ### Motivation and Context this is for ort1.17.0 release --------- Signed-off-by: Liqun Fu <liqfu@microsoft.com>	2024-01-04 17:41:01 -08:00
Jeff Bloomfield	7401b6661d	Update OperatorKernels.md	2024-01-04 11:27:03 -08:00
Jeff Bloomfield	8ea3e68192	Update ContribOperators.md	2024-01-04 10:10:46 -08:00
liqun Fu	32fcf73740	Implement dft(20) (#17821 ) ### Description dft is updated in opset20. implement it in ort ### Motivation and Context this is for ort 1.17.0 release Fixes #17723 --------- Signed-off-by: Liqun Fu <liqfu@microsoft.com>	2023-12-19 10:42:54 -08:00
luoyu-intel	5f00bc9931	Integrate high-performance x64 gemm library to MLAS (#17669 ) ### Description Improve MLAS to support high-performance x64 INT4 kernels ### Motivation and Context 1. improve LLM inference performance on Intel CPUs. 2. support more 4bit quantization types: nf4, fp4 3. support dynamic block size: block size aligned with kernel's tiling size(e.g. 4 for VNNI kernel), per channel on N dimension 4. support most Intel ISAs: avx2, avx_vnni, avx512f, avx512_vnni, amx_bf16, amx_int8, avx512_fp16 5. support MatMulNBits' data format ### Tasks - [x] support block_size: 32, 128, -1(per channel) - [x] get weight pack size without memory allocation - [x] use ort's thread pool for parallelism - [x] support ISAs: avx2, avx512f, avx_vnni, avx512_vnni, amx_int8 ### Benchmark Ubuntu 20.22 + Intel(R) Xeon(R) Platinum 8480+ 56 cores Benchmark \| Time \| CPU \| Iterations -- \| -- \| -- \| -- Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:4096/Threads:56/real_time \| 47613 \| 47401 \| 12970 Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:4096/Threads:56/real_time \| 6347792 \| 6317562 \| 109 Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:4096/Threads:56/real_time \| 11814014 \| 11757847 \| 59 Q4GEMM_Jblas/Q4G128SymInt8/M:1/N:4096/K:4096/Threads:56/real_time \| 50222 \| 50031 \| 13759 Q4GEMM_Jblas/Q4G128SymInt8/M:1024/N:4096/K:4096/Threads:56/real_time \| 2038222 \| 2028743 \| 341 Q4GEMM_Jblas/Q4G128SymInt8/M:2048/N:4096/K:4096/Threads:56/real_time \| 3792832 \| 3774485 \| 191 Q4GEMM_Jblas/Q4GPerNSymInt8/M:1/N:4096/K:4096/Threads:56/real_time \| 58717 \| 58501 \| 11467 Q4GEMM_Jblas/Q4GPerNSymInt8/M:1024/N:4096/K:4096/Threads:56/real_time \| 1360846 \| 1354598 \| 543 Q4GEMM_Jblas/Q4GPerNSymInt8/M:2048/N:4096/K:4096/Threads:56/real_time \| 2564232 \| 2551365 \| 266 Q4GEMM_Jblas/Q4G32SymFp32/M:1/N:4096/K:4096/Threads:56/real_time \| 57929 \| 57694 \| 12047 Q4GEMM_Jblas/Q4G32SymFp32/M:1024/N:4096/K:4096/Threads:56/real_time \| 5495330 \| 5465810 \| 126 
Q4GEMM_Jblas/Q4G32SymFp32/M:2048/N:4096/K:4096/Threads:56/real_time \| 10676240 \| 10617817 \| 66 Q4GEMM_Jblas/Q4G128SymFp32/M:1/N:4096/K:4096/Threads:56/real_time \| 68305 \| 68047 \| 10026 Q4GEMM_Jblas/Q4G128SymFp32/M:1024/N:4096/K:4096/Threads:56/real_time \| 5504862 \| 5476215 \| 126 Q4GEMM_Jblas/Q4G128SymFp32/M:2048/N:4096/K:4096/Threads:56/real_time \| 11758623 \| 11697337 \| 66 Q4GEMM_Jblas/Q4GPerNSymFp32/M:1/N:4096/K:4096/Threads:56/real_time \| 67713 \| 67451 \| 10298 Q4GEMM_Jblas/Q4GPerNSymFp32/M:1024/N:4096/K:4096/Threads:56/real_time \| 5508325 \| 5480237 \| 126 Q4GEMM_Jblas/Q4GPerNSymFp32/M:2048/N:4096/K:4096/Threads:56/real_time \| 10738528 \| 10681656 \| 64 Q4GEMM_Jblas/Q4G32AsymFp32/M:1/N:4096/K:4096/Threads:56/real_time \| 60708 \| 60486 \| 11321 Q4GEMM_Jblas/Q4G32AsymFp32/M:1024/N:4096/K:4096/Threads:56/real_time \| 5523784 \| 5495736 \| 126 Q4GEMM_Jblas/Q4G32AsymFp32/M:2048/N:4096/K:4096/Threads:56/real_time \| 10829633 \| 10772161 \| 67 Reference: Benchmark \| Time \| CPU \| Iterations -- \| -- \| -- \| -- Q4GEMM/Q4Sym/M:1/N:4096/K:4096/Threads:56/real_time \| 53088 \| 52911 \| 13364 Q4GEMM/Q4Sym/M:1024/N:4096/K:4096/Threads:56/real_time \| 6268981 \| 6230335 \| 110 Q4GEMM/Q4Sym/M:2048/N:4096/K:4096/Threads:56/real_time \| 11701237 \| 11632339 \| 59 Win11+12900K 8 cores: Benchmark \| Time \| CPU \| Iterations -- \| -- \| -- \| -- Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:4096/Threads:8/real_time \| 215976 \| 211295 \| 2884 Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:4096/Threads:8/real_time \| 60960590 \| 60937500 \| 10 Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:4096/Threads:8/real_time \| 1.18E+08 \| 1.19E+08 \| 5 Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:11008/K:4096/Threads:8/real_time \| 470377 \| 453059 \| 1414 Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:11008/K:4096/Threads:8/real_time \| 1.54E+08 \| 1.53E+08 \| 5 Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:11008/K:4096/Threads:8/real_time \| 3.18E+08 \| 3.13E+08 \| 2 
Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:11008/Threads:8/real_time \| 569072 \| 559398 \| 1229 Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:11008/Threads:8/real_time \| 1.54E+08 \| 1.52E+08 \| 4 Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:11008/Threads:8/real_time \| 3.22E+08 \| 3.28E+08 \| 2 Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:11008/K:11008/Threads:8/real_time \| 1486055 \| 1473325 \| 403 Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:11008/K:11008/Threads:8/real_time \| 4.14E+08 \| 4.14E+08 \| 2 Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:11008/K:11008/Threads:8/real_time \| 8.88E+08 \| 8.59E+08 \| 1 --------- Signed-off-by: Mengni Wang <mengni.wang@intel.com> Co-authored-by: Mengni Wang <mengni.wang@intel.com>	2023-12-19 09:36:31 -08:00
pengwa	ccf3b2054b	Allow layer-wise recompute (#18566 ) ### Allow layer-wise recompute Previously, users/developers had to specify the subgraphs to recompute; now we introduce a more user-friendly way to enable recompute for all detected stashed activation recomputation subgraphs. This sacrifices getting the best configs but makes it easier to support users when they switch from PyTorch per-layer gradient checkpointing to ORTModule. `ORTMODULE_MEMORY_OPT_LEVEL` is introduced to control the usage. By default it is 0, i.e. `USER_SPECIFIED`: all subgraphs defined in `ORTMODULE_MEMORY_OPT_CONFIG` will be recomputed, so this is compatible with existing recompute usage in ORTModule integrated models. With `ORTMODULE_MEMORY_OPT_LEVEL=1`, all detected recompute plans are enabled, so the configs in `ORTMODULE_MEMORY_OPT_CONFIG` are no longer respected. Add unit tests using a 3-layer BLOOM model. https://github.com/microsoft/onnxruntime/blob/pengwa/add_aggresive_recompute/docs/Memory_Optimizer.md	2023-12-12 08:44:05 +08:00
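The two modes described in the layer-wise recompute commit above can be selected through environment variables. The following is a minimal sketch: the variable names and the meaning of levels 0 and 1 come from the commit message, while the helper `configure_recompute` is hypothetical, and the assumption that the variables must be set before the model is wrapped in `ORTModule` is mine.

```python
import os


def configure_recompute(level=0, user_config=None):
    """Set the ORTModule recompute environment variables (hypothetical helper)."""
    if level not in (0, 1):
        raise ValueError("only levels 0 and 1 are described in the commit message")
    os.environ["ORTMODULE_MEMORY_OPT_LEVEL"] = str(level)
    if level == 0 and user_config:
        # Level 0 (USER_SPECIFIED): only the listed subgraphs are recomputed.
        os.environ["ORTMODULE_MEMORY_OPT_CONFIG"] = user_config
    else:
        # Level 1 enables every detected plan and ignores the user config,
        # so drop it here to avoid confusion.
        os.environ.pop("ORTMODULE_MEMORY_OPT_CONFIG", None)


# User-specified plans (the config string is copied from a log in this history).
configure_recompute(level=0, user_config="BiasGelu+:1:-1,Cast+:2:-1")
level0_config = os.environ.get("ORTMODULE_MEMORY_OPT_CONFIG")

# Layer-wise recompute of everything that was detected.
configure_recompute(level=1)
```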
Xavier Dupré	d41dd77241	Extend API page on the python documentation (#18762 )	2023-12-09 15:33:57 -08:00
Hector Li	9768a727e1	[QNN EP] Fix a bug that can't create context binary if the model has inputs/outputs with different data type (#18722 ) Fix a bug that can't create context binary if the model has inputs/outputs with different data type ### Description Update EPContext op schema to unblock nodes with different data type among inputs & outputs	2023-12-06 13:07:09 -08:00
pengwa	4bfa84487c	Skip module clone for preparing large model export (#18663 ) ### Skip module clone for preparing large model export For LLAMA2 13B, when running with LoRA and DeepSpeed stage 2 on 8 GPUs, it failed while preparing the outputs that are used for torch.onnx.export. The reason is that we deep copy all the params, including both the large frozen weights and the small LoRA trainable weights. This PR first checks whether there is enough GPU memory for a cloned module and, if not, skips the copy. The module is copied to guard against the forward run changing the weights, but that case should be rare. For now, Not-Able-To-Run is worse than Runnable-with-A-little-bit-different-initial-weight, especially for large models.	2023-12-05 12:41:17 -08:00
Vincent Wang	e1d1033131	[ORTModule] Remove Unused Arguments from Generated Triton Code (#18636 ) This PR: - Remove unused arguments from generated triton code, - Remove unnecessary mask for symbolic shape case from generated triton code. - Add doc for usage of ORTMODULE_TRITON_CONFIG_FILE.	2023-11-30 18:32:36 +08:00
Dmitri Smirnov	d2dfbf4179	Add float16 type support to SplitToSequence and make code type independent (#18594 ) ### Description Add support for `float16` type to address the below issue. Re-work the code to make it type independent. This reduces binary size by ~11 K. ![image](https://github.com/microsoft/onnxruntime/assets/11303988/1a77c7bc-34a8-478c-a16a-abd94062c6c6) ### Motivation and Context This PR addresses https://github.com/microsoft/onnxruntime/issues/18481	2023-11-29 10:44:59 -08:00
pengwa	43a5147e01	Memory optimization refactor and refinement (#17481 ) ### Memory optimization refactor and refinement Currently the memory optimizer runs graph transformations and prints recompute opportunities at INFO level, but the ORT backend emits so many INFO-level logs that users have a hard time finding that information. So we are looking for a Python binding API to retrieve the memory optimization opportunities instead of depending on the MemoryOptimizer's default logging. Then we can print ORTModule feature statistics using this information. Also, with such an API, we can run the analysis after an ORT session has been created, where the allocation plan is done, so the analysis considers buffer reuse as well. This avoids suggesting recomputation subgraphs that reuse other subgraphs' output buffers. Check https://github.com/microsoft/onnxruntime/blob/pengwa/add_devinfo_level/docs/Memory_Optimizer.md for the new flow using `MemoryOptimizer`. This pull request makes the following refactorings: 1. Print the log in the ORTModule Python script, along with the ORTModule feature enabling stats. This is implemented by exposing an API `get_serialized_ortmodule_memory_stat` to retrieve the memory optimization opportunities. 2. Analyze memory optimization opportunities with ORT memory planning taken into account. This is done by first creating the execution graph without enabling MemoryOptimizer, then calling `execution_agent.get_serialized_ortmodule_memory_stat`, which internally considers the session memory allocation planner when analyzing memory optimization opportunities. As a direct result, the memory optimization opportunities can show those stashed activations that are reusing other buffers. 3. Move recompute analysis logic from memory_optimizer.h/cc to recompute_analysis.h/cc. 4. Abstract optimization strategies for their own implementation. This will make introducing new strategies (for example compression and decompression) easier. 
New logging matrix (INFO Level), in WARNING level, the details will NOT show. ``` 2023-09-13 13:25:09,249 orttraining.rank-0 [WARNING] - *** ONNX Runtime Training (ORTModule) is accelerating your model *** ORTModule is enabled with following features ON/OFF for [training] mode: ATen Executor : ON : Dispatch ATen operators to ORT's ATen executor Cast Propagation : ON : Level 1 enabled Custom Function : ON : Support custom torch.autograd.Function export and execution Memory Optimizer : ON : RecomputeConfig: Reshape+Where+BiasSoftmax+:1:-1,Cast+:1:-1, ProbeLevel: 1, available configs: Config Freq Saving(B) Saving Symbolic(Bytes) - Plan 1 : ON : Reshape+Where+BiasSoftmax+:1:-1 5 671,088,640 640.0inputs_input_ids_dim0inputs_input_ids_dim1*2 - Plan 2 : ON : Cast+:1:-1 6 402,587,648 inputs_input_ids_dim0inputs_input_ids_dim1(384.0inputs_input_ids_dim1 - 64.0) - Plan 3 : OFF : Reshape+Where+:1:-1 1 134,217,728 128.0inputs_input_ids_dim0inputs_input_ids_dim1*2 - Plan 4 : OFF : BiasSoftmax+:1:-1 1 134,086,656 128.0inputs_input_ids_dim0inputs_input_ids_dim1(inputs_input_ids_dim1 - 1) - Plan 5 : OFF : BiasGelu+:1:-1 6 125,808,640 inputs_input_ids_dim0(122880.0inputs_input_ids_dim1 - 20480.0) - Plan 6 : OFF : FusedMatMul+:1:-1 6 125,808,640 inputs_input_ids_dim0(122880.0inputs_input_ids_dim1 - 20480.0) - Plan 7 : OFF : FusedMatMul+Add+FusedMatMul+Add+Add+Add+:1:-1 5 26,214,400 25600.0inputs_input_ids_dim0inputs_input_ids_dim1 - Plan 8 : OFF : Add+:1:-1 1 5,237,760 5120.0inputs_input_ids_dim0(inputs_input_ids_dim1 - 1) - Plan 9 : OFF : Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+:1:-1 1 4,096 4.0inputs_input_ids_dim0inputs_input_ids_dim1 - Plan 10 : OFF : Cast+:2:-1 1 2,048 2.0inputs_input_ids_dim0inputs_input_ids_dim1 Compute Optimizer : ON : Enable/Disable with env ORTMODULE_ENABLE_COMPUTE_OPTIMIZER=1/0 - FLOPReduction : ON : Reduce FLOPs by upstreaming shrinking-sized ops Auto Fallback : ON : Fallback to PyTorch when encountering unsupported ops TritonOp Enabled : OFF : ORT 
will switch to Triton for executing some ops to further accelerate training. ZeRO Stage3 Support : OFF : Enable/Disable with env ORTMODULE_ENABLE_ZERO_STAGE3=1/0 Total ORT initialization overhead is 10.73s where export takes 8.39s. Other overhead details: graph builder init takes 0.06s, runtime detection takes 0.01s, graph building takes 0.31s, session creation takes 1.96s Versions: ONNX Runtime - 1.16.0+cu118, ONNX - 1.11.0 Note 1: use comma to enable multiple plans at the same time. export ORTMODULE_MEMORY_OPT_CONFIG=<plan1 config>,<plan2 config>,... Note 2: saving is calculated based on the 1st batch symbolic dim values: inputs_input_ids_dim0=1, inputs_input_ids_dim1=1024, inputs_attention_mask_dim0=1, inputs_attention_mask_dim1=1024, inputs_labels_dim0=1, inputs_labels_dim1=1024, ************************************************************************ ``` If DEVINFO level is enabled, then more details about the memory optimizations are printed. ``` MemoryInsight Summary - User config: BiasGelu+:1:-1,Cast+:2:-1 ========================================================================================================================================== \|Freq \| Memory Optimization Opportunities (Clustered by node-level activation patterns) \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|3 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph FusedMatMul+Add+Reshape+ \| \| \| Status : Disabled. 
Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+Add+Reshape+:1:-1 \| \| \| Stashed Activations: \| \| \| - ReuseFreq : Output 0(3), \| \| \| - Output 0 : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 32 x 240 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Reshape+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+:1:-1 \| \| \| Stashed Activations: \| \| \| - ReuseFreq : Output 0(2), \| \| \| - Output 0 : [ x 2560 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph FusedMatMul+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 10240 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Cast+ \| \| \| Status : Disabled. 
Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Cast+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Reshape+Where+BiasSoftmax+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Where+BiasSoftmax+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph BiasGelu+ \| \| \| Status : Enabled, requested count=-1, actual applied count=2 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 10240 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph FusedMatMul+Add+FusedMatMul+Add+Add+Add+ \| \| \| Status : Disabled. 
Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+Add+FusedMatMul+Add+Add+Add+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 2560 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Reshape+Where+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Where+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph FusedMatMul+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0(inputs_input_ids_dim1 - 1) x 10240 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Cast+ \| \| \| Status : Disabled. 
Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Cast+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 - 1 x inputs_input_ids_dim1 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 1 x 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved \| \| \| \| \| \|>>Option 2 : RecomputeWithCompromise subgraph Cast+ \| \| \| Status : Enabled, requested count=-1, actual applied count=1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 1 x 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 50% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph BiasSoftmax+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=BiasSoftmax+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 - 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. 
\| \| \| \| \| \|>>Option 1 : Recompute subgraph BiasGelu+ \| \| \| Status : Enabled, requested count=-1, actual applied count=1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0(inputs_input_ids_dim1 - 1) x 10240 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Add+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Add+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0(inputs_input_ids_dim1 - 1) x 2560 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| ========================================================================================================================================== Note: use comma as a separator for enabling more than one subgraphs. *********************************************************************** ``` ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-11-23 11:39:00 +08:00
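The `ORTMODULE_MEMORY_OPT_CONFIG` strings that appear in the logs above (for example `BiasGelu+:1:-1,Cast+:2:-1`) follow a comma-separated `pattern:number:number` shape. The sketch below splits such a string into its parts. The field interpretation is inferred from the log output only and should be treated as an assumption: the second field looks like a strategy code (1 appears next to "Recompute", 2 next to "RecomputeWithCompromise") and the third like the requested apply count (-1 is shown as "requested count=-1"). The names `PlanConfig` and `parse_memory_opt_config` are hypothetical.

```python
from typing import NamedTuple


class PlanConfig(NamedTuple):
    subgraph: str   # node-type pattern such as "Reshape+Where+BiasSoftmax+"
    strategy: int   # assumed: 1 = Recompute, 2 = RecomputeWithCompromise
    count: int      # assumed: requested apply count, -1 meaning "all"


def parse_memory_opt_config(value):
    """Split an ORTMODULE_MEMORY_OPT_CONFIG-style string into plan tuples."""
    plans = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        # Split from the right so the pattern keeps any characters it contains.
        subgraph, strategy, count = item.rsplit(":", 2)
        plans.append(PlanConfig(subgraph, int(strategy), int(count)))
    return plans


plans = parse_memory_opt_config("BiasGelu+:1:-1,Cast+:2:-1")
```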
Vincent Wang	3bc9efc7b2	[ORTModule] Adjust Attention Patterns for Efficient Attention ATen Fallback (#18471 ) Adjust attention patterns to match latest Whisper+exporter. Also add some condition check and add docs.	2023-11-22 15:24:05 +08:00
Jambay Kinley	1af0681554	Bfloat16 support for MatMulBnb4, Training support bitsandbytes>=0.41.2 (#18484 ) ### Description <!-- Describe your changes. --> Add bfloat16 support for `MatMulBnb4` contrib op. This is useful for QLoRA fine-tuning. - On GPUs with SM80+ (A100, etc), it uses the native cuda bfloat16 dtype, `nv_bfloat16`. On other GPUs, it uses the onnxruntime `BFloat16` type which uses float for compute. - I have validated the op in a llama2-7b training scenario. The losses match pytorch training and the training throughput is better. - Cannot add a bfloat16 case in the op unit test since casting BFloat16 to and from float multiple times during the test causes the required tolerances to be unachievable. The custom autograd function exporter in onnxruntime-training is updated to support the latest version of bitsandbytes. They changed how the `quant_state` is stored. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Enable QLoRA fine-tuning with bfloat16.	2023-11-20 09:52:58 -08:00
kailums	1a29460919	rope support 4D input tensor (#18454 ) ### Description Change the RotaryEmbedding op implementation to add support for a 4D input tensor with shape [batch, num_heads, seq_len, head_size]. ### Motivation and Context The current RotaryEmbedding op only supports a 3D input tensor with shape [batch, seq_len, hidden_size]. For the LLaMA v2 model, when FusionRotaryEmbeddings is used to fuse only the RotaryEmbedding op, query and key are transposed first, so the input tensor of RotaryEmbedding becomes 4D [batch, num_heads, seq_len, head_size]. The current implementation cannot handle this case, so it needs to support 4D input tensors.	2023-11-17 20:38:15 +08:00
aciddelgado	adb56df2e8	Aciddelgado/gqa local (#18375 ) ### Description Implement a preliminary version of local (sliding window) attention. Currently only supported by Flash Attention (sm >= 80, Linux). Currently only supports sliding attention with a large cached kv. ### Motivation and Context This change enables running Mistral and other models that use sliding window attention.	2023-11-16 15:01:06 -08:00
Ye Wang	f9af94009b	onboard MoE (#18279 ) ### Description 1. Introduce MoE CUDA op to ORT based on FT implementation. 2. Upgrade cutlass to 3.1.0 to avoid some build failures on Windows. Remove patch file for cutlass 3.0.0. 3. Sharded MoE implementation will come in another PR. Limitation: __CUDA_ARCH__ >= 700.	2023-11-14 16:48:51 -08:00
Prathik Rao	7a3da4526f	add bfloat16 support for CUDA Neg kernel (#18306 ) ### Description Registers BFloat16 datatype as valid input type for CUDA Neg Kernel. ### Motivation and Context Enabling `meta-llama/Llama-2-70b` to be finetuned with ONNX Runtime training. --------- Co-authored-by: Prathik Rao <prathikrao@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>	2023-11-08 18:32:12 -08:00
pengwa	2151c79bf1	Tune ORTModule logging experience a bit (#18298 ) ### Tune logging experience a bit After the last update to the ORTModule logging experience, we found a few issues: 1. The `INFO` level outputs too many things, including PyTorch exporter verbose logs (tracing graphs) on every rank. At this level, we only want to - Output a little more information to users than the `WARNING` level, for example the memory recomputation recommendations or other not-fully-ready features. - Output a little more information for a quick diagnostic, collected on rank-0 only. 2. The ONNX Runtime logging filter during graph build and session init will sometimes hide issues (for example a segmentation fault), leaving no useful information in `WARNING`/`INFO` for users to report to us. This is not good! 3. Some of our devs like using `pdb` to debug Python code, but adding `import pdb; pdb.set_trace()` to a model's code might hang when they use `INFO` or `WARNING`, where the export happens and all output gets redirected due to log filtering. The only workaround is to switch to VERBOSE, which outputs far too many logs. The corresponding changes proposed here are: 1. For `INFO` logging, - We only log on rank-0. - We restrict the ORT backend logging level to WARNING in this case, because ORT backend code outputs way too many logs that should be under verbose, and we cannot guarantee they get cleaned up immediately once they are added. - We output the PyTorch exporter verbose log (including the tracing graph), which is useful for a quick diagnostic when an issue happens. 2. Remove all logging filtering on the ORT backend, so that segmentation fault details will not be hidden if one happens again. 3. Introduce a `DEVINFO` logging level, - Log on all ranks - Set the ORT backend logging level to INFO - PyTorch exporter logging filtering is turned OFF (to unblock pdb debugging). 4. Currently, using Memory Optimizer requires DEVINFO (which will output the ORT backend INFO log). 
So update the memory optimizer document to reflect this. https://github.com/microsoft/onnxruntime/pull/17481 will change the requirement back to INFO for showing memory optimization information. You can check https://github.com/microsoft/onnxruntime/blob/pengwa/devinfo_level/docs/ORTModule_Training_Guidelines.md#log-level-explanations for a better view of the different log levels. This PR also extracts some changes from a bigger one, https://github.com/microsoft/onnxruntime/pull/17481, to reduce its complexity for review. --------- Co-authored-by: mindest <30493312+mindest@users.noreply.github.com>	2023-11-08 17:42:50 +08:00
aciddelgado	3dece27f51	GQA Flash Attention with Attention Mask (#18283 ) ### Description GQA now only works with Flash Attention with Attention Mask input, allowing for batched input. Note: This PR Disables Memory Efficient Attention, only allowing Flash Attention kernel to be used. ### Motivation and Context Allows GQA to work with batched input. --------- Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>	2023-11-07 17:47:51 -08:00
liqun Fu	6127dd1d2d	implement gridsample 20 (#17744 )	2023-11-07 10:42:41 -08:00
Patrice Vignola	800ae7742c	[DML EP] Add RotaryEmbedding (#18158 ) This is a graph implementation of RotaryEmbedding since there's no time to add it to DML before 1.16.2, but it eventually should move into DirectML since we're bandwidth-bound.	2023-11-07 08:26:11 -08:00
Prathik Rao	8978bdc59d	add bfloat16 support for where operator (#18118 ) ### Description Adds bfloat16 as a valid input parameter type for where node for ONNX opset 16+. ### Motivation and Context Enabling `meta-llama/Llama-2-70b` to be finetuned with ONNX Runtime training. --------- Co-authored-by: Prathik Rao <prathikrao@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>	2023-11-02 12:23:20 -07:00
pengwa	c8e1038eab	Optimize 4bit Qlora training (#18131 ) ### Optimize 4bit Qlora training Extend the existing `MatmulBnb4bit` for use in training scenarios. The PR includes the following changes: 1. Add special `torch.autograd.Function` export logic for `bitsandbytes.autograd._functions.MatMul4Bit`, which takes precedence over the common PythonOp exporter. 2. Add an optional `training_mode` attribute for op `MatmulBnb4bit`, which helps skip some inference-specific logic in the implementation. 3. Add an optional `transB` attribute, which defaults to 1; setting it to 0 is needed for backward usage. Changing from `PythonOp` to this `MatmulBnb4bit` brings roughly 2.9% throughput gains. The reason is: `bitsandbytes.autograd._functions.MatMul4Bit` calls `ctx.save_for_backward`, which would need an additional copy in PythonOp; otherwise, the tensor might be released by ORT while the backward op still references it. Removing the clones also reduces peak memory consumption, because `bitsandbytes.autograd._functions.MatMul4Bit` saved tensors that are not needed in the backward compute.	2023-11-02 09:46:11 -07:00
aciddelgado	178f7caaeb	GQA Memory Efficient Kernel (#17920 ) Implement Cutlass Memory Efficient Attention Kernel into Group Query Attention Operator. ### Motivation and Context Before this change, Group Query Attention Operator was supported only by Flash-Attention. While this is the most efficient kernel for the operation, it only supports sm >= 80. Cutlass Memory Efficient Attention Kernel supports sm >= 53, allowing us to support a broader range of GPU hardware.	2023-11-01 20:04:22 -07:00
Preetha Veeramalai	d87216bcb1	Openvino ep ort 23.1 (#17911 ) ### Description Integration to OpenVINO 2023.1 ### Motivation and Context - Alignment with latest OpenVINO Version. - Device name change from VPUX to NPU and Remove from supported list until official public support is available. --------- Co-authored-by: Sahar Fatima <sfatima.3001@gmail.com> Co-authored-by: Saurabh Kale <saurabh1.kale@intel.com> Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com> Co-authored-by: sfatimar <sahar.fatima@intel.com>	2023-11-01 08:39:39 -07:00
Tianlei Wu	95f053c652	[CUDA] Update GroupNorm and Add SkipGroupNorm (#18091 ) * Add a new operator SkipGroupNorm to support skip and bias inputs. * Update GroupNorm kernel to support the number of channels used in SD XL Refiner. * Add epsilon in kernel * Add parity and performance test script * Remove many limitations including max batch size, max number of groups, c % cPerBlock == 0 etc. ### Motivation and Context Update GroupNorm to support SD XL Refiner and beyond.	2023-10-31 10:27:20 -07:00
Xavier Dupré	b5f242e978	GemmFloat8 as a contrib ops (#16051 ) ### Description Add support for Gemm with float 8 as a contrib op. --------- Co-authored-by: Randy Shuai <rashuai@microsoft.com> Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com> Co-authored-by: Scott McKay <Scott.McKay@microsoft.com> Co-authored-by: Xavier Dupre <xadupre@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>	2023-10-27 14:33:55 +02:00
Tang, Cheng	37873be86d	enable reduce ops on opset18 (#18053 ) ### Description Opset 18 applies the "axes as input" change from ReduceSum to all the other reduce ops. Our CUDA kernel actually supports it, but we didn't enable it for opset 18. This PR updates the reduce ops' kernel registration to enable the "axes as input" behavior for opset 18. As part of the fix, I also simplified the reduce op kernel registration. ORT doesn't require the kernel definition to be exactly the same as the ONNX op definition. In our case, where we share the same kernel for all the reduce ops (from version 1 to version 18), we don't need to maintain different versions of the kernel definition; we can simplify by using a single kernel definition for multiple versions. In some cases we might register more types for legacy versions, but this is harmless: the framework uses the schema to validate the graph, not the kernel definition. --------- Co-authored-by: Cheng Tang <chenta@a100.crj0ad2y1kku1j4yxl4sj10o4e.gx.internal.cloudapp.net> Co-authored-by: Cheng Tang <chenta@microsoft.com>	2023-10-26 16:57:21 -07:00
Jambay Kinley	d30d4d372a	Add MatMul FP4 and NF4 Support (#18066 ) ### Description Add a contrib op MatMulBnb4 (FP4 and NF4) and related toolchain to support quantization on weight. This PR adds: - schema for contrib op MatMulBnb4 which can support FP4 (4-bit floating point) and NF4 (4-bit NormalFloat) quantization on weight. - a naive implementation for MatMulBnb4 on CPU and GPU, i.e., implemented like MatMul(A, Dequantize(B)). - a special implementation for GemV for MatMulBnb4 and related benchmark tool. - tool to quantize model to FP4 or NF4.	2023-10-25 15:34:58 -07:00
liqun Fu	706e13e0c9	implement affinegrid cpu kernel (#17777 )	2023-10-25 10:46:04 -07:00
liqun Fu	efa0cc2562	implement isinf20 and isnan20 (#17874 )	2023-10-24 10:58:54 -07:00
kunal-vaishnavi	2a17d5cf32	LLaMA Model Optimization (#18021 ) ### Description This PR contains fusion-level and kernel-level optimizations for [Meta's LLaMA-2](https://blogs.microsoft.com/blog/2023/07/18/microsoft-and-meta-expand-their-ai-partnership-with-llama-2-on-azure-and-windows/). Some of the added optimizations include: - SimplifiedLayerNorm changes - Fusions for multiple variants - SkipSimplifiedLayerNorm changes - Kernel support for CPU - Rotary embeddings (previously did not exist) - Fusions for multiple variants - CPU and CUDA kernels - Supports interleaving and non-interleaving in the same kernels - Optimized cache that requires half of its originally exported sizes - Reduced from `(max_sequence_length, head_size)` to `(max_sequence_length, head_size / 2)` - Multi-head attention - Support for 2D and 3D attention masks - Group query attention (for FP16 CUDA and INT4 CUDA) - Integration with flash attention v2 and past-present buffer sharing - Removes need for `attention_mask` input as it is supported in the kernel - 4 bit quantization - `block_size` parameter is available for customizing - Support the new changes for [Microsoft version](https://github.com/microsoft/Llama-2-Onnx) - Support combinations of the below variants (ex: export ORT version and run with Optimum) Supported variants of LLaMA-2 include: - [ORT version](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/python/tools/transformers/models/llama) - Produces one ONNX file that is already optimized (and quantized if requested) - Integrates with Optimum - [Another Microsoft version](https://github.com/microsoft/Llama-2-Onnx) - Already exported and available off-the-shelf - Faster versions of those models will be uploaded there soon - [Hugging Face version](https://huggingface.co/meta-llama) - Models that end with `-hf` - Some older and current versions of [`transformers`](https://github.com/huggingface/transformers) and [`optimum`](https://github.com/huggingface/optimum) that 
export the model to ONNX differently - Note that while some older versions are supported, it is recommended to use the latest package versions. ### Usage To use the optimizations, please see `README.md` for details. Please note the various `requirements.txt` files for the package versions recommended in order to use these changes. To run the ORT transformer optimizer separately, run the script as follows: ``` $ cd onnxruntime/onnxruntime/python/tools/transformers/ $ python3 optimizer.py --input <filename>.onnx --output <filename>.onnx --model_type gpt2 --num_heads <number of attention heads> --hidden_size <attention hidden size> --use_external_data_format --opt_level 0 ``` ### Motivation and Context This PR helps the following issues: - https://github.com/microsoft/onnxruntime/issues/14997 - https://github.com/microsoft/onnxruntime/issues/16254 - https://github.com/microsoft/onnxruntime/issues/17681 - https://github.com/microsoft/onnxruntime/issues/17925 - https://github.com/microsoft/onnxruntime-inference-examples/issues/320 This PR uses changes from the following PRs: - https://github.com/pytorch/pytorch/pull/104468 - https://github.com/pytorch/pytorch/pull/109759 - https://github.com/microsoft/onnxruntime/pull/17020 - https://github.com/microsoft/onnxruntime/pull/17674 - https://github.com/microsoft/onnxruntime/pull/17890 - https://github.com/microsoft/onnxruntime/pull/17920 - https://github.com/huggingface/transformers/pull/26162 - https://github.com/huggingface/optimum/pull/1257 - https://github.com/huggingface/optimum/pull/1289 - https://github.com/huggingface/optimum/pull/1462 ### New TorchDynamo Exporter (experimental stage) This PR uses changes from the following issues and PRs to begin supporting the [new TorchDynamo exporter](https://pytorch.org/docs/stable/onnx.html#torchdynamo-based-onnx-exporter): - https://github.com/huggingface/transformers/pull/26307 - https://github.com/pytorch/pytorch/issues/104903 - https://github.com/pytorch/pytorch/pull/105040 
- https://github.com/microsoft/onnxscript/pull/847 - https://github.com/microsoft/onnxscript/pull/862 - https://github.com/microsoft/onnxscript/issues/493	2023-10-23 13:00:56 -07:00
Yufeng Li	11af34440a	Add MatMul 4bits support on GPU (#17890 ) ### Description Add a contrib op MatMulNBits and related toolchain to support quantization on weight. This PR only adds support for 4 bits. It adds: - a schema for contrib op MatMulNBits, which can support 1-7 bit quantization on weight. - a naive implementation for 4-bit MatMulNBits on CPU and GPU, i.e., implemented like MatMul(A, Dequantize(B)). - a special implementation for GemV for 4-bit MatMulNBits and a related benchmark tool. - a tool to quantize a model to 4 bits. Next: - add general and more efficient kernels for 4-bit MatMulNBits on CPU and GPU	2023-10-13 16:55:30 -07:00
Zhang Lei	762703e037	Support output cross qk, dtw and more for whisper model (#17500 ) Support cross QK in beam search for the Whisper model and related features. Make the Whisper export tools support cross QK and some related features: * extra_decoding_ids * no_speech_prob Implement a DTW kernel and an unfold tensor kernel with unit tests. Several fixes related to multiple sessions running in parallel, such as: * guard multihead_attention, fused_fp16_runner_ * some memory allocation with stream awareness * add use_ep_level_unified_stream option	2023-10-13 11:47:15 -07:00
pengwa	63dc5dc1a9	Add document for PythonOp (#17888 ) ### Add document for PythonOp https://github.com/microsoft/onnxruntime/blob/pengwa/pythonop_doc/docs/ORTModule_PythonOp_Notes.md	2023-10-12 08:36:22 +08:00
aciddelgado	406cd324e0	[CUDA] GroupQueryAttention operator using FlashAttention (#17674 ) ### Description Added Group Query Attention op, supporting integer multiple number of heads for Q / KV. As of now, this op can only use FlashAttention kernel, meaning it only supports sm>=80 on Linux. Results from onnxruntime/test/python/transformers/benchmark_gqa.py show an on-average ~37% speed-up over Decoder Masked Multi-Head Attention, with even greater improvements for long past sequence lengths. ``` op batch s_kv heads h_dim ms TFLOPS gqa 16 2048 8 32 0.34 0.10 dmmha 16 2048 8 32 0.39 0.09 --------- gqa 16 2048 8 64 0.45 0.15 dmmha 16 2048 8 64 0.61 0.11 --------- gqa 16 2048 8 128 0.54 0.25 dmmha 16 2048 8 128 0.83 0.16 --------- gqa 16 2048 16 32 0.45 0.15 dmmha 16 2048 16 32 0.69 0.10 --------- gqa 16 2048 16 64 0.69 0.19 dmmha 16 2048 16 64 0.83 0.16 --------- gqa 16 2048 16 128 0.71 0.38 dmmha 16 2048 16 128 1.28 0.21 --------- gqa 16 2048 32 32 0.58 0.23 dmmha 16 2048 32 32 0.77 0.17 --------- gqa 16 2048 32 64 0.58 0.46 dmmha 16 2048 32 64 1.25 0.21 --------- gqa 16 2048 32 128 0.76 0.71 dmmha 16 2048 32 128 2.15 0.25 --------- gqa 16 2048 64 32 0.68 0.39 dmmha 16 2048 64 32 1.23 0.22 --------- gqa 16 2048 64 64 0.77 0.70 dmmha 16 2048 64 64 2.11 0.25 --------- gqa 16 2048 64 128 1.10 0.97 dmmha 16 2048 64 128 4.06 0.26 --------- gqa 16 2048 128 32 1.00 0.54 dmmha 16 2048 128 32 2.09 0.26 --------- gqa 16 2048 128 64 1.10 0.97 dmmha 16 2048 128 64 4.08 0.26 ``` ### Motivation and Context As of now, this op is targeted for use on LLama models, as it supports kv-caching and different number of heads for Q and KV (Grouped Query Attention). We plan to add support for more platforms, input formats, etc. in the future. --------- Co-authored-by: Tianlei Wu <tlwu@microsoft.com> Co-authored-by: tlwu@microsoft.com <tlwu@a100.crj0ad2y1kku1j4yxl4sj10o4e.gx.internal.cloudapp.net>	2023-10-09 12:43:12 -07:00
kyoshisuki	ba72bb6f98	Fix a typo in ABI_Dev_Notes.md (#17832 )	2023-10-09 07:51:34 -07:00
Hector Li	385fab5bae	[QNN EP] Qnn cache improvement (#17757 ) ### Description Improve the QNN context binary cache feature to reduce the memory overhead and initialization time overhead. Instead of dumping a QNN context binary file with metadata as a header, we dump an ONNX format file with metadata inside an ONNX node. ### Motivation and Context Reduce the memory overhead and initialization time overhead.	2023-10-06 15:56:33 -07:00
liqun Fu	2be4dc6d04	ONNX 1.15 integration (#17125 ) ### Description This is for ORT 1.17.0: make ORT use the ONNX release 1.15.0 branch. It will eventually be updated to the release tag once ONNX 1.15.0 is released. ### Motivation and Context Prepare for the ORT 1.17.0 release. People can start work on new and updated ONNX ops in ORT. --------- Signed-off-by: Liqun Fu <liqfu@microsoft.com>	2023-09-26 14:44:48 -07:00
Nicolò Lucchesi	4ab0e17fe8	[Technical docs] Fixed a couple of old links in `FAQ.md` (#17415 ) ### Description Updated a couple of old links in the technical documentation that were pointing to files present prior to the migration to https://onnxruntime.ai/docs.	2023-09-26 13:38:24 -07:00
Vincent Wang	e6301eee6a	Bump Up Version to 1.17.0 (#17587 ) Bump up version to 1.17.0 as the 1.16.0 release branch had been branched out.	2023-09-20 11:02:58 +08:00
Adrian Lizarraga	dea425e7c1	[QNN/CPU EP] Add 16-bit Quantize/Dequantize contrib ops (#17015 ) ### Description - Adds 16-bit integer support to: - Quantization kernel implementations: Intel, Neon, and Power intrinsics - DequantizeLinear and QuantizeLinear contrib ops - QNN EP Quantize and Dequantize operators - Python quantization scripts - Disables QDQ fusions for most 16-bit QDQ node groups (need to add 16-bit support to QLinear* ops) - Retains support for dropping QDQ nodes from Split, Gather, Reshape, Transpose, Squeeze, and Unsqueeze node groups. Sample python code to generate QDQ model with 16-bit activations and 8-bit weights: ```python quantize_static( input_model_path, output_model_path, data_reader, quant_format=args.quant_format, per_channel=args.per_channel, activation_type=QuantType.QUInt16, weight_type=QuantType.QUInt8, extra_options={"DedicatedQDQPair": True, "ForceQuantizeNoInputCheck": True, "UseQDQContribOps": True}, ) ``` Note that enabling the `UseQDQContribOps` extra option is not strictly necessary. If the 16bit types are used without enabling `UseQDQContribOps`, the QDQ ops domains are overridden to 'com.microsoft', and a warning is printed to stdout. ### Automated Tests MLAS/CPU EP: - [x] 16-bit QuantizeLinear computation - [x] 16-bit DequantizeLinear computation Optimizer: - [x] Transpose QDQ fusion - [x] Gather QDQ fusion - [x] Reshape QDQ fusion - [x] Squeeze QDQ fusion - [x] Unsqueeze QDQ fusion - [x] Split drop QDQ - [x] DoubleQDQPairRemover - [x] Transpose optimization - [x] EnsureUniqueDQForNodeUnit - [x] Common subexpression elimination (DQ not removed) - [x] Constant folding QNN EP: - [x] Conv 16-bit activations, 8-bit weights - [x] MatMul 16-bit activations, 8-bit weights - [x] Unary 16-bit QDQ ops - [x] Binary 16-bit QDQ ops Quantization tool: - [x] Test creation of 16-bit QDQ model ### Motivation and Context Support mixed precision (8bit weights, 16bit activations) models. 
--------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2023-09-18 09:43:34 -07:00
Nat Kershaw (MSFT)	a2fba28f6c	Remove extraneous javascript includes (#17558 )	2023-09-14 20:43:24 -07:00
Nat Kershaw (MSFT)	bbcf4b45dc	Upgrade doxygen to 1.9.8 (#17525 )	2023-09-12 20:44:27 -07:00
Baiju Meswani	5d2c57363f	Sign CUDA Kernel (#17293 )	2023-08-28 21:03:58 -07:00
Adrian Lizarraga	5a83a67f32	Support QDQ transformations with com.microsoft.Quantize/Dequantize ops (#17127 ) ### Description - Enables int32 support for com.microsoft.DequantizeLinear (contrib op) - Makes the `zero_point` input optional for Quantize/Dequantize contrib ops - Enables QDQ transformations with the Quantize/Dequantize contrib ops - Updates tests: EnsureUniqueDQForNodeUnitTests, QDQTransformerTests, TransposeOptimizerTests ### Testing List of tested graph transformations: - [x] QDQSelectorActionTransformer - qdq_transformer_test.cc - [x] QDQS8ToU8Transformer - qdq_transformer_test.cc - [x] DoubleQDQPairsRemover - qdq_transformer_test.cc - [x] IdenticalChildrenConsolidation - qdq_transformer_test.cc - [x] QDQPropagation - qdq_transformer_test.cc - [x] QDQFinalCleanup - qdq_transformer_test.cc - [x] ClipQuantFusion - qdq_transformer_test.cc - [x] ReluQuantFusion - qdq_transformer_test.cc - [x] EnsureUniqueDQForNodeUnit - ensure_unique_dq_for_node_unit_test.cc - [x] TransposeOptimizer - transpose_optimizer_test.cc - [x] CommonSubexpressionElimination - graph_transform_test.cc - [x] ConstantFolding - graph_transform_test.cc ### Motivation and Context We need to [support mixed 16-bit/8-bit precision QDQ models](https://github.com/microsoft/onnxruntime/pull/17015). This PR is the first step in achieving this goal: we need to make QDQ contrib ops work with our optimizations/transformations. --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com> Co-authored-by: Scott McKay <skottmckay@gmail.com>	2023-08-25 09:57:51 -07:00
pengwa	d90afc697b	Introduce ZeROOffloadSubscriber for ORTModule (#17006 ) ### Introduce ZeROOffloadSubscriber for ORTModule As part of the work to integrate ORTModule with DeepSpeed stage 3, this PR mainly focuses on moving the original PyTorch-based (hook-based) param partition/offload implementation to an ORTModule-compatible implementation. Changes include: 1. Refactor `SubscriberBase`/`SubscriberManager` to support pre-forward/post_forward hooks. 2. Implement a new `ZeROOffloadSubscriber` by re-using DeepSpeed hook functions as much as possible. All hook functions are defined in `DeepSpeedZeRoOffload._register_hooks_recursively` and `DeepSpeedZeRoOffload.setup_zero_stage3_hooks`, and fortunately the closures are not complex: all hooks reference the owning `DeepSpeedZeRoOffload` instance. So we can create a new hook function with `FunctionType` by binding the owning `DeepSpeedZeRoOffload` instance, then call the newly created function in the subscriber's `pre_forward_module_apply_impl` and `post_forward_module_apply_impl` interfaces. 3. Monkey patch `DeepSpeedZeRoOffload.setup_zero_stage3_hooks` to register the `ZeROOffloadSubscriber` for the model, so we don't need to change any code in the DeepSpeed repo (at least so far). 4. Fix the ATen embedding custom symbolic exporter function by tolerating a weight size of (0) (changed by DeepSpeed ZeRO stage 3). UT will be added once stage 3 is fully supported.	2023-08-25 00:15:22 +08:00
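
Item 2 of the ZeROOffloadSubscriber entry above mentions creating a new hook function with `FunctionType` by binding the owning instance. The following is a minimal sketch of that general Python technique only, assuming a hook closure whose single free variable is `self`. The names `Owner`, `make_hook`, and `rebind_closure` are hypothetical and are not taken from ORTModule or DeepSpeed; the PR's actual code may differ.

```python
import types


class Owner:
    """Hypothetical stand-in for an object that defines hooks as closures."""

    def __init__(self, name):
        self.name = name
        self.calls = []

    def make_hook(self):
        # The hook is a nested function; its only free variable is `self`.
        def _pre_forward_hook(module, inputs):
            self.calls.append(("pre", module))
            return inputs

        return _pre_forward_hook


def rebind_closure(func, new_owner):
    """Return a copy of `func` whose single closure cell holds `new_owner`.

    Assumes (and checks) that `func` closes over exactly one variable, `self`.
    Requires Python 3.8+ for `types.CellType`.
    """
    if func.__code__.co_freevars != ("self",):
        raise ValueError("expected a closure over exactly one variable named 'self'")
    cell = types.CellType(new_owner)
    return types.FunctionType(
        func.__code__,      # reuse the compiled body unchanged
        func.__globals__,   # same global namespace
        func.__name__,
        func.__defaults__,
        (cell,),            # new closure: `self` now refers to new_owner
    )


original_owner = Owner("original")
other_owner = Owner("other")
hook = original_owner.make_hook()
rebound = rebind_closure(hook, other_owner)
result = rebound("linear_1", (1, 2))
```

After the call, the rebound hook has recorded its invocation on `other_owner` rather than on `original_owner`, which is the effect the entry describes: the same hook body is reused while the instance it refers to is chosen by the caller.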


737 commits