onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-01 03:45:06 +00:00

Author	SHA1	Message	Date
liqun Fu	e10a8ae31f	reduce max/min 20 (#17805) ### Description ReduceMax and ReduceMin were updated in ONNX opset 20. Implement the updated versions in ORT. ### Motivation and Context This is for the ORT 1.17.0 release. --------- Signed-off-by: Liqun Fu <liqfu@microsoft.com>	2024-01-04 17:41:01 -08:00
Jeff Bloomfield	7401b6661d	Update OperatorKernels.md	2024-01-04 11:27:03 -08:00
Jeff Bloomfield	8ea3e68192	Update ContribOperators.md	2024-01-04 10:10:46 -08:00
liqun Fu	32fcf73740	Implement dft(20) (#17821) ### Description DFT was updated in opset 20. Implement the updated version in ORT. ### Motivation and Context This is for the ORT 1.17.0 release. Fixes #17723 --------- Signed-off-by: Liqun Fu <liqfu@microsoft.com>	2023-12-19 10:42:54 -08:00
luoyu-intel	5f00bc9931	Integrate high-performance x64 gemm library to MLAS (#17669 ) ### Description Improve MLAS to support high-performance x64 INT4 kernels ### Motivation and Context 1. improve LLM inference performance on Intel CPUs. 2. support more 4bit quantization types: nf4, fp4 3. support dynamic block size: block size aligned with kernel's tiling size(e.g. 4 for VNNI kernel), per channel on N dimension 4. support most Intel ISAs: avx2, avx_vnni, avx512f, avx512_vnni, amx_bf16, amx_int8, avx512_fp16 5. support MatMulNBits' data format ### Tasks - [x] support block_size: 32, 128, -1(per channel) - [x] get weight pack size without memory allocation - [x] use ort's thread pool for parallelism - [x] support ISAs: avx2, avx512f, avx_vnni, avx512_vnni, amx_int8 ### Benchmark Ubuntu 20.22 + Intel(R) Xeon(R) Platinum 8480+ 56 cores Benchmark \| Time \| CPU \| Iterations -- \| -- \| -- \| -- Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:4096/Threads:56/real_time \| 47613 \| 47401 \| 12970 Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:4096/Threads:56/real_time \| 6347792 \| 6317562 \| 109 Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:4096/Threads:56/real_time \| 11814014 \| 11757847 \| 59 Q4GEMM_Jblas/Q4G128SymInt8/M:1/N:4096/K:4096/Threads:56/real_time \| 50222 \| 50031 \| 13759 Q4GEMM_Jblas/Q4G128SymInt8/M:1024/N:4096/K:4096/Threads:56/real_time \| 2038222 \| 2028743 \| 341 Q4GEMM_Jblas/Q4G128SymInt8/M:2048/N:4096/K:4096/Threads:56/real_time \| 3792832 \| 3774485 \| 191 Q4GEMM_Jblas/Q4GPerNSymInt8/M:1/N:4096/K:4096/Threads:56/real_time \| 58717 \| 58501 \| 11467 Q4GEMM_Jblas/Q4GPerNSymInt8/M:1024/N:4096/K:4096/Threads:56/real_time \| 1360846 \| 1354598 \| 543 Q4GEMM_Jblas/Q4GPerNSymInt8/M:2048/N:4096/K:4096/Threads:56/real_time \| 2564232 \| 2551365 \| 266 Q4GEMM_Jblas/Q4G32SymFp32/M:1/N:4096/K:4096/Threads:56/real_time \| 57929 \| 57694 \| 12047 Q4GEMM_Jblas/Q4G32SymFp32/M:1024/N:4096/K:4096/Threads:56/real_time \| 5495330 \| 5465810 \| 126 
Q4GEMM_Jblas/Q4G32SymFp32/M:2048/N:4096/K:4096/Threads:56/real_time \| 10676240 \| 10617817 \| 66 Q4GEMM_Jblas/Q4G128SymFp32/M:1/N:4096/K:4096/Threads:56/real_time \| 68305 \| 68047 \| 10026 Q4GEMM_Jblas/Q4G128SymFp32/M:1024/N:4096/K:4096/Threads:56/real_time \| 5504862 \| 5476215 \| 126 Q4GEMM_Jblas/Q4G128SymFp32/M:2048/N:4096/K:4096/Threads:56/real_time \| 11758623 \| 11697337 \| 66 Q4GEMM_Jblas/Q4GPerNSymFp32/M:1/N:4096/K:4096/Threads:56/real_time \| 67713 \| 67451 \| 10298 Q4GEMM_Jblas/Q4GPerNSymFp32/M:1024/N:4096/K:4096/Threads:56/real_time \| 5508325 \| 5480237 \| 126 Q4GEMM_Jblas/Q4GPerNSymFp32/M:2048/N:4096/K:4096/Threads:56/real_time \| 10738528 \| 10681656 \| 64 Q4GEMM_Jblas/Q4G32AsymFp32/M:1/N:4096/K:4096/Threads:56/real_time \| 60708 \| 60486 \| 11321 Q4GEMM_Jblas/Q4G32AsymFp32/M:1024/N:4096/K:4096/Threads:56/real_time \| 5523784 \| 5495736 \| 126 Q4GEMM_Jblas/Q4G32AsymFp32/M:2048/N:4096/K:4096/Threads:56/real_time \| 10829633 \| 10772161 \| 67 Reference: Benchmark \| Time \| CPU \| Iterations -- \| -- \| -- \| -- Q4GEMM/Q4Sym/M:1/N:4096/K:4096/Threads:56/real_time \| 53088 \| 52911 \| 13364 Q4GEMM/Q4Sym/M:1024/N:4096/K:4096/Threads:56/real_time \| 6268981 \| 6230335 \| 110 Q4GEMM/Q4Sym/M:2048/N:4096/K:4096/Threads:56/real_time \| 11701237 \| 11632339 \| 59 Win11+12900K 8 cores: Benchmark \| Time \| CPU \| Iterations -- \| -- \| -- \| -- Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:4096/Threads:8/real_time \| 215976 \| 211295 \| 2884 Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:4096/Threads:8/real_time \| 60960590 \| 60937500 \| 10 Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:4096/Threads:8/real_time \| 1.18E+08 \| 1.19E+08 \| 5 Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:11008/K:4096/Threads:8/real_time \| 470377 \| 453059 \| 1414 Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:11008/K:4096/Threads:8/real_time \| 1.54E+08 \| 1.53E+08 \| 5 Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:11008/K:4096/Threads:8/real_time \| 3.18E+08 \| 3.13E+08 \| 2 
Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:11008/Threads:8/real_time \| 569072 \| 559398 \| 1229 Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:11008/Threads:8/real_time \| 1.54E+08 \| 1.52E+08 \| 4 Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:11008/Threads:8/real_time \| 3.22E+08 \| 3.28E+08 \| 2 Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:11008/K:11008/Threads:8/real_time \| 1486055 \| 1473325 \| 403 Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:11008/K:11008/Threads:8/real_time \| 4.14E+08 \| 4.14E+08 \| 2 Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:11008/K:11008/Threads:8/real_time \| 8.88E+08 \| 8.59E+08 \| 1 --------- Signed-off-by: Mengni Wang <mengni.wang@intel.com> Co-authored-by: Mengni Wang <mengni.wang@intel.com>	2023-12-19 09:36:31 -08:00
pengwa	ccf3b2054b	Allow layer-wise recompute (#18566) ### Allow layer-wise recompute Previously, users/developers had to specify the subgraphs to recompute; now we introduce a more user-friendly way to enable recompute for all detected stashed activation recomputation subgraphs. This sacrifices getting the best configs but makes it easier to support users when they switch from PyTorch per-layer gradient checkpointing to ORTModule. `ORTMODULE_MEMORY_OPT_LEVEL` is introduced to control the usage. By default it is 0, i.e. `USER_SPECIFIED`: all subgraphs defined in `ORTMODULE_MEMORY_OPT_CONFIG` will be recomputed, so this is compatible with existing recompute usage in ORTModule-integrated models. With `ORTMODULE_MEMORY_OPT_LEVEL=1`, all detected recompute plans are enabled, so the configs in `ORTMODULE_MEMORY_OPT_CONFIG` are no longer respected. Add unit tests using 3-layer BLOOM models. https://github.com/microsoft/onnxruntime/blob/pengwa/add_aggresive_recompute/docs/Memory_Optimizer.md	2023-12-12 08:44:05 +08:00
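The two environment variables described in commit ccf3b2054b above can be set from Python before ORTModule is constructed. The following is a minimal sketch under stated assumptions: the helper name `memory_opt_env` is hypothetical, and it only builds the variable values as the commit message describes them (level 0 honors `ORTMODULE_MEMORY_OPT_CONFIG`, level 1 enables all detected plans). It does not call into ONNX Runtime.

```python
import os


def memory_opt_env(level=0, plans=()):
    """Build ORTModule memory-optimizer env vars (hypothetical helper).

    level 0: recompute only the subgraphs listed in `plans`.
    level 1: enable every detected recompute plan; `plans` is ignored,
             mirroring the behavior described in the commit message.
    """
    if level not in (0, 1):
        raise ValueError("only levels 0 and 1 are described in the commit message")
    env = {"ORTMODULE_MEMORY_OPT_LEVEL": str(level)}
    if level == 0 and plans:
        # Multiple plans are comma-separated, as the log notes in this history state.
        env["ORTMODULE_MEMORY_OPT_CONFIG"] = ",".join(plans)
    return env


user_specified = memory_opt_env(0, ["Reshape+Where+BiasSoftmax+:1:-1", "Cast+:1:-1"])
layer_wise = memory_opt_env(1)

# Assumption: the variables must be in the environment before ORTModule is created.
os.environ.update(layer_wise)
```

Keeping the level and the plan list in one helper makes it harder to set a plan list that level 1 would silently ignore.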
Xavier Dupré	d41dd77241	Extend API page on the python documentation (#18762)	2023-12-09 15:33:57 -08:00
Hector Li	9768a727e1	[QNN EP] Fix a bug that can't create context binary if the model has inputs/outputs with different data type (#18722) Fix a bug where the context binary could not be created if the model has inputs/outputs with different data types. ### Description Update the EPContext op schema to unblock nodes whose inputs and outputs have different data types.	2023-12-06 13:07:09 -08:00
pengwa	4bfa84487c	Skip module clone for preparing large model export (#18663) ### Skip module clone for preparing large model export For LLAMA2 13B, running with LoRA and DeepSpeed stage 2 on 8 GPUs failed while preparing the outputs that are used for torch.onnx.export. The reason is that we deep copy all the params, including both the large frozen weights and the small set of LoRA trainable weights. This PR first checks whether there is enough GPU memory for a cloned module; if not, it skips the copy. The module is copied because the forward run may change the weights, though this case should be rare. For now, not being able to run is worse than running with slightly different initial weights, especially for large models.	2023-12-05 12:41:17 -08:00
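The clone-or-skip decision described in commit 4bfa84487c above can be sketched in a few lines. This is an illustration only, not the ORTModule implementation: `maybe_clone` is a hypothetical name, and the caller is assumed to supply the parameter size and the free device memory in bytes (how those are measured is outside this sketch).

```python
import copy


def maybe_clone(module, param_bytes, free_bytes):
    """Return (object_to_use, was_cloned).

    Deep copy the module only when a second copy of its parameters fits
    in the free memory; otherwise fall back to the original object and
    accept that a forward run may touch its weights.
    """
    if param_bytes <= free_bytes:
        return copy.deepcopy(module), True
    return module, False


# A plain dict stands in for a model here; sizes are made-up numbers.
model = {"frozen": [0.0] * 4, "lora": [0.1]}
small_copy, cloned_small = maybe_clone(model, param_bytes=40, free_bytes=1_000)
big_same, cloned_big = maybe_clone(model, param_bytes=40, free_bytes=10)
```

Returning the flag alongside the object lets the caller log when the copy was skipped.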
Vincent Wang	e1d1033131	[ORTModule] Remove Unused Arguments from Generated Triton Code (#18636) This PR: - Removes unused arguments from generated Triton code. - Removes the unnecessary mask for the symbolic shape case from generated Triton code. - Adds documentation for usage of ORTMODULE_TRITON_CONFIG_FILE.	2023-11-30 18:32:36 +08:00
Dmitri Smirnov	d2dfbf4179	Add float16 type support to SplitToSequence and make code type independent (#18594 ) ### Description Add support for `float16` type to address the below issue. Re-work the code to make it type independent. This reduces binary size by ~11 K. ![image](https://github.com/microsoft/onnxruntime/assets/11303988/1a77c7bc-34a8-478c-a16a-abd94062c6c6) ### Motivation and Context This PR addresses https://github.com/microsoft/onnxruntime/issues/18481	2023-11-29 10:44:59 -08:00
pengwa	43a5147e01	Memory optimization refactor and refinement (#17481) ### Memory optimization refactor and refinement Currently the memory optimizer runs graph transformations and prints recompute opportunities at INFO level, but the ORT backend emits so many INFO-level logs that users have a hard time finding that information. So we want a Python binding API to retrieve the memory optimization opportunities instead of depending on the MemoryOptimizer's default logging. Then we can print ORTModule feature statistics using this information. Also, with such an API, we can run the analysis after an ORT session is created, when the allocation plan is done, so the analysis considers buffer reuse as well. This avoids suggesting recomputation subgraphs that reuse other subgraphs' output buffers. Check https://github.com/microsoft/onnxruntime/blob/pengwa/add_devinfo_level/docs/Memory_Optimizer.md for the new flow using `MemoryOptimizer`. This pull request makes the following refactorings: 1. Print the log in the ORTModule Python script, along with the ORTModule feature enabling stats. This is implemented by exposing an API `get_serialized_ortmodule_memory_stat` to retrieve the memory optimization opportunities. 2. Analyze memory optimization opportunities with ORT memory planning taken into account. This is done by first creating the execution graph without enabling MemoryOptimizer, then calling `execution_agent.get_serialized_ortmodule_memory_stat`, which internally considers the session memory allocation planner when analyzing memory optimization opportunities. As a direct result, the memory optimization opportunities can show the stashed activations that reuse other buffers. 3. Move recompute analysis logic from memory_optimizer.h/cc to recompute_analysis.h/cc. 4. Abstract optimization strategies into their own implementations. This will make introducing new strategies (for example compression and decompression) easier. 
New logging matrix (INFO Level), in WARNING level, the details will NOT show. ``` 2023-09-13 13:25:09,249 orttraining.rank-0 [WARNING] - *** ONNX Runtime Training (ORTModule) is accelerating your model *** ORTModule is enabled with following features ON/OFF for [training] mode: ATen Executor : ON : Dispatch ATen operators to ORT's ATen executor Cast Propagation : ON : Level 1 enabled Custom Function : ON : Support custom torch.autograd.Function export and execution Memory Optimizer : ON : RecomputeConfig: Reshape+Where+BiasSoftmax+:1:-1,Cast+:1:-1, ProbeLevel: 1, available configs: Config Freq Saving(B) Saving Symbolic(Bytes) - Plan 1 : ON : Reshape+Where+BiasSoftmax+:1:-1 5 671,088,640 640.0inputs_input_ids_dim0inputs_input_ids_dim1*2 - Plan 2 : ON : Cast+:1:-1 6 402,587,648 inputs_input_ids_dim0inputs_input_ids_dim1(384.0inputs_input_ids_dim1 - 64.0) - Plan 3 : OFF : Reshape+Where+:1:-1 1 134,217,728 128.0inputs_input_ids_dim0inputs_input_ids_dim1*2 - Plan 4 : OFF : BiasSoftmax+:1:-1 1 134,086,656 128.0inputs_input_ids_dim0inputs_input_ids_dim1(inputs_input_ids_dim1 - 1) - Plan 5 : OFF : BiasGelu+:1:-1 6 125,808,640 inputs_input_ids_dim0(122880.0inputs_input_ids_dim1 - 20480.0) - Plan 6 : OFF : FusedMatMul+:1:-1 6 125,808,640 inputs_input_ids_dim0(122880.0inputs_input_ids_dim1 - 20480.0) - Plan 7 : OFF : FusedMatMul+Add+FusedMatMul+Add+Add+Add+:1:-1 5 26,214,400 25600.0inputs_input_ids_dim0inputs_input_ids_dim1 - Plan 8 : OFF : Add+:1:-1 1 5,237,760 5120.0inputs_input_ids_dim0(inputs_input_ids_dim1 - 1) - Plan 9 : OFF : Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+:1:-1 1 4,096 4.0inputs_input_ids_dim0inputs_input_ids_dim1 - Plan 10 : OFF : Cast+:2:-1 1 2,048 2.0inputs_input_ids_dim0inputs_input_ids_dim1 Compute Optimizer : ON : Enable/Disable with env ORTMODULE_ENABLE_COMPUTE_OPTIMIZER=1/0 - FLOPReduction : ON : Reduce FLOPs by upstreaming shrinking-sized ops Auto Fallback : ON : Fallback to PyTorch when encountering unsupported ops TritonOp Enabled : OFF : ORT 
will switch to Triton for executing some ops to further accelerate training. ZeRO Stage3 Support : OFF : Enable/Disable with env ORTMODULE_ENABLE_ZERO_STAGE3=1/0 Total ORT initialization overhead is 10.73s where export takes 8.39s. Other overhead details: graph builder init takes 0.06s, runtime detection takes 0.01s, graph building takes 0.31s, session creation takes 1.96s Versions: ONNX Runtime - 1.16.0+cu118, ONNX - 1.11.0 Note 1: use comma to enable multiple plans at the same time. export ORTMODULE_MEMORY_OPT_CONFIG=<plan1 config>,<plan2 config>,... Note 2: saving is calculated based on the 1st batch symbolic dim values: inputs_input_ids_dim0=1, inputs_input_ids_dim1=1024, inputs_attention_mask_dim0=1, inputs_attention_mask_dim1=1024, inputs_labels_dim0=1, inputs_labels_dim1=1024, ************************************************************************ ``` If DEVINFO level is enabled, then more details about the memory optimizations are printed. ``` MemoryInsight Summary - User config: BiasGelu+:1:-1,Cast+:2:-1 ========================================================================================================================================== \|Freq \| Memory Optimization Opportunities (Clustered by node-level activation patterns) \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|3 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph FusedMatMul+Add+Reshape+ \| \| \| Status : Disabled. 
Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+Add+Reshape+:1:-1 \| \| \| Stashed Activations: \| \| \| - ReuseFreq : Output 0(3), \| \| \| - Output 0 : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 32 x 240 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Reshape+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+:1:-1 \| \| \| Stashed Activations: \| \| \| - ReuseFreq : Output 0(2), \| \| \| - Output 0 : [ x 2560 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph FusedMatMul+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 10240 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Cast+ \| \| \| Status : Disabled. 
Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Cast+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Reshape+Where+BiasSoftmax+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Where+BiasSoftmax+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph BiasGelu+ \| \| \| Status : Enabled, requested count=-1, actual applied count=2 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 10240 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|2 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph FusedMatMul+Add+FusedMatMul+Add+Add+Add+ \| \| \| Status : Disabled. 
Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+Add+FusedMatMul+Add+Add+Add+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 2560 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Reshape+Where+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Where+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph FusedMatMul+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0(inputs_input_ids_dim1 - 1) x 10240 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Cast+ \| \| \| Status : Disabled. 
Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Cast+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 - 1 x inputs_input_ids_dim1 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 1 x 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved \| \| \| \| \| \|>>Option 2 : RecomputeWithCompromise subgraph Cast+ \| \| \| Status : Enabled, requested count=-1, actual applied count=1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 1 x 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 50% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph BiasSoftmax+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=BiasSoftmax+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 - 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. 
\| \| \| \| \| \|>>Option 1 : Recompute subgraph BiasGelu+ \| \| \| Status : Enabled, requested count=-1, actual applied count=1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0(inputs_input_ids_dim1 - 1) x 10240 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| \|1 \|For each row options are mutually exclusive, only one of them can be enabled. \| \| \| \| \| \|>>Option 1 : Recompute subgraph Add+ \| \| \| Status : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Add+:1:-1 \| \| \| Stashed Activations: \| \| \| - Output 0 : [inputs_input_ids_dim0(inputs_input_ids_dim1 - 1) x 2560 x ], byte/elem: 2, 100% saved \| \|_ _ _ _\|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \| ========================================================================================================================================== Note: use comma as a separator for enabling more than one subgraphs. *********************************************************************** ``` ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-11-23 11:39:00 +08:00
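The `ORTMODULE_MEMORY_OPT_CONFIG` values shown in the logs above follow a `<subgraph>:<number>:<count>` pattern, with several plans joined by commas. The sketch below parses that pattern. The field names are assumptions inferred from the log output (the plan `Cast+:2:-1` is reported as Option 2, `RecomputeWithCompromise`, with `requested count=-1`); they are not taken from ONNX Runtime source, and `parse_memory_opt_config` is a hypothetical helper.

```python
from typing import List, NamedTuple


class PlanConfig(NamedTuple):
    subgraph: str  # e.g. "Reshape+Where+BiasSoftmax+"
    strategy: int  # assumed: 1 = Recompute, 2 = RecomputeWithCompromise
    count: int     # requested count; -1 appears to mean "no limit"


def parse_memory_opt_config(value: str) -> List[PlanConfig]:
    """Split a comma-separated plan list into structured entries."""
    plans = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        # Split from the right so the subgraph part stays intact.
        subgraph, strategy, count = item.rsplit(":", 2)
        plans.append(PlanConfig(subgraph, int(strategy), int(count)))
    return plans


plans = parse_memory_opt_config("BiasGelu+:1:-1,Cast+:2:-1")
```

Parsing the string this way makes it easy to check a hand-written config against the plan names printed in the log before exporting it.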
Vincent Wang	3bc9efc7b2	[ORTModule] Adjust Attention Patterns for Efficient Attention ATen Fallback (#18471) Adjust attention patterns to match the latest Whisper+exporter. Also add some condition checks and add docs.	2023-11-22 15:24:05 +08:00
Jambay Kinley	1af0681554	Bfloat16 support for MatMulBnb4, Training support bitsandbytes>=0.41.2 (#18484) ### Description Add bfloat16 support for `MatMulBnb4` contrib op. This is useful for QLoRA fine-tuning. - On GPUs with SM80+ (A100, etc), it uses the native cuda bfloat16 dtype, `nv_bfloat16`. On other GPUs, it uses the onnxruntime `BFloat16` type which uses float for compute. - I have validated the op in a llama2-7b training scenario. The losses match pytorch training and the training throughput is better. - Cannot add a bfloat16 case in the op unit test since casting BFloat16 to and from float multiple times during the test causes the required tolerances to be unachievable. The custom autograd function exporter in onnxruntime-training is updated to support the latest version of bitsandbytes. They changed how the `quant_state` is stored. ### Motivation and Context Enable QLoRA fine-tuning with bfloat16.	2023-11-20 09:52:58 -08:00
kailums	1a29460919	rope support 4D input tensor (#18454) ### Description Change the RotaryEmbedding op implementation to add support for a 4D input tensor of shape [batch, num_heads, seq_len, head_size]. ### Motivation and Context The current RotaryEmbedding op only supports a 3D input tensor of shape [batch, seq_len, hidden_size]. For the llamav2 model, when using FusionRotaryEmbeddings to fuse only the RotaryEmbedding op, query and key are transposed first, so the input tensor of RotaryEmbedding becomes 4D: [batch, num_heads, seq_len, head_size]. The current implementation cannot handle this scenario, so it needs to support 4D input tensors.	2023-11-17 20:38:15 +08:00
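The relationship between the two layouts named in commit 1a29460919 above can be shown with plain nested lists. This is a sketch of the layout change only, assuming `hidden_size = num_heads * head_size`; it does not apply any rotary embedding, and `to_4d` is a hypothetical helper name.

```python
def to_4d(x, num_heads):
    """Convert [batch][seq_len][hidden_size] nested lists into
    [batch][num_heads][seq_len][head_size]."""
    out = []
    for sample in x:
        hidden_size = len(sample[0])
        if hidden_size % num_heads != 0:
            raise ValueError("hidden_size must be divisible by num_heads")
        head_size = hidden_size // num_heads
        # For each head, take that head's slice of every token.
        heads = [
            [token[h * head_size:(h + 1) * head_size] for token in sample]
            for h in range(num_heads)
        ]
        out.append(heads)
    return out


# batch=1, seq_len=2, hidden_size=4 split into 2 heads of size 2
x3d = [[[0, 1, 2, 3], [4, 5, 6, 7]]]
x4d = to_4d(x3d, num_heads=2)
```

In the 4D layout each head's tokens are contiguous along the sequence axis, which is the shape the transposed query and key tensors arrive in.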
aciddelgado	adb56df2e8	Aciddelgado/gqa local (#18375) ### Description Implement a preliminary version of local (sliding window) attention. Currently only supported by Flash Attention (sm >= 80, Linux). Currently only supports sliding attention with a large cached kv. ### Motivation and Context This change makes it possible to run Mistral and other models that use sliding window attention.	2023-11-16 15:01:06 -08:00
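As a reminder of what local (sliding window) attention in commit adb56df2e8 above restricts, the sketch below builds a causal mask in which each query position may attend only to the most recent `window` key positions. The convention that the window includes the current token is an assumption made for illustration; the commit message does not state the exact convention used by the kernel, and `sliding_window_mask` is a hypothetical helper.

```python
def sliding_window_mask(seq_len, window):
    """Return a seq_len x seq_len boolean mask.

    mask[i][j] is True when query i may attend to key j: j must not be
    in the future (j <= i) and must lie within the last `window`
    positions, counting position i itself.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = sliding_window_mask(seq_len=4, window=2)
```

With a window of 2, the third token can see itself and the second token but no longer the first, which is what bounds the attended span regardless of sequence length.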
Ye Wang	f9af94009b	onboard MoE (#18279) ### Description 1. Introduce MoE CUDA op to ORT based on FT implementation. 2. Upgrade cutlass to 3.1.0 to avoid some build failures on Windows. Remove patch file for cutlass 3.0.0. 3. Sharded MoE implementation will come with another PR. Limitation: __CUDA_ARCH__ >= 700	2023-11-14 16:48:51 -08:00
Prathik Rao	7a3da4526f	add bfloat16 support for CUDA Neg kernel (#18306) ### Description Registers BFloat16 datatype as valid input type for CUDA Neg Kernel. ### Motivation and Context Enabling `meta-llama/Llama-2-70b` to be finetuned with ONNX Runtime training. --------- Co-authored-by: Prathik Rao <prathikrao@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>	2023-11-08 18:32:12 -08:00
pengwa	2151c79bf1	Tune ORTModule logging experience a bit (#18298) ### Tune logging experience a bit After the last update to the ORTModule logging experience, we found a few issues: 1. `INFO` level outputs too many things, including PyTorch exporter verbose logs (tracing graphs) on every rank. At this level, we only want to - Output a little more information to users than `WARNING` level, for example the memory recomputation recommendations or other not-fully-ready features. - Output a little more information for a quick diagnostic, collected on rank-0 only. 2. ONNX Runtime logging filtering during graph build and session init sometimes hides issues (for example a segmentation fault), leaving no useful information in `WARNING`/`INFO` for users to report to us. This is not good! 3. Some of our devs like using `pdb` to debug Python code, but adding `import pdb; pdb.set_trace()` in a model's code might hang under `INFO` or `WARNING`, where export happens and all output gets redirected due to log filtering. The only workaround is to switch to VERBOSE, which outputs far too many logs. The corresponding changes proposed here are: 1. For `INFO` logging, - We only log on rank-0. - We restrict the ORT backend logging level to WARNING in this case, because ORT backend code outputs way too many logs that should be under verbose, and we cannot guarantee they get cleaned up immediately once they are added. - We output the PyTorch exporter verbose log (including the tracing graph), which is useful for a quick diagnostic when an issue happens. 2. Remove all logging filtering on the ORT backend, so that segmentation fault details are not hidden if one happens again. 3. Introduce a `DEVINFO` logging level: - Log on all ranks - Set the ORT backend logging level to INFO - Turn OFF all PyTorch exporter logging filtering (to unblock pdb debugging). 4. Currently, using Memory Optimizer requires DEVINFO (which will output ORT backend INFO logs). 
So update the memory optimizer document to reflect this. https://github.com/microsoft/onnxruntime/pull/17481 will change the requirement back to INFO for showing memory optimization information. You can check https://github.com/microsoft/onnxruntime/blob/pengwa/devinfo_level/docs/ORTModule_Training_Guidelines.md#log-level-explanations for a better view of the different log levels. This PR also extracts some changes from a bigger one, https://github.com/microsoft/onnxruntime/pull/17481, to reduce its complexity for review. --------- Co-authored-by: mindest <30493312+mindest@users.noreply.github.com>	2023-11-08 17:42:50 +08:00
aciddelgado	3dece27f51	GQA Flash Attention with Attention Mask (#18283 ) ### Description GQA now only works with Flash Attention with Attention Mask input, allowing for batched input. Note: This PR Disables Memory Efficient Attention, only allowing Flash Attention kernel to be used. ### Motivation and Context Allows GQA to work with batched input. --------- Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>	2023-11-07 17:47:51 -08:00
liqun Fu	6127dd1d2d	implement gridsample 20 (#17744)	2023-11-07 10:42:41 -08:00
Patrice Vignola	800ae7742c	[DML EP] Add RotaryEmbedding (#18158) This is a graph implementation of RotaryEmbedding since there's no time to add it to DML before 1.16.2, but it eventually should move into DirectML since we're bandwidth-bound.	2023-11-07 08:26:11 -08:00
Prathik Rao	8978bdc59d	add bfloat16 support for where operator (#18118) ### Description Adds bfloat16 as a valid input parameter type for the Where node for ONNX opset 16+. ### Motivation and Context Enabling `meta-llama/Llama-2-70b` to be finetuned with ONNX Runtime training. --------- Co-authored-by: Prathik Rao <prathikrao@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>	2023-11-02 12:23:20 -07:00
pengwa	c8e1038eab	Optimize 4bit Qlora training (#18131) ### Optimize 4bit Qlora training Extend the existing `MatmulBnb4bit` op for use in training scenarios. The PR includes the following changes: 1. Add special `torch.autograd.Function` export logic for `bitsandbytes.autograd._functions.MatMul4Bit` that is preferred over the common PythonOp exporter. 2. Add an optional `training_mode` attribute for op `MatmulBnb4bit`, which helps skip some inference-specific logic in the implementation. 3. Add an optional `transB` attribute, which defaults to 1; setting it to 0 is needed for backward usage. Changing from `PythonOp` to `MatmulBnb4bit` brings roughly 2.9% throughput gains. The reason is: `bitsandbytes.autograd._functions.MatMul4Bit` calls `ctx.save_for_backward`, which needs an additional copy in PythonOp; otherwise the tensor might be released by ORT while the backward op still references it. Removing the clones also reduces peak memory consumption because `bitsandbytes.autograd._functions.MatMul4Bit` saves tensors that are not needed in the backward compute.	2023-11-02 09:46:11 -07:00
aciddelgado	178f7caaeb	GQA Memory Efficient Kernel (#17920 ) Implement Cutlass Memory Efficient Attention Kernel into Group Query Attention Operator. ### Motivation and Context Before this change, Group Query Attention Operator was supported only by Flash-Attention. While this is the most efficient kernel for the operation, it only supports sm >= 80. Cutlass Memory Efficient Attention Kernel supports sm >= 53, allowing us to support a broader range of GPU hardware.	2023-11-01 20:04:22 -07:00
Preetha Veeramalai	d87216bcb1	Openvino ep ort 23.1 (#17911 ) ### Description Integration to OpenVINO 2023.1 ### Motivation and Context - Alignment with latest OpenVINO Version. - Device name change from VPUX to NPU and Remove from supported list until official public support is available. --------- Co-authored-by: Sahar Fatima <sfatima.3001@gmail.com> Co-authored-by: Saurabh Kale <saurabh1.kale@intel.com> Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com> Co-authored-by: sfatimar <sahar.fatima@intel.com>	2023-11-01 08:39:39 -07:00
Tianlei Wu	95f053c652	[CUDA] Update GroupNorm and Add SkipGroupNorm (#18091 ) * Add a new operator SkipGroupNorm to support skip and bias inputs. * Update GroupNorm kernel to support the number of channels used in SD XL Refiner. * Add epsilon in kernel * Add parity and performance test script * Remove many limitations including max batch size, max number of groups, c % cPerBlock == 0 etc. ### Motivation and Context Update GroupNorm to support SD XL Refiner and beyond.	2023-10-31 10:27:20 -07:00
Xavier Dupré	b5f242e978	GemmFloat8 as a contrib ops (#16051 ) ### Description Add support for Gemm with float 8 as a contrib op. --------- Co-authored-by: Randy Shuai <rashuai@microsoft.com> Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com> Co-authored-by: Scott McKay <Scott.McKay@microsoft.com> Co-authored-by: Xavier Dupre <xadupre@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>	2023-10-27 14:33:55 +02:00
Tang, Cheng	37873be86d	enable reduce ops on opset18 (#18053 ) ### Description Opset 18 applies the "axes as input" change from ReduceSum to all the other reduce ops. Our CUDA kernel actually supports it, but we didn't enable it for opset 18. This PR updates the reduce ops' kernel registration to enable the "axes as input" behavior for opset 18. As part of the fix, I also simplified the reduce op kernel registration. ORT doesn't require the kernel definition to be exactly the same as the ONNX op definition. In our case, where we share the same kernel for all the reduce ops (from version 1 to version 18), we don't need to maintain different versions of kernel definitions; we can simplify by using a single kernel definition for multiple versions. In some cases we might register more types for legacy versions, but that is harmless: the framework uses the schema, not the kernel definition, to validate the graph. --------- Co-authored-by: Cheng Tang <chenta@a100.crj0ad2y1kku1j4yxl4sj10o4e.gx.internal.cloudapp.net> Co-authored-by: Cheng Tang <chenta@microsoft.com>	2023-10-26 16:57:21 -07:00
Jambay Kinley	d30d4d372a	Add MatMul FP4 and NF4 Support (#18066 ) ### Description Add a contrib op MatMulBnb4 (FP4 and NF4) and related toolchain to support quantization on weight. This PR adds: - schema for contrib op MatMulBnb4 which can support FP4 (4-bit floating point) and NF4 (4-bit NormalFloat) quantization on weight. - a naive implementation for MatMulBnb4 on CPU and GPU, i.e., implemented like MatMul(A, Dequantize(B)). - a special implementation for GemV for MatMulBnb4 and related benchmark tool. - tool to quantize model to FP4 or NF4.	2023-10-25 15:34:58 -07:00
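The MatMulBnb4 commit above describes its naive path as MatMul(A, Dequantize(B)). Below is a minimal NumPy sketch of what a lookup-table dequantization of that kind could look like, assuming a bitsandbytes-style scheme of per-block absmax scaling plus a 16-entry codebook. The codebook here is an evenly spaced placeholder, not the real NF4 or FP4 table, and the function names, block size, and unpacked storage layout are all hypothetical rather than taken from ORT.

```python
import numpy as np

# Placeholder 16-entry codebook (assumption): evenly spaced in [-1, 1].
# The real NF4 / FP4 tables use different, non-uniform values.
CODEBOOK = np.linspace(-1.0, 1.0, 16, dtype=np.float32)


def quantize_codebook(w_flat, block_size=64):
    """Map each value to the index of its nearest codebook entry, per block."""
    blocks = w_flat.reshape(-1, block_size)
    absmax = np.abs(blocks).max(axis=1, keepdims=True)
    absmax = np.where(absmax == 0, 1.0, absmax)  # avoid dividing by zero
    normalized = blocks / absmax                 # now within [-1, 1]
    # Nearest-neighbour search over the 16 codebook entries.
    codes = np.abs(normalized[..., None] - CODEBOOK).argmin(axis=-1)
    return codes.astype(np.uint8), absmax


def dequantize_codebook(codes, absmax):
    """Look up each 4-bit code and rescale by the block's absmax."""
    return (CODEBOOK[codes] * absmax).reshape(-1)


def matmul_bnb4_naive(a, codes, absmax, k, n):
    """Naive path: dequantize the whole weight, then do a plain matmul."""
    return a @ dequantize_codebook(codes, absmax).reshape(k, n)
```

With a uniform codebook the reconstruction error per element is at most half a codebook step times the block's absmax; a non-uniform table trades that uniform bound for better accuracy where weights are dense.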
liqun Fu	706e13e0c9	implement affinegrid cpu kernel (#17777 )	2023-10-25 10:46:04 -07:00
liqun Fu	efa0cc2562	implement isinf20 and isnan20 (#17874 )	2023-10-24 10:58:54 -07:00
kunal-vaishnavi	2a17d5cf32	LLaMA Model Optimization (#18021 ) ### Description This PR contains fusion-level and kernel-level optimizations for [Meta's LLaMA-2](https://blogs.microsoft.com/blog/2023/07/18/microsoft-and-meta-expand-their-ai-partnership-with-llama-2-on-azure-and-windows/). Some of the added optimizations include: - SimplifiedLayerNorm changes - Fusions for multiple variants - SkipSimplifiedLayerNorm changes - Kernel support for CPU - Rotary embeddings (previously did not exist) - Fusions for multiple variants - CPU and CUDA kernels - Supports interleaving and non-interleaving in the same kernels - Optimized cache that requires half of its originally exported sizes - Reduced from `(max_sequence_length, head_size)` to `(max_sequence_length, head_size / 2)` - Multi-head attention - Support for 2D and 3D attention masks - Group query attention (for FP16 CUDA and INT4 CUDA) - Integration with flash attention v2 and past-present buffer sharing - Removes need for `attention_mask` input as it is supported in the kernel - 4 bit quantization - `block_size` parameter is available for customizing - Support the new changes for [Microsoft version](https://github.com/microsoft/Llama-2-Onnx) - Support combinations of the below variants (ex: export ORT version and run with Optimum) Supported variants of LLaMA-2 include: - [ORT version](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/python/tools/transformers/models/llama) - Produces one ONNX file that is already optimized (and quantized if requested) - Integrates with Optimum - [Another Microsoft version](https://github.com/microsoft/Llama-2-Onnx) - Already exported and available off-the-shelf - Faster versions of those models will be uploaded there soon - [Hugging Face version](https://huggingface.co/meta-llama) - Models that end with `-hf` - Some older and current versions of [`transformers`](https://github.com/huggingface/transformers) and [`optimum`](https://github.com/huggingface/optimum) that 
export the model to ONNX differently - Note that while some older versions are supported, it is recommended to use the latest package versions. ### Usage To use the optimizations, please see `README.md` for details. Please note the various `requirements.txt` files for the package versions recommended in order to use these changes. To run the ORT transformer optimizer separately, run the script as follows: ``` $ cd onnxruntime/onnxruntime/python/tools/transformers/ $ python3 optimizer.py --input <filename>.onnx --output <filename>.onnx --model_type gpt2 --num_heads <number of attention heads> --hidden_size <attention hidden size> --use_external_data_format --opt_level 0 ``` ### Motivation and Context This PR helps the following issues: - https://github.com/microsoft/onnxruntime/issues/14997 - https://github.com/microsoft/onnxruntime/issues/16254 - https://github.com/microsoft/onnxruntime/issues/17681 - https://github.com/microsoft/onnxruntime/issues/17925 - https://github.com/microsoft/onnxruntime-inference-examples/issues/320 This PR uses changes from the following PRs: - https://github.com/pytorch/pytorch/pull/104468 - https://github.com/pytorch/pytorch/pull/109759 - https://github.com/microsoft/onnxruntime/pull/17020 - https://github.com/microsoft/onnxruntime/pull/17674 - https://github.com/microsoft/onnxruntime/pull/17890 - https://github.com/microsoft/onnxruntime/pull/17920 - https://github.com/huggingface/transformers/pull/26162 - https://github.com/huggingface/optimum/pull/1257 - https://github.com/huggingface/optimum/pull/1289 - https://github.com/huggingface/optimum/pull/1462 ### New TorchDynamo Exporter (experimental stage) This PR uses changes from the following issues and PRs to begin supporting the [new TorchDynamo exporter](https://pytorch.org/docs/stable/onnx.html#torchdynamo-based-onnx-exporter): - https://github.com/huggingface/transformers/pull/26307 - https://github.com/pytorch/pytorch/issues/104903 - https://github.com/pytorch/pytorch/pull/105040 
- https://github.com/microsoft/onnxscript/pull/847 - https://github.com/microsoft/onnxscript/pull/862 - https://github.com/microsoft/onnxscript/issues/493	2023-10-23 13:00:56 -07:00
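The rotary-embedding cache reduction listed in the LLaMA commit above, from `(max_sequence_length, head_size)` to `(max_sequence_length, head_size / 2)`, works because in the usual rotary formulation each cos/sin value is shared by a pair of channels. The sketch below illustrates that idea in NumPy; the function names, the `base` default, and the handling of the interleaved flag are illustrative assumptions, not the ORT kernel.

```python
import numpy as np


def rotary_cache(max_seq_len, head_size, base=10000.0):
    """Build half-width cos/sin caches of shape (max_seq_len, head_size // 2)."""
    inv_freq = 1.0 / (base ** (np.arange(0, head_size, 2) / head_size))
    angles = np.outer(np.arange(max_seq_len), inv_freq)
    return np.cos(angles), np.sin(angles)


def apply_rotary(x, cos, sin, interleaved=False):
    """Rotate channel pairs of x (seq_len, head_size) by position-dependent angles."""
    seq_len, head_size = x.shape
    half = head_size // 2
    cos, sin = cos[:seq_len], sin[:seq_len]
    if interleaved:
        x1, x2 = x[:, 0::2], x[:, 1::2]   # pairs are (0,1), (2,3), ...
    else:
        x1, x2 = x[:, :half], x[:, half:]  # pairs are (i, i + half)
    r1 = x1 * cos - x2 * sin
    r2 = x1 * sin + x2 * cos
    out = np.empty_like(x)
    if interleaved:
        out[:, 0::2], out[:, 1::2] = r1, r2
    else:
        out[:, :half], out[:, half:] = r1, r2
    return out
```

A full-width cache would simply repeat each of these columns twice, which is why storing only half loses no information.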
Yufeng Li	11af34440a	Add MatMul 4bits support on GPU (#17890 ) ### Description Add a contrib op MatMulNBits and related toolchain to support weight quantization. This PR only adds support for 4 bits. It adds: - a schema for contrib op MatMulNBits, which can support 1-7 bit quantization of weights. - a naive implementation of 4-bit MatMulNBits on CPU and GPU, i.e., implemented like MatMul(A, Dequantize(B)). - a special GemV implementation for 4-bit MatMulNBits and a related benchmark tool. - a tool to quantize a model to 4 bits. Next: - add general and more efficient kernels for 4-bit MatMulNBits on CPU and GPU	2023-10-13 16:55:30 -07:00
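As a rough illustration of the MatMul(A, Dequantize(B)) path named in the MatMulNBits commit above, here is a NumPy sketch of blockwise 4-bit affine quantization followed by a plain matmul. It assumes a per-block scale and minimum along the K dimension and keeps one code per byte for readability; the real operator's packed layout, zero-point convention, and function names are not shown in the source, so everything below is hypothetical.

```python
import numpy as np


def quantize_4bit_blockwise(w, block_size=32):
    """Quantize a (K, N) float matrix to 4-bit codes, one scale/min per block of K."""
    k, n = w.shape
    assert k % block_size == 0, "K must be a multiple of block_size in this sketch"
    blocks = w.reshape(k // block_size, block_size, n)
    w_min = blocks.min(axis=1, keepdims=True)
    w_max = blocks.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / 15.0            # 4 bits -> 16 levels, codes 0..15
    scale = np.where(scale == 0, 1.0, scale)  # constant blocks: any scale works
    codes = np.clip(np.rint((blocks - w_min) / scale), 0, 15).astype(np.uint8)
    return codes, scale, w_min


def dequantize_4bit_blockwise(codes, scale, w_min):
    """Invert the affine mapping and restore the (K, N) shape."""
    blocks = codes.astype(np.float32) * scale + w_min
    return blocks.reshape(-1, blocks.shape[-1])


def matmul_4bit_naive(a, codes, scale, w_min):
    """Naive path: materialize the float weight, then call a normal matmul."""
    return a @ dequantize_4bit_blockwise(codes, scale, w_min)
```

The naive path pays for a full dequantized copy of B on every call, which is presumably why the commit lists more efficient kernels as follow-up work.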
Zhang Lei	762703e037	Support output cross qk, dtw and more for whisper model (#17500 ) Support cross QK in beam search for the Whisper model, plus related features. Make the Whisper export tools support cross QK and some related features: * extra_decoding_ids * no_speech_prob Implement a DTW kernel and an unfold tensor kernel, with unit tests. Several fixes for running multiple sessions in parallel, such as: * guard multihead_attention, fused_fp16_runner_ * some memory allocation with stream awareness * add use_ep_level_unified_stream option	2023-10-13 11:47:15 -07:00
pengwa	63dc5dc1a9	Add document for PythonOp (#17888 ) ### Add document for PythonOp https://github.com/microsoft/onnxruntime/blob/pengwa/pythonop_doc/docs/ORTModule_PythonOp_Notes.md	2023-10-12 08:36:22 +08:00
aciddelgado	406cd324e0	[CUDA] GroupQueryAttention operator using FlashAttention (#17674 ) ### Description Added Group Query Attention op, supporting integer multiple number of heads for Q / KV. As of now, this op can only use FlashAttention kernel, meaning it only supports sm>=80 on Linux. Results from onnxruntime/test/python/transformers/benchmark_gqa.py show an on-average ~37% speed-up over Decoder Masked Multi-Head Attention, with even greater improvements for long past sequence lengths. ``` op batch s_kv heads h_dim ms TFLOPS gqa 16 2048 8 32 0.34 0.10 dmmha 16 2048 8 32 0.39 0.09 --------- gqa 16 2048 8 64 0.45 0.15 dmmha 16 2048 8 64 0.61 0.11 --------- gqa 16 2048 8 128 0.54 0.25 dmmha 16 2048 8 128 0.83 0.16 --------- gqa 16 2048 16 32 0.45 0.15 dmmha 16 2048 16 32 0.69 0.10 --------- gqa 16 2048 16 64 0.69 0.19 dmmha 16 2048 16 64 0.83 0.16 --------- gqa 16 2048 16 128 0.71 0.38 dmmha 16 2048 16 128 1.28 0.21 --------- gqa 16 2048 32 32 0.58 0.23 dmmha 16 2048 32 32 0.77 0.17 --------- gqa 16 2048 32 64 0.58 0.46 dmmha 16 2048 32 64 1.25 0.21 --------- gqa 16 2048 32 128 0.76 0.71 dmmha 16 2048 32 128 2.15 0.25 --------- gqa 16 2048 64 32 0.68 0.39 dmmha 16 2048 64 32 1.23 0.22 --------- gqa 16 2048 64 64 0.77 0.70 dmmha 16 2048 64 64 2.11 0.25 --------- gqa 16 2048 64 128 1.10 0.97 dmmha 16 2048 64 128 4.06 0.26 --------- gqa 16 2048 128 32 1.00 0.54 dmmha 16 2048 128 32 2.09 0.26 --------- gqa 16 2048 128 64 1.10 0.97 dmmha 16 2048 128 64 4.08 0.26 ``` ### Motivation and Context As of now, this op is targeted for use on LLama models, as it supports kv-caching and different number of heads for Q and KV (Grouped Query Attention). We plan to add support for more platforms, input formats, etc. in the future. --------- Co-authored-by: Tianlei Wu <tlwu@microsoft.com> Co-authored-by: tlwu@microsoft.com <tlwu@a100.crj0ad2y1kku1j4yxl4sj10o4e.gx.internal.cloudapp.net>	2023-10-09 12:43:12 -07:00
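The GroupQueryAttention commit above describes an op where the number of Q heads is an integer multiple of the number of KV heads. A minimal NumPy reference for that idea is sketched below. It is not the FlashAttention kernel: causal masking, KV caching, and the op's actual input layout are omitted, and the function names and tensor shapes are assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def grouped_query_attention(q, k, v):
    """q: (num_heads, s_q, d); k, v: (kv_num_heads, s_kv, d)."""
    num_heads, _, d = q.shape
    kv_num_heads = k.shape[0]
    assert num_heads % kv_num_heads == 0, "Q heads must be a multiple of KV heads"
    group = num_heads // kv_num_heads
    # Each KV head serves `group` consecutive Q heads.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    return softmax(scores) @ v
```

When `kv_num_heads == num_heads` this reduces to ordinary multi-head attention; with fewer KV heads, the K/V tensors (and any cache holding them) shrink by the group factor.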
kyoshisuki	ba72bb6f98	Fix a typo in ABI_Dev_Notes.md (#17832 )	2023-10-09 07:51:34 -07:00
Hector Li	385fab5bae	[QNN EP] Qnn cache improvement (#17757 ) ### Description Improve the QNN context binary cache feature to reduce memory overhead and initialization time. Instead of dumping a QNN context binary file with metadata as a header, we dump an ONNX-format file with the metadata inside an ONNX node. ### Motivation and Context Reduce memory overhead and initialization time.	2023-10-06 15:56:33 -07:00
liqun Fu	2be4dc6d04	ONNX 1.15 integration (#17125 ) ### Description this is for ORT 1.17.0 - make ORT to use ONNX release 1.15.0 branch. Eventually will update to the release tag once ONNX 1.15.0 is released ### Motivation and Context Prepare for ORT 1.17.0 release. People can start work on new and updated ONNX ops in ORT. --------- Signed-off-by: Liqun Fu <liqfu@microsoft.com>	2023-09-26 14:44:48 -07:00
Nicolò Lucchesi	4ab0e17fe8	[Technical docs] Fixed a couple of old links in `FAQ.md` (#17415 ) ### Description Updated a couple of old links in the technical documentation that were pointing to files present prior to the migration to https://onnxruntime.ai/docs.	2023-09-26 13:38:24 -07:00
Vincent Wang	e6301eee6a	Bump Up Version to 1.17.0 (#17587 ) Bump up version to 1.17.0 as the 1.16.0 release branch had been branched out.	2023-09-20 11:02:58 +08:00
Adrian Lizarraga	dea425e7c1	[QNN/CPU EP] Add 16-bit Quantize/Dequantize contrib ops (#17015 ) ### Description - Adds 16-bit integer support to: - Quantization kernel implementations: Intel, Neon, and Power intrinsics - DequantizeLinear and QuantizeLinear contrib ops - QNN EP Quantize and Dequantize operators - Python quantization scripts - Disables QDQ fusions for most 16-bit QDQ node groups (need to add 16-bit support to QLinear* ops) - Retains support for dropping QDQ nodes from Split, Gather, Reshape, Transpose, Squeeze, and Unsqueeze node groups. Sample python code to generate QDQ model with 16-bit activations and 8-bit weights: ```python quantize_static( input_model_path, output_model_path, data_reader, quant_format=args.quant_format, per_channel=args.per_channel, activation_type=QuantType.QUInt16, weight_type=QuantType.QUInt8, extra_options={"DedicatedQDQPair": True, "ForceQuantizeNoInputCheck": True, "UseQDQContribOps": True}, ) ``` Note that enabling the `UseQDQContribOps` extra option is not strictly necessary. If the 16bit types are used without enabling `UseQDQContribOps`, the QDQ ops domains are overridden to 'com.microsoft', and a warning is printed to stdout. ### Automated Tests MLAS/CPU EP: - [x] 16-bit QuantizeLinear computation - [x] 16-bit DequantizeLinear computation Optimizer: - [x] Transpose QDQ fusion - [x] Gather QDQ fusion - [x] Reshape QDQ fusion - [x] Squeeze QDQ fusion - [x] Unsqueeze QDQ fusion - [x] Split drop QDQ - [x] DoubleQDQPairRemover - [x] Transpose optimization - [x] EnsureUniqueDQForNodeUnit - [x] Common subexpression elimination (DQ not removed) - [x] Constant folding QNN EP: - [x] Conv 16-bit activations, 8-bit weights - [x] MatMul 16-bit activations, 8-bit weights - [x] Unary 16-bit QDQ ops - [x] Binary 16-bit QDQ ops Quantization tool: - [x] Test creation of 16-bit QDQ model ### Motivation and Context Support mixed precision (8bit weights, 16bit activations) models. 
--------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2023-09-18 09:43:34 -07:00
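For readers unfamiliar with what 16-bit Quantize/Dequantize computes, the sketch below shows linear quantization to unsigned 16-bit in NumPy, following the standard ONNX QuantizeLinear formula (round to nearest even, add the zero point, saturate). The function names and the example scale and zero point are illustrative assumptions, not values or code from the commit above.

```python
import numpy as np


def quantize_linear_u16(x, scale, zero_point):
    """y = saturate(round(x / scale) + zero_point), saturating to [0, 65535]."""
    q = np.rint(np.asarray(x, dtype=np.float64) / scale) + zero_point
    return np.clip(q, 0, 65535).astype(np.uint16)


def dequantize_linear_u16(q, scale, zero_point):
    """x ~= (q - zero_point) * scale; widen first so the subtraction cannot wrap."""
    return (q.astype(np.int32) - zero_point) * scale


# Example: map roughly [-1, 1] onto the full uint16 range.
scale = 2.0 / 65535
zero_point = 32768
x = np.linspace(-1.0, 1.0, 11)
roundtrip = dequantize_linear_u16(quantize_linear_u16(x, scale, zero_point),
                                  scale, zero_point)
```

Compared with 8-bit activations over the same range, the step size here is 257 times smaller (65535 / 255), which is the motivation for mixing 16-bit activations with 8-bit weights.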
Nat Kershaw (MSFT)	a2fba28f6c	Remove extraneous javascript includes (#17558 )	2023-09-14 20:43:24 -07:00
Nat Kershaw (MSFT)	bbcf4b45dc	Upgrade doxygen to 1.9.8 (#17525 )	2023-09-12 20:44:27 -07:00
Baiju Meswani	5d2c57363f	Sign CUDA Kernel (#17293 )	2023-08-28 21:03:58 -07:00
Adrian Lizarraga	5a83a67f32	Support QDQ transformations with com.microsoft.Quantize/Dequantize ops (#17127 ) ### Description - Enables int32 support for com.microsoft.DequantizeLinear (contrib op) - Makes the `zero_point` input optional for Quantize/Dequantize contrib ops - Enables QDQ transformations with the Quantize/Dequantize contrib ops - Update tests: EnsureUniqueDQForNodeUnitTests, QDQTransformerTests, TransposeOptimizerTests ### Testing List of tested graph transformations: - [x] QDQSelectorActionTransformer - qdq_transformer_test.cc - [x] QDQS8ToU8Transformer - qdq_transformer_test.cc - [x] DoubleQDQPairsRemover - qdq_transformer_test.cc - [x] IdenticalChildrenConsolidation - qdq_transformer_test.cc - [x] QDQPropagation - qdq_transformer_test.cc - [x] QDQFinalCleanup - qdq_transformer_test.cc - [x] CliQuantFusion - qdq_transformer_test.cc - [x] ReluQuantFusion - qdq_transformer_test.cc - [x] EnsureUniqueDQForNodeUnit - ensure_unique_dq_for_node_unit_test.cc - [x] TransposeOptimizer - transpose_optimizer_test.cc - [x] CommonSubexpressionElimination - graph_transform_test.cc - [x] ConstantFolding - graph_transform_test.cc ### Motivation and Context We need to [support mixed 16-bit/8-bit precision QDQ models](https://github.com/microsoft/onnxruntime/pull/17015). This PR is the first step in achieving this goal: we need to make QDQ contrib ops work with our optimizations/transformations. --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com> Co-authored-by: Scott McKay <skottmckay@gmail.com>	2023-08-25 09:57:51 -07:00
pengwa	d90afc697b	Introduce ZeROOffloadSubscriber for ORTModule (#17006 ) ### Introduce ZeROOffloadSubscriber for ORTModule As part of the work to integrate ORTModule with DeepSpeed stage 3, this PR mainly focuses on moving the original PyTorch-based (hook-based) param partition/offload implementation to an ORTModule-compatible implementation. Changes include: 1. Refactor `SubscriberBase`/`SubcriberManager` to support pre-forward/post_forward hooks. 2. Implement the new `ZeROOffloadSubscriber` by reusing DeepSpeed hook functions as much as possible. All hook functions are defined in `DeepSpeedZeRoOffload._register_hooks_recursively` and `DeepSpeedZeRoOffload.setup_zero_stage3_hooks`, and their closures are not complex: every hook references the owning `DeepSpeedZeRoOffload` instance. So we can create a new hook function with `FunctionType` by binding the owning `DeepSpeedZeRoOffload` instance, then call the newly created function in the subscriber's `pre_forward_module_apply_impl` and `post_forward_module_apply_impl` interfaces. 3. Monkey patch `DeepSpeedZeRoOffload.setup_zero_stage3_hooks` to register the `ZeROOffloadSubscriber` for the model, so we don't need to change any code in the DeepSpeed repo (at least so far). 4. Fix the ATen embedding custom symbolic exporter function by tolerating a weight size of (0) (changed by DeepSpeed ZeRO stage 3). Unit tests will be added once stage 3 is fully supported.	2023-08-25 00:15:22 +08:00
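Item 2 of the ZeROOffloadSubscriber commit above mentions creating a new hook function with `FunctionType` by binding the owning instance. The following standalone sketch shows the general Python technique of rebuilding a closure with a different captured object. The `Owner` class, `make_hook`, and `rebind_hook` are hypothetical stand-ins and do not reproduce the DeepSpeed or ORTModule code.

```python
import types


class Owner:
    """Hypothetical stand-in for the object that the hooks close over."""

    def __init__(self, tag):
        self.tag = tag


def make_hook(owner):
    """Return a hook whose closure captures `owner`."""
    def hook(module_name):
        return f"{owner.tag}:{module_name}"
    return hook


def rebind_hook(hook, new_owner):
    """Copy `hook`, replacing its single closure cell with `new_owner`."""
    # This sketch only handles a closure with exactly one free variable.
    assert hook.__code__.co_freevars == ("owner",)
    cell = types.CellType(new_owner)  # available since Python 3.8
    return types.FunctionType(
        hook.__code__,      # reuse the compiled body unchanged
        hook.__globals__,
        hook.__name__,
        hook.__defaults__,
        (cell,),            # new closure: one cell per free variable
    )


original = make_hook(Owner("a"))
rebound = rebind_hook(original, Owner("b"))
```

The original hook is left untouched, so both versions can coexist; only the rebuilt copy sees the new owner.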
Emmanuel Ferdman	08ca624d2b	Fix: update hyperlinks to the Jupyter notebooks (#16145 ) ### Description This PR fixes broken hyperlinks in the documentation that should lead users to Jupyter notebooks. Currently, the hyperlinks are not working as intended. The PR resolves this issue by updating the hyperlinks to correctly direct users to the Jupyter notebooks. ### Motivation and Context It fixes broken hyperlinks leading to the Jupyter notebooks.	2023-08-21 09:53:05 -07:00
Wenbing Li	d052c8a45c	Remove the extensions submodule (#17097 ) ### Description Remove the onnxruntime-extensions submodule, since it is now consumed via CMake FetchContent. ### Motivation and Context The submodule relies on an outdated version of the extensions, and the build instructions should be updated to eliminate any confusion.	2023-08-14 10:16:33 -07:00

635 commits