Commit graph

635 commits

Author SHA1 Message Date
liqun Fu
e10a8ae31f
reduce max/min 20 (#17805)
### Description
reducemax/min have been updated in onnx(20). implement it in ort



### Motivation and Context
this is for ort1.17.0 release

---------

Signed-off-by: Liqun Fu <liqfu@microsoft.com>
2024-01-04 17:41:01 -08:00
Jeff Bloomfield
7401b6661d Update OperatorKernels.md 2024-01-04 11:27:03 -08:00
Jeff Bloomfield
8ea3e68192 Update ContribOperators.md 2024-01-04 10:10:46 -08:00
liqun Fu
32fcf73740
Implement dft(20) (#17821)
### Description
dft is updated in opset20. implement it in ort



### Motivation and Context
this is for ort 1.17.0 release

Fixes #17723

---------

Signed-off-by: Liqun Fu <liqfu@microsoft.com>
2023-12-19 10:42:54 -08:00
luoyu-intel
5f00bc9931
Integrate high-performance x64 gemm library to MLAS (#17669)
### Description
Improve MLAS to support high-performance x64 INT4 kernels



### Motivation and Context
1. improve LLM inference performance on Intel CPUs.
2. support more 4bit quantization types: nf4, fp4
3. support dynamic block size: block size aligned with kernel's tiling
size(e.g. 4 for VNNI kernel), per channel on N dimension
4. support most Intel ISAs: avx2, avx_vnni, avx512f, avx512_vnni,
amx_bf16, amx_int8, avx512_fp16
5. support MatMulNBits' data format

### Tasks
- [x] support block_size: 32, 128, -1(per channel)
- [x] get weight pack size without memory allocation
- [x] use ort's thread pool for parallelism
- [x] support ISAs: avx2, avx512f, avx_vnni, avx512_vnni, amx_int8

### Benchmark
Ubuntu 20.22 + Intel(R) Xeon(R) Platinum 8480+ 56 cores

Benchmark | Time | CPU | Iterations
-- | -- | -- | --
Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:4096/Threads:56/real_time | 47613
| 47401 | 12970
Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:4096/Threads:56/real_time |
6347792 | 6317562 | 109
Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:4096/Threads:56/real_time |
11814014 | 11757847 | 59
Q4GEMM_Jblas/Q4G128SymInt8/M:1/N:4096/K:4096/Threads:56/real_time |
50222 | 50031 | 13759
Q4GEMM_Jblas/Q4G128SymInt8/M:1024/N:4096/K:4096/Threads:56/real_time |
2038222 | 2028743 | 341
Q4GEMM_Jblas/Q4G128SymInt8/M:2048/N:4096/K:4096/Threads:56/real_time |
3792832 | 3774485 | 191
Q4GEMM_Jblas/Q4GPerNSymInt8/M:1/N:4096/K:4096/Threads:56/real_time |
58717 | 58501 | 11467
Q4GEMM_Jblas/Q4GPerNSymInt8/M:1024/N:4096/K:4096/Threads:56/real_time |
1360846 | 1354598 | 543
Q4GEMM_Jblas/Q4GPerNSymInt8/M:2048/N:4096/K:4096/Threads:56/real_time |
2564232 | 2551365 | 266
Q4GEMM_Jblas/Q4G32SymFp32/M:1/N:4096/K:4096/Threads:56/real_time | 57929
| 57694 | 12047
Q4GEMM_Jblas/Q4G32SymFp32/M:1024/N:4096/K:4096/Threads:56/real_time |
5495330 | 5465810 | 126
Q4GEMM_Jblas/Q4G32SymFp32/M:2048/N:4096/K:4096/Threads:56/real_time |
10676240 | 10617817 | 66
Q4GEMM_Jblas/Q4G128SymFp32/M:1/N:4096/K:4096/Threads:56/real_time |
68305 | 68047 | 10026
Q4GEMM_Jblas/Q4G128SymFp32/M:1024/N:4096/K:4096/Threads:56/real_time |
5504862 | 5476215 | 126
Q4GEMM_Jblas/Q4G128SymFp32/M:2048/N:4096/K:4096/Threads:56/real_time |
11758623 | 11697337 | 66
Q4GEMM_Jblas/Q4GPerNSymFp32/M:1/N:4096/K:4096/Threads:56/real_time |
67713 | 67451 | 10298
Q4GEMM_Jblas/Q4GPerNSymFp32/M:1024/N:4096/K:4096/Threads:56/real_time |
5508325 | 5480237 | 126
Q4GEMM_Jblas/Q4GPerNSymFp32/M:2048/N:4096/K:4096/Threads:56/real_time |
10738528 | 10681656 | 64
Q4GEMM_Jblas/Q4G32AsymFp32/M:1/N:4096/K:4096/Threads:56/real_time |
60708 | 60486 | 11321
Q4GEMM_Jblas/Q4G32AsymFp32/M:1024/N:4096/K:4096/Threads:56/real_time |
5523784 | 5495736 | 126
Q4GEMM_Jblas/Q4G32AsymFp32/M:2048/N:4096/K:4096/Threads:56/real_time |
10829633 | 10772161 | 67


Reference:

Benchmark | Time | CPU | Iterations
-- | -- | -- | --
Q4GEMM/Q4Sym/M:1/N:4096/K:4096/Threads:56/real_time | 53088 | 52911 |
13364
Q4GEMM/Q4Sym/M:1024/N:4096/K:4096/Threads:56/real_time | 6268981 |
6230335 | 110
Q4GEMM/Q4Sym/M:2048/N:4096/K:4096/Threads:56/real_time | 11701237 |
11632339 | 59

Win11+12900K 8 cores:
Benchmark | Time | CPU | Iterations
-- | -- | -- | --
Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:4096/Threads:8/real_time | 215976
| 211295 | 2884
Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:4096/Threads:8/real_time |
60960590 | 60937500 | 10
Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:4096/Threads:8/real_time |
1.18E+08 | 1.19E+08 | 5
Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:11008/K:4096/Threads:8/real_time |
470377 | 453059 | 1414
Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:11008/K:4096/Threads:8/real_time |
1.54E+08 | 1.53E+08 | 5
Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:11008/K:4096/Threads:8/real_time |
3.18E+08 | 3.13E+08 | 2
Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:11008/Threads:8/real_time |
569072 | 559398 | 1229
Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:11008/Threads:8/real_time |
1.54E+08 | 1.52E+08 | 4
Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:11008/Threads:8/real_time |
3.22E+08 | 3.28E+08 | 2
Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:11008/K:11008/Threads:8/real_time |
1486055 | 1473325 | 403
Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:11008/K:11008/Threads:8/real_time |
4.14E+08 | 4.14E+08 | 2
Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:11008/K:11008/Threads:8/real_time |
8.88E+08 | 8.59E+08 | 1

---------

Signed-off-by: Mengni Wang <mengni.wang@intel.com>
Co-authored-by: Mengni Wang <mengni.wang@intel.com>
2023-12-19 09:36:31 -08:00
pengwa
ccf3b2054b
Allow layer-wise recompute (#18566)
### Allow layer-wise recompute 

Early, we need users/developers to specify the subgraphs to recompute,
now we introduced a more user-friendly way to enable recompute for all
detected stashed activation recomputation subgraphs. This scarifies
getting the best configs while makes it easier to support user
requirements when they switches from PyTorch per-layer gradient
checkpoint to ORTModule.

`ORTMODULE_MEMORY_OPT_LEVEL` is introduced to control the usage, by
default, it is 0, e.g. `USER_SPECIFIED`, all subgraphs definedin
`ORTMODULE_MEMORY_OPT_CONFIG` will be recomputed. So this is compatible
to existing recompute usage in ORTModule integrated models.

Using `ORTMODULE_MEMORY_OPT_LEVEL=1`, we will enable all recompute plans
detected, so those configs in `ORTMODULE_MEMORY_OPT_CONFIG` will not be
respected any more.


Add Unit Tests using 3 layer blooms. 



https://github.com/microsoft/onnxruntime/blob/pengwa/add_aggresive_recompute/docs/Memory_Optimizer.md
2023-12-12 08:44:05 +08:00
Xavier Dupré
d41dd77241
Extend API page on the python documentation (#18762) 2023-12-09 15:33:57 -08:00
Hector Li
9768a727e1
[QNN EP] Fix a bug that can't create context binary if the model has inputs/outputs with different data type (#18722)
Fix a bug that can't create context binary if the model has inputs/outputs with different data type

### Description
Update EPContext op schema to unblock nodes with different data type among inputs & outputs
2023-12-06 13:07:09 -08:00
pengwa
4bfa84487c
Skip module clone for preparing large model export (#18663)
### Skip module clone for preparing large model export

For LLAMA2 13B, when running with Lora, DeepSpeed stage2 on 8 GPUs . It
failed during preparing outputs which will be used for
torch.onnx.export. The reason, we deep copy all the params including
both big sizes of frozen weights, + a little bit of Lora trainable
weight.

This PR will firstly check whether the GPU memmory is enough for a
cloned module, if not, skip the copy.

Copying the module is to guarantee the fw path run may change the
weight, while this case should be rare. But for now, Not-Able-To-Run is
worse than Runnable-with-A-little-bit-different-initial-weight,
especially for large models.
2023-12-05 12:41:17 -08:00
Vincent Wang
e1d1033131
[ORTModule] Remove Unused Arguments from Generated Triton Code (#18636)
This PR:
- Remove unused arguments from generated triton code,
- Remove unnecessary mask for symbolic shape case from generated triton
code.
- Add doc for usage of ORTMODULE_TRITON_CONFIG_FILE.
2023-11-30 18:32:36 +08:00
Dmitri Smirnov
d2dfbf4179
Add float16 type support to SplitToSequence and make code type independent (#18594)
### Description
Add support for `float16` type to address the below issue.
Re-work the code to make it type independent.
This reduces binary size by ~11 K.


![image](https://github.com/microsoft/onnxruntime/assets/11303988/1a77c7bc-34a8-478c-a16a-abd94062c6c6)


### Motivation and Context
This PR addresses https://github.com/microsoft/onnxruntime/issues/18481
2023-11-29 10:44:59 -08:00
pengwa
43a5147e01
Memory optimization refactor and refinement (#17481)
### Memory optimization refactor and refinement

Currently memory optimizer runs graph transformations and print
recompute opportunities in INFO level, while ORT backend has many many
INFO level logs making users hard to find those information. So we are
looking for a Python binding API to retrieve the memory optimization
opportunities instead of depending on the MemoryOptimizer's default
logging.
Then we can print ORTModule feature statistics using this information. 
Also, with such an API, we can create an ORT session created, where
allocation plan is done, the analysis will consider buffer reuse as
well. This can void giving some recomputation subgraphs that are reusing
other subgraphs' output buffers.

Check
https://github.com/microsoft/onnxruntime/blob/pengwa/add_devinfo_level/docs/Memory_Optimizer.md
for the new flow using `MemoryOptimizer`.

This pull requests made following refactoring:
1. Print the log in ORTModule Python script, along with ORTModule
feature enabling stats. This is implemented by exposing an API
`get_serialized_ortmodule_memory_stat` to retrieve the memory
optimization opportunities.
2. We are analyzing memory optimization opportunities considering ORT
memory planning. This is done by firstly creating the execution graph
without enabling MemoryOptimizer, then we call
`execution_agent.get_serialized_ortmodule_memory_stat` which internally
will consider the session memory allocation planner when analyzing
memory optimization opportunity. As a direct result, the memory
optimization opportunities can show those stashed activations that are
reusing other buffers.
3. Move recompute analysis logic from memory_optimizer.h/cc to
recompute_analysis.h/cc.
4. Abstract optimization strategies for their own implementation. This
will make introducing new strategies (for example compression and
decompression ) easier.

New logging matrix (INFO Level), in WARNING level, the details will NOT
show.
```
2023-09-13 13:25:09,249 orttraining.rank-0 [WARNING] -
***** ONNX Runtime Training (ORTModule) is accelerating your model *****

ORTModule is enabled with following features ON/OFF for [training] mode:

  ATen Executor         :   ON    :   Dispatch ATen operators to ORT's ATen executor
  Cast Propagation      :   ON    :   Level 1 enabled
  Custom Function       :   ON    :   Support custom torch.autograd.Function export and execution
  Memory Optimizer      :   ON    :   RecomputeConfig: Reshape+Where+BiasSoftmax+:1:-1,Cast+:1:-1, ProbeLevel: 1, available configs:
                                      Config                                                      Freq    Saving(B)       Saving Symbolic(Bytes)
   - Plan 1             :   ON    :   Reshape+Where+BiasSoftmax+:1:-1                             5       671,088,640     640.0*inputs_input_ids_dim0*inputs_input_ids_dim1**2
   - Plan 2             :   ON    :   Cast+:1:-1                                                  6       402,587,648     inputs_input_ids_dim0*inputs_input_ids_dim1*(384.0*inputs_input_ids_dim1 - 64.0)
   - Plan 3             :   OFF   :   Reshape+Where+:1:-1                                         1       134,217,728     128.0*inputs_input_ids_dim0*inputs_input_ids_dim1**2
   - Plan 4             :   OFF   :   BiasSoftmax+:1:-1                                           1       134,086,656     128.0*inputs_input_ids_dim0*inputs_input_ids_dim1*(inputs_input_ids_dim1 - 1)
   - Plan 5             :   OFF   :   BiasGelu+:1:-1                                              6       125,808,640     inputs_input_ids_dim0*(122880.0*inputs_input_ids_dim1 - 20480.0)
   - Plan 6             :   OFF   :   FusedMatMul+:1:-1                                           6       125,808,640     inputs_input_ids_dim0*(122880.0*inputs_input_ids_dim1 - 20480.0)
   - Plan 7             :   OFF   :   FusedMatMul+Add+FusedMatMul+Add+Add+Add+:1:-1               5       26,214,400      25600.0*inputs_input_ids_dim0*inputs_input_ids_dim1
   - Plan 8             :   OFF   :   Add+:1:-1                                                   1       5,237,760       5120.0*inputs_input_ids_dim0*(inputs_input_ids_dim1 - 1)
   - Plan 9             :   OFF   :   Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+:1:-1         1       4,096           4.0*inputs_input_ids_dim0*inputs_input_ids_dim1
   - Plan 10            :   OFF   :   Cast+:2:-1                                                  1       2,048           2.0*inputs_input_ids_dim0*inputs_input_ids_dim1
  Compute Optimizer     :   ON    :   Enable/Disable with env ORTMODULE_ENABLE_COMPUTE_OPTIMIZER=1/0
   - FLOPReduction      :   ON    :   Reduce FLOPs by upstreaming shrinking-sized ops
  Auto Fallback         :   ON    :   Fallback to PyTorch when encountering unsupported ops
  TritonOp Enabled      :   OFF   :   ORT will switch to Triton for executing some ops to further accelerate training.
  ZeRO Stage3 Support   :   OFF   :   Enable/Disable with env ORTMODULE_ENABLE_ZERO_STAGE3=1/0

Total ORT initialization overhead is 10.73s where export takes 8.39s.
Other overhead details:  graph builder init takes 0.06s, runtime detection takes 0.01s, graph building takes 0.31s, session creation takes 1.96s

Versions: ONNX Runtime - 1.16.0+cu118, ONNX - 1.11.0

Note 1: use comma to enable multiple plans at the same time.
  export ORTMODULE_MEMORY_OPT_CONFIG=<plan1 config>,<plan2 config>,...
Note 2: saving is calculated based on the 1st batch symbolic dim values:
  inputs_input_ids_dim0=1,
  inputs_input_ids_dim1=1024,
  inputs_attention_mask_dim0=1,
  inputs_attention_mask_dim1=1024,
  inputs_labels_dim0=1,
  inputs_labels_dim1=1024,

************************************************************************
```

If DEVINFO level is enabled, then more details about the memory
optimizations are printed.
```

MemoryInsight Summary - User config: BiasGelu+:1:-1,Cast+:2:-1
==========================================================================================================================================
|Freq   | Memory Optimization Opportunities (Clustered by node-level activation patterns)                                                |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|3      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph FusedMatMul+Add+Reshape+                                                                    |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+Add+Reshape+:1:-1                         |
|       |  Stashed Activations:                                                                                                          |
|       |   - ReuseFreq :  Output 0(3),                                                                                                  |
|       |   - Output 0  : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 32 x 240 x ], byte/elem: 2, 100% saved                        |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Reshape+                                                                                    |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+:1:-1                                         |
|       |  Stashed Activations:                                                                                                          |
|       |   - ReuseFreq :  Output 0(2),                                                                                                  |
|       |   - Output 0  : [ x 2560 x ], byte/elem: 2, 100% saved                                                                         |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph FusedMatMul+                                                                                |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+:1:-1                                     |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 10240 x ], byte/elem: 2, 100% saved                           |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Cast+                                                                                       |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Cast+:1:-1                                            |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 2, 100% saved      |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Reshape+Where+BiasSoftmax+                                                                  |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Where+BiasSoftmax+:1:-1                       |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved      |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph BiasGelu+                                                                                   |
|       |  Status       : Enabled, requested count=-1, actual applied count=2                                                            |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 10240 x ], byte/elem: 2, 100% saved                           |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|2      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph FusedMatMul+Add+FusedMatMul+Add+Add+Add+                                                    |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+Add+FusedMatMul+Add+Add+Add+:1:-1         |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x inputs_input_ids_dim1 x 2560 x ], byte/elem: 2, 100% saved                            |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Reshape+Where+                                                                              |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Where+:1:-1                                   |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved      |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph FusedMatMul+                                                                                |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=FusedMatMul+:1:-1                                     |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0*(inputs_input_ids_dim1 - 1) x 10240 x ], byte/elem: 2, 100% saved                       |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Cast+                                                                                       |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Cast+:1:-1                                            |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 - 1 x inputs_input_ids_dim1 x ], byte/elem: 2, 100% saved  |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+                                              |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Reshape+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Cast+:1:-1   |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 1 x 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved                           |
|       |                                                                                                                                |
|       |>>Option 2     : RecomputeWithCompromise subgraph Cast+                                                                         |
|       |  Status       : Enabled, requested count=-1, actual applied count=1                                                            |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 1 x 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 50% saved                            |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph BiasSoftmax+                                                                                |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=BiasSoftmax+:1:-1                                     |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0 x 32 x inputs_input_ids_dim1 - 1 x inputs_input_ids_dim1 x ], byte/elem: 4, 100% saved  |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph BiasGelu+                                                                                   |
|       |  Status       : Enabled, requested count=-1, actual applied count=1                                                            |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0*(inputs_input_ids_dim1 - 1) x 10240 x ], byte/elem: 2, 100% saved                       |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
|1      |For each row options are mutually exclusive, only one of them can be enabled.                                                   |
|       |                                                                                                                                |
|       |>>Option 1     : Recompute subgraph Add+                                                                                        |
|       |  Status       : Disabled. Enable with export ORTMODULE_MEMORY_OPT_CONFIG=Add+:1:-1                                             |
|       |  Stashed Activations:                                                                                                          |
|       |   - Output 0  : [inputs_input_ids_dim0*(inputs_input_ids_dim1 - 1) x 2560 x ], byte/elem: 2, 100% saved                        |
|_ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |
==========================================================================================================================================
Note: use comma as a separator for enabling more than one subgraphs.

************************************************************************

```


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-11-23 11:39:00 +08:00
Vincent Wang
3bc9efc7b2
[ORTModule] Adjust Attention Patterns for Efficient Attention ATen Fallback (#18471)
Adjust attention patterns to match latest Whisper+exporter. Also add
some condition check and add docs.
2023-11-22 15:24:05 +08:00
Jambay Kinley
1af0681554
Bfloat16 support for MatMulBnb4, Training support bitsandbytes>=0.41.2 (#18484)
### Description
<!-- Describe your changes. -->
Add bfloat16 support for `MatMulBnb4` contrib op. This is useful for
QLoRA fine-tuning.
- On GPUs with SM80+ (A100, etc), it uses the native cuda bfloat16
dtype, `nv_bfloat16`. On other GPUs, it uses the onnxruntime `BFloat16`
type which uses float for compute.
- I have validated the op in a llama2-7b training scenario. The losses
match pytorch training and the training throughput is better.
- Cannot add a bfloat16 case in the op unit test since casting BFloat16
to and from float multiple times during the test causes the required
tolerances to be unachievable.

The custom autograd function exporter in onnxruntime-training is updated
to support the latest version of bitsandbytes. They changed how the
`quant_state` is stored.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Enable QLoRA fine-tuning with bfloat16.
2023-11-20 09:52:58 -08:00
kailums
1a29460919
rope support 4D input tensor (#18454)
### Description
<!-- Describe your changes. -->

change RotaryEmbeddings op implementation, add support for 4D input
tensor that is with shape of [batch, num_heads, seq_len, head_size].

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Current RotaryEmbedding op only support 3d input tensor with shape
[batch, seq_len, hidden_size]

For llamav2 model, when using FusionRotaryEmbeddings to only fuse
RotaryEmbeddings op, there will be a transpose operation for query and
key, and then the input tensor of RotaryEmbeddings becomes 4D [batch,
num_heads, seq_len, head_size].

This scenario can't be supported by current RotaryEmbeddings
implementation. So it needs to support 4D input tensor.
2023-11-17 20:38:15 +08:00
aciddelgado
adb56df2e8
Aciddelgado/gqa local (#18375)
### Description
Implement preliminary version of local (sliding window) attention.
Currently only supported by Flash Attention (sm >= 80, Linux). Currently
only supports sliding attention with a large cached kv.



### Motivation and Context
This change enables to run Mistral and other models which use sliding
window attention.
2023-11-16 15:01:06 -08:00
Ye Wang
f9af94009b
onboard MoE (#18279)
### Description
<!-- Describe your changes. -->
1. Introduce MoE CUDA op to ORT based on FT implementation.
2. Upgrade cutlass to 3.1.0 to avoid some build failures on Windows.
Remove patch file for cutlass 3.0.0.
3. Sharded MoE implementation will come with another PR

limitation: __CUDA_ARCH__ >= 700


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-11-14 16:48:51 -08:00
Prathik Rao
7a3da4526f
add bfloat16 support for CUDA Neg kernel (#18306)
### Description
<!-- Describe your changes. -->

Registers BFloat16 datatype as valid input type for CUDA Neg Kernel.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Enabling `meta-llama/Llama-2-70b` to be finetuned with ONNX Runtime
training.

---------

Co-authored-by: Prathik Rao <prathikrao@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2023-11-08 18:32:12 -08:00
pengwa
2151c79bf1
Tune ORTModule logging experience a bit (#18298)
### Tune logging experience a bit

After last time we update the ORTModule log experience, we found few
issues:
1. `INFO` level output too many things, including PyTorch exporter
verbose logs (tracing graphs) on every ranks. On this level, we only
want to
- Output a little bit more information to Users than `WARNING` level,
for example the memory recomputation recommendations or other
not-fully-ready features.
- Output a little bit more information for a quick diagnostic, collected
on rank-0 only.
2. ONNX Runtime logging filter during graph build, session init
sometimes will hide the issues (for example segement fault), there is no
useful information in `WARNING`/`INFO` for users to report to us. This
is not good!
3. Some of our devs like using `pdb` to debug Python code, but if we add
`import pdb; pdb.set_trace()` in models' code might hang when they use
`INFO` or `WARNING`, where exporter happens and all output got
redirected due to log filtering. The only workaround is to switch to
VERBOSE, which output toooooooooooo many logs.

The corresponding changes proposed here are:
1. For `INFO` logging, 
    - We only logs rank-0. 
- We restricted the ORT backend logging level to be WARNING in this
case, because ORT backend code output way too many logs that should be
under verbose, while we cannot guarantee we can get them cleaned up
immediately once they are added.
- We output the PyTorch exporter verbose log (including tracing graph),
which is useful for a quick diagnostic when an issue happens.
2. Remove all logging filtering on ORT backend, then the segment fault
issue details will not be hidden once it happens again.
 3. Introduced a `DEVINFO` logging,
     - Log logs on all ranks
     - Log ORT backend logging level INFO
- PyTorch exporter logging filtering are all turned OFF (to unblock the
pdb debugging).
4. Currently, to use Memory Optimizer, need use DEVINFO (which will
output ORT backend INFO log). So update memory optimizer document to
reflect this. https://github.com/microsoft/onnxruntime/pull/17481 will
update the requirement back to INFO for show memory optimization infos.

You can check
https://github.com/microsoft/onnxruntime/blob/pengwa/devinfo_level/docs/ORTModule_Training_Guidelines.md#log-level-explanations
for a better view of different log levels.

This PR also extract some changes from a bigger one
https://github.com/microsoft/onnxruntime/pull/17481, to reduce its
complexity for review.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: mindest <30493312+mindest@users.noreply.github.com>
2023-11-08 17:42:50 +08:00
aciddelgado
3dece27f51
GQA Flash Attention with Attention Mask (#18283)
### Description
GQA now only works with Flash Attention with Attention Mask input,
allowing for batched input. Note: This PR Disables Memory Efficient
Attention, only allowing Flash Attention kernel to be used.



### Motivation and Context
Allows GQA to work with batched input.

---------

Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>
2023-11-07 17:47:51 -08:00
liqun Fu
6127dd1d2d
implement gridsample 20 (#17744) 2023-11-07 10:42:41 -08:00
Patrice Vignola
800ae7742c
[DML EP] Add RotaryEmbedding (#18158)
This is a graph implementation of RotaryEmbedding since there's no time
to add it to DML before 1.16.2, but it eventually should move into
DirectML since we're bandwidth-bound.
2023-11-07 08:26:11 -08:00
Prathik Rao
8978bdc59d
add bfloat16 support for where operator (#18118)
### Description
<!-- Describe your changes. -->

Adds bfloat16 as a valid input parameter type for where node for ONNX
opset 16+.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Enabling `meta-llama/Llama-2-70b` to be finetuned with ONNX Runtime
training.

---------

Co-authored-by: Prathik Rao <prathikrao@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2023-11-02 12:23:20 -07:00
pengwa
c8e1038eab
Optimize 4bit Qlora training (#18131)
### Optimize 4bit Qlora training

Extent existing `MatmulBnb4bit` to its usage in training scenarios. 

The PR includes following changes:
1. Add special `torch.autograd.Function` export logic for
`bitsandbytes.autograd._functions.MatMul4Bit` that is preferred before
common PythonOp exporter.
2. Add `training_mode` optional attribute for op `MatmulBnb4bit`, which
help skip some inference specific logic in implementation.
3. Add `transB` optional attribute, which is by default be 1; setting it
to be 0 is needed by backward usage.

Changing from `PythonOp` to this `MatmulBnb4bit` brings roughly ~2.9%
throughput gains. The reason is:
`bitsandbytes.autograd._functions.MatMul4Bit` has logic
`ctx.save_for_backward`, which would need an additional copy in
PythonOp, otherwise, the tensor might be released by ORT, while backward
op still references it.

Removing the clones also reduce the peak memory consumptions because
`bitsandbytes.autograd._functions.MatMul4Bit` saved tensors that are not
needed in backward compute.
2023-11-02 09:46:11 -07:00
aciddelgado
178f7caaeb
GQA Memory Efficient Kernel (#17920)
Implement Cutlass Memory Efficient Attention Kernel into Group Query
Attention Operator.

### Motivation and Context
Before this change, Group Query Attention Operator was supported only by
Flash-Attention. While this is the most efficient kernel for the
operation, it only supports sm >= 80. Cutlass Memory Efficient Attention
Kernel supports sm >= 53, allowing us to support a broader range of GPU
hardware.
2023-11-01 20:04:22 -07:00
Preetha Veeramalai
d87216bcb1
Openvino ep ort 23.1 (#17911)
### Description
Integration to OpenVINO 2023.1


### Motivation and Context

- Alignment with latest OpenVINO Version. 
- Device name change from VPUX to NPU and Remove from supported list
until official public support is available.

---------

Co-authored-by: Sahar Fatima <sfatima.3001@gmail.com>
Co-authored-by: Saurabh Kale <saurabh1.kale@intel.com>
Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel.com>
2023-11-01 08:39:39 -07:00
Tianlei Wu
95f053c652
[CUDA] Update GroupNorm and Add SkipGroupNorm (#18091)
* Add a new operator SkipGroupNorm to support skip and bias inputs.
* Update GroupNorm kernel to support number of channels used in SD XLrefiner.
* Add epsilon in kernel
* Add parity and performance test script
* Remove many limitations including max batch size, max number of groups, c % cPerBlock ==0 etc.

### Motivation and Context

Update GroupNorm to support SD XL Refiner and beyond.
2023-10-31 10:27:20 -07:00
Xavier Dupré
b5f242e978
GemmFloat8 as a contrib ops (#16051)
### Description
Add support for Gemm with float 8 as a contrib op.

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>
Co-authored-by: Xavier Dupre <xadupre@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2023-10-27 14:33:55 +02:00
Tang, Cheng
37873be86d
enable reduce ops on opset18 (#18053)
### Description
Opset 18 apply the "axes as input" change from ReduceSum to all the
other reduce ops. Our cuda kernel actually support it, but we didn't
enable it for opset18. This PR update the reduce ops' kernel
registration to enable the "axes as input" behavior for opset18.

As part of the fix, I also simplify the reduce op kernel registration
part. ORT doesn't require the kernel definition need to be exactly the
same as onnx op definition. For our case, which we share the same kernel
for all the reduce ops (from version 1 to version 18), we don't need to
maintain different version of kernel definitions. we can simplify it by
just using a single kernel definition for multiple versions. Although
for some cases, we might register more types for legacy versions, but it
is harmless. Framework is using schema to validate the graph, not kernel
definition.

---------

Co-authored-by: Cheng Tang <chenta@a100.crj0ad2y1kku1j4yxl4sj10o4e.gx.internal.cloudapp.net>
Co-authored-by: Cheng Tang <chenta@microsoft.com>
2023-10-26 16:57:21 -07:00
Jambay Kinley
d30d4d372a
Add MatMul FP4 and NF4 Support (#18066)
### Description
Add a contrib op MatMulBnb4 (FP4 and NF4) and related toolchain to
support quantization on weight.

This PR adds:
- schema for contrib op MatMulBnb4 which can support FP4 (4-bit floating
point) and NF4 (4-bit NormalFloat) quantization on weight.
- a naive implementation for MatMulBnb4 on CPU and GPU, i.e.,
implemented like MatMul(A, Dequantize(B)).
- a special implementation for GemV for MatMulBnb4 and related benchmark
tool.
- tool to quantize model to FP4 or NF4.
2023-10-25 15:34:58 -07:00
liqun Fu
706e13e0c9
implement affinegrid cpu kernel (#17777) 2023-10-25 10:46:04 -07:00
liqun Fu
efa0cc2562
implement isinf20 and isnan20 (#17874) 2023-10-24 10:58:54 -07:00
kunal-vaishnavi
2a17d5cf32
LLaMA Model Optimization (#18021)
### Description
This PR contains fusion-level and kernel-level optimizations for [Meta's
LLaMA-2](https://blogs.microsoft.com/blog/2023/07/18/microsoft-and-meta-expand-their-ai-partnership-with-llama-2-on-azure-and-windows/).

Some of the added optimizations include:

- SimplifiedLayerNorm changes
  - Fusions for multiple variants
- SkipSimplifiedLayerNorm changes
  - Kernel support for CPU
- Rotary embeddings (previously did not exist)
  - Fusions for multiple variants
  - CPU and CUDA kernels
  - Supports interleaving and non-interleaving in the same kernels
  - Optimized cache that requires half of its originally exported sizes
- Reduced from `(max_sequence_length, head_size)` to
`(max_sequence_length, head_size / 2)`
- Multi-head attention
  - Support for 2D and 3D attention masks
- Group query attention (for FP16 CUDA and INT4 CUDA)
  - Integration with flash attention v2 and past-present buffer sharing
- Removes need for `attention_mask` input as it is supported in the
kernel
- 4 bit quantization
  - `block_size` parameter is available for customizing
- Support the new changes for [Microsoft
version](https://github.com/microsoft/Llama-2-Onnx)
- Support combinations of the below variants (ex: export ORT version and
run with Optimum)

Supported variants of LLaMA-2 include:
- [ORT
version](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/python/tools/transformers/models/llama)
- Produces one ONNX file that is already optimized (and quantized if
requested)
  - Integrates with Optimum
- [Another Microsoft version](https://github.com/microsoft/Llama-2-Onnx)
  - Already exported and available off-the-shelf
  - Faster versions of those models will be uploaded there soon
- [Hugging Face version](https://huggingface.co/meta-llama)
  - Models that end with `-hf`
- Some older and current versions of
[`transformers`](https://github.com/huggingface/transformers) and
[`optimum`](https://github.com/huggingface/optimum) that export the
model to ONNX differently
- Note that while some older versions are supported, it is recommended
to use the latest package versions.

### Usage

To use the optimizations, please see `README.md` for details. Please
note the various `requirements.txt` files for the package versions
recommended in order to use these changes.

To run the ORT transformer optimizer separately, run the script as
follows:
```
$ cd onnxruntime/onnxruntime/python/tools/transformers/
$ python3 optimizer.py --input <filename>.onnx --output <filename>.onnx --model_type gpt2 --num_heads <number of attention heads> --hidden_size <attention hidden size> --use_external_data_format --opt_level 0
```

### Motivation and Context
This PR helps the following issues:
- https://github.com/microsoft/onnxruntime/issues/14997
- https://github.com/microsoft/onnxruntime/issues/16254
- https://github.com/microsoft/onnxruntime/issues/17681
- https://github.com/microsoft/onnxruntime/issues/17925
- https://github.com/microsoft/onnxruntime-inference-examples/issues/320

This PR uses changes from the following PRs:
- https://github.com/pytorch/pytorch/pull/104468
- https://github.com/pytorch/pytorch/pull/109759
- https://github.com/microsoft/onnxruntime/pull/17020
- https://github.com/microsoft/onnxruntime/pull/17674
- https://github.com/microsoft/onnxruntime/pull/17890
- https://github.com/microsoft/onnxruntime/pull/17920
- https://github.com/huggingface/transformers/pull/26162
- https://github.com/huggingface/optimum/pull/1257
- https://github.com/huggingface/optimum/pull/1289
- https://github.com/huggingface/optimum/pull/1462

### New TorchDynamo Exporter (experimental stage)

This PR uses changes from the following issues and PRs to begin
supporting the [new TorchDynamo
exporter](https://pytorch.org/docs/stable/onnx.html#torchdynamo-based-onnx-exporter):
- https://github.com/huggingface/transformers/pull/26307
- https://github.com/pytorch/pytorch/issues/104903
- https://github.com/pytorch/pytorch/pull/105040
- https://github.com/microsoft/onnxscript/pull/847
- https://github.com/microsoft/onnxscript/pull/862
- https://github.com/microsoft/onnxscript/issues/493
2023-10-23 13:00:56 -07:00
Yufeng Li
11af34440a
Add MatMul 4bits support on GPU (#17890)
### Description
<!-- Describe your changes. -->
Add a contrib op MatMulNBits and related toolchain to support
quantization on weight. This PR only adds support for 4bits. It:

- add schema for contrib op MatMulNBits which can support 1-7 bits
quantization on weight.
- a naive implementation for 4bits MatMulNBits on CPU and GPU, i.e.,
implemented like MatMul(A, Dequantize(B)).
- a special implementation for GemV for 4bits MatMulNBits and related
benchmark tool
- tool to quantization model with 4bits. 

Next:
- add general and more efficient kernels for 4bits MatMulNBits on CPU
and GPU
2023-10-13 16:55:30 -07:00
Zhang Lei
762703e037
Support output cross qk, dtw and more for whisper model (#17500)
Support cross qk in beam search for whisper model and related features
Make whisper exporting tools support cross qk and some related features,
* extra_decoding_ids
* no_speech_prob

Implement DTW kernel, unfold tensor kernel with unit test Several fix
related with multiple session running parallel, like:

* guard multihead_attention, fused_fp16_runner_
* some memory allocation with stream awareness
* add use_ep_level_unified_stream option
2023-10-13 11:47:15 -07:00
pengwa
63dc5dc1a9
Add document for PythonOp (#17888)
### Add document for PythonOp



https://github.com/microsoft/onnxruntime/blob/pengwa/pythonop_doc/docs/ORTModule_PythonOp_Notes.md



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-10-12 08:36:22 +08:00
aciddelgado
406cd324e0
[CUDA] GroupQueryAttention operator using FlashAttention (#17674)
### Description
Added Group Query Attention op, supporting integer multiple number of
heads for Q / KV. As of now, this op can only use FlashAttention kernel,
meaning it only supports sm>=80 on Linux.

Results from onnxruntime/test/python/transformers/benchmark_gqa.py show
an on-average ~37% speed-up over Decoder Masked Multi-Head Attention,
with even greater improvements for long past sequence lengths.

```
op      batch   s_kv    heads   h_dim   ms      TFLOPS
gqa     16      2048    8       32      0.34    0.10
dmmha   16      2048    8       32      0.39    0.09
---------
gqa     16      2048    8       64      0.45    0.15
dmmha   16      2048    8       64      0.61    0.11
---------
gqa     16      2048    8       128     0.54    0.25
dmmha   16      2048    8       128     0.83    0.16
---------
gqa     16      2048    16      32      0.45    0.15
dmmha   16      2048    16      32      0.69    0.10
---------
gqa     16      2048    16      64      0.69    0.19
dmmha   16      2048    16      64      0.83    0.16
---------
gqa     16      2048    16      128     0.71    0.38
dmmha   16      2048    16      128     1.28    0.21
---------
gqa     16      2048    32      32      0.58    0.23
dmmha   16      2048    32      32      0.77    0.17
---------
gqa     16      2048    32      64      0.58    0.46
dmmha   16      2048    32      64      1.25    0.21
---------
gqa     16      2048    32      128     0.76    0.71
dmmha   16      2048    32      128     2.15    0.25
---------
gqa     16      2048    64      32      0.68    0.39
dmmha   16      2048    64      32      1.23    0.22
---------
gqa     16      2048    64      64      0.77    0.70
dmmha   16      2048    64      64      2.11    0.25
---------
gqa     16      2048    64      128     1.10    0.97
dmmha   16      2048    64      128     4.06    0.26
---------
gqa     16      2048    128     32      1.00    0.54
dmmha   16      2048    128     32      2.09    0.26
---------
gqa     16      2048    128     64      1.10    0.97
dmmha   16      2048    128     64      4.08    0.26
```


### Motivation and Context
As of now, this op is targeted for use on LLama models, as it supports
kv-caching and different number of heads for Q and KV (Grouped Query
Attention). We plan to add support for more platforms, input formats,
etc. in the future.

---------

Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: tlwu@microsoft.com <tlwu@a100.crj0ad2y1kku1j4yxl4sj10o4e.gx.internal.cloudapp.net>
2023-10-09 12:43:12 -07:00
kyoshisuki
ba72bb6f98
Fix a typo in ABI_Dev_Notes.md (#17832) 2023-10-09 07:51:34 -07:00
Hector Li
385fab5bae
[QNN EP] Qnn cache improvement (#17757)
### Description
Improve the QNN context binary cache feature to reduce the memory
overhead and initialization time overhead.
Instead of dumping a Qnn context binary file with metadata as header, we
dump a Onnx format file with metadata inside Onnx node.

### Motivation and Context
 reduce the memory overhead and initialization time overhead
2023-10-06 15:56:33 -07:00
liqun Fu
2be4dc6d04
ONNX 1.15 integration (#17125)
### Description
this is for ORT 1.17.0 - make ORT to use ONNX release 1.15.0 branch. Eventually will update to the release tag once ONNX 1.15.0 is released


### Motivation and Context
Prepare for ORT 1.17.0 release. People can start work on new and updated ONNX ops in ORT.
---------

Signed-off-by: Liqun Fu <liqfu@microsoft.com>
2023-09-26 14:44:48 -07:00
Nicolò Lucchesi
4ab0e17fe8
[Technical docs] Fixed a couple of old links in FAQ.md (#17415)
### Description
Updated a couple of old links in the technical documentation that where
pointing to files present prior to the migration to
https://onnxruntime.ai/docs.
2023-09-26 13:38:24 -07:00
Vincent Wang
e6301eee6a
Bump Up Version to 1.17.0 (#17587)
Bump up version to 1.17.0 as the 1.16.0 release branch had been branched
out.
2023-09-20 11:02:58 +08:00
Adrian Lizarraga
dea425e7c1
[QNN/CPU EP] Add 16-bit Quantize/Dequantize contrib ops (#17015)
### Description
- Adds 16-bit integer support to:
- Quantization kernel implementations: Intel, Neon, and Power intrinsics
  - DequantizeLinear and QuantizeLinear contrib ops
  - QNN EP Quantize and Dequantize operators
  - Python quantization scripts
- Disables QDQ fusions for most 16-bit QDQ node groups (need to add
16-bit support to QLinear* ops)
- Retains support for dropping QDQ nodes from Split, Gather, Reshape,
Transpose, Squeeze, and Unsqueeze node groups.

Sample python code to generate QDQ model with 16-bit activations and
8-bit weights:
```python
    quantize_static(
        input_model_path,
        output_model_path,
        data_reader,
        quant_format=args.quant_format,
        per_channel=args.per_channel,
        activation_type=QuantType.QUInt16,
        weight_type=QuantType.QUInt8,
        extra_options={"DedicatedQDQPair": True, "ForceQuantizeNoInputCheck": True, "UseQDQContribOps": True},
    )
``` 

Note that enabling the `UseQDQContribOps` extra option is not strictly
necessary. If the 16bit types are used without enabling
`UseQDQContribOps`, the QDQ ops domains are overridden to
'com.microsoft', and a warning is printed to stdout.

### Automated Tests
MLAS/CPU EP:
- [x] 16-bit QuantizeLinear computation
- [x] 16-bit DequantizeLinear computation

Optimizer:
- [x] Transpose QDQ fusion
- [x] Gather QDQ fusion
- [x] Reshape QDQ fusion
- [x] Squeeze QDQ fusion
- [x] Unsqueeze QDQ fusion
- [x] Split drop QDQ
- [x] DoubleQDQPairRemover 
- [x] Transpose optimization
- [x] EnsureUniqueDQForNodeUnit
- [x] Common subexpression elimination (DQ not removed)
- [x] Constant folding

QNN EP:
- [x] Conv 16-bit activations, 8-bit weights
- [x] MatMul 16-bit activations, 8-bit weights
- [x] Unary 16-bit QDQ ops
- [x] Binary 16-bit QDQ ops

Quantization tool:
- [x] Test creation of 16-bit QDQ model
### Motivation and Context
Support mixed precision (8bit weights, 16bit activations) models.

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-09-18 09:43:34 -07:00
Nat Kershaw (MSFT)
a2fba28f6c
Remove extraneous javascript includes (#17558) 2023-09-14 20:43:24 -07:00
Nat Kershaw (MSFT)
bbcf4b45dc
Upgrade doxygen to 1.9.8 (#17525) 2023-09-12 20:44:27 -07:00
Baiju Meswani
5d2c57363f
Sign CUDA Kernel (#17293) 2023-08-28 21:03:58 -07:00
Adrian Lizarraga
5a83a67f32
Support QDQ transformations with com.microsoft.Quantize/Dequantize ops (#17127)
### Description
- Enables int32 support for com.microsoft.DequantizeLinear (contrib op)
- Makes the `zero_point` input optional for Quantize/Dequantize contrib
ops
- Enables QDQ transformations with the Quantize/Dequantize contrib ops
- Update tests: EnsureUniqueDQForNodeUnitTests, QDQTransformerTests,
TransposeOptimizerTests

### Testing
List of tested graph transformations:
- [x] QDQSelectorActionTransformer
  - qdq_transformer_test.cc
- [x] QDQS8ToU8Transformer
  - qdq_transformer_test.cc
- [x] DoubleQDQPairsRemover
  - qdq_transformer_test.cc
- [x] IdenticalChildrenConsolidation
  - qdq_transformer_test.cc
- [x] QDQPropagation
  - qdq_transformer_test.cc
- [x] QDQFinalCleanup
  - qdq_transformer_test.cc
- [x] CliQuantFusion
  - qdq_transformer_test.cc
- [x] ReluQuantFusion
  - qdq_transformer_test.cc
- [x] EnsureUniqueDQForNodeUnit 
  - ensure_unique_dq_for_node_unit_test.cc
- [x] TransposeOptimizer 
  - transpose_optimizer_test.cc
- [x] CommonSubexpressionElimination
  - graph_transform_test.cc
- [x] ConstantFolding
  - graph_transform_test.cc

### Motivation and Context
We need to [support mixed 16-bit/8-bit precision QDQ
models](https://github.com/microsoft/onnxruntime/pull/17015). This PR is
the first step in achieving this goal: we need to make QDQ contrib ops
work with our optimizations/transformations.

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
2023-08-25 09:57:51 -07:00
pengwa
d90afc697b
Introduce ZeROOffloadSubscriber for ORTModule (#17006)
### Introduce ZeROOffloadSubscriber for ORTModule

As part of the work: integrate ORTModule with DeepSpeed stage3, this PR
mainly focus on moving original PyTorch-based (leveraging hooks) param
partition/offload implementation to ORTModule compatible implementation.

Changes include:
1. Refactor `SubscriberBase`/`SubcriberManager` to support
pre-forward/post_forward hooks.
2. Implement new `ZeROOffloadSubscriber` by re-using DeepSpeed hook
function as much as possible. Since all hook functions are defined in
`DeepSpeedZeRoOffload._register_hooks_recursively` and
`DeepSpeedZeRoOffload.setup_zero_stage3_hooks`, and the good thing is,
the closure is not complex, all hooks are referencing the owning
`DeepSpeedZeRoOffload` instance, so we can create new hook function with
`FunctionType` by binding the owning `DeepSpeedZeRoOffload` instance,
then call the new created function in subscriber's
`pre_forward_module_apply_impl` and `post_forward_module_apply_impl`
interfaces.
3. Monkey patch `DeepSpeedZeRoOffload.setup_zero_stage3_hooks` to
register the `ZeROOffloadSubscriber` for the model, then we don't need
change any code on the DeepSpeed repo (at least so far).
4. Fix the ATen embedding custom symbolic exporter function by
tolerating weights size be (0) (changed by DeepSpeed zero stage 3).

UT will be added once stage3 is fully supported. 

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-08-25 00:15:22 +08:00
Emmanuel Ferdman
08ca624d2b
Fix: update hyperlinks to the Jupyter notebooks (#16145)
### Description
<!-- Describe your changes. -->

This PR fixes broken hyperlinks in the documentation that should lead
users to Jupyter notebooks. Currently, the hyperlinks are not working as
intended. The PR resolves this issue by updating the hyperlinks to
correctly direct users to the Jupyter notebooks.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve? -->

It fixes broken hyperlinks leading to the Jupyter notebooks.
2023-08-21 09:53:05 -07:00
Wenbing Li
d052c8a45c
Remove the extensions submodule (#17097)
### Description
Remove the onnxruntime-extensions submodule since it now was used via
cmake FetchContent


### Motivation and Context
The submodule relies on an outdated version of the extensions, and the
build instructions should be updated to eliminate any confusion.
2023-08-14 10:16:33 -07:00