Commit graph

137 commits

Author SHA1 Message Date
wejoncy
7e613ee821
[quant] supports act_order inputs in Matmulnbits and new quantization algorithm "hqq" (#19106)
### Description
1. Support quantized GPTQ weights from Hugging Face, such as
[TheBloke/Llama-2-7B-Chat-GPTQ](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GPTQ)
2. Support act_order for GPTQ
3. Support the [HQQ](https://mobiusml.github.io/hqq_blog/) algorithm for
quantizing MatMul weights, and add a quantization script



### Motivation and Context
2024-03-05 11:45:45 +08:00
raoanag
27b1dc91ab
[DML] MatrixMultiplyIntegerToFloat (#19608)
### Description
DML Implementation for
[com.microsoft.MatMulIntegerToFloat](https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md#com.microsoft.MatMulIntegerToFloat)

```
.\onnxruntime_test_all.exe --gtest_filter="*MatMulIntegerToFloat.*"
Note: Google Test filter = *MatMulIntegerToFloat.*
[==========] Running 22 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 22 tests from MatMulIntegerToFloat
[ RUN      ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_S8S8
[       OK ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_S8S8 (620 ms)
[ RUN      ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_S8S8
[       OK ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_S8S8 (497 ms)
[ RUN      ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_S8S8
[       OK ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_S8S8 (488 ms)
[ RUN      ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_S8S8
[       OK ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_S8S8 (503 ms)
[ RUN      ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_U8U8
[       OK ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_U8U8 (495 ms)
[ RUN      ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_U8U8
[       OK ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_U8U8 (488 ms)
[ RUN      ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_U8U8
[       OK ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_U8U8 (492 ms)
[ RUN      ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_U8X8
[       OK ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_U8X8 (502 ms)
[ RUN      ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_S8U8
[       OK ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_S8U8 (452 ms)
[ RUN      ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_S8U8
[       OK ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_S8U8 (454 ms)
[ RUN      ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_S8U8
[       OK ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_S8U8 (446 ms)
[ RUN      ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_S8U8
[       OK ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_S8U8 (508 ms)
[ RUN      ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_U8S8
[       OK ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_U8S8 (456 ms)
[ RUN      ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_U8S8
[       OK ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_U8S8 (455 ms)
[ RUN      ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_U8S8
[       OK ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_U8S8 (447 ms)
[ RUN      ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_U8S8
[       OK ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_U8S8 (465 ms)
[ RUN      ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_U8U8
[       OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_U8U8 (111 ms)
[ RUN      ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_U8S8
[       OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_U8S8 (115 ms)
[ RUN      ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_S8S8
[       OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_S8S8 (114 ms)
[ RUN      ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_S8U8
[       OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_S8U8 (110 ms)
[ RUN      ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16
[       OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16 (112 ms)
[ RUN      ] MatMulIntegerToFloat.MatMulInteger_With_ZeroPoint
[       OK ] MatMulIntegerToFloat.MatMulInteger_With_ZeroPoint (337 ms)
[----------] 22 tests from MatMulIntegerToFloat (8679 ms total)

[----------] Global test environment tear-down
[==========] 22 tests from 1 test suite ran. (8680 ms total)
[  PASSED  ] 22 tests.
memleakdbg:
----- No memory leaks detected -----
```


### Motivation and Context
* Use `CalculateMatMulIntegerToFloat` to replace the CPU EP run as the reference
* Added more FP32 test cases to isolate all input datatype combinations
* Added fixed input to the `MatMulIntegerToFloat_FP16*` test cases
* `onnxruntime/test/testdata/matmul_integer_to_float.py` is capable of
generating FP16 models, but we do not produce any for now
2024-03-04 11:55:35 -08:00
kunal-vaishnavi
44d8ad93b2
Whisper Timestamps and Temperature (#19509)
### Description
This PR updates exporting and running the Whisper model with beam search
by adding the following.

- Adds temperature as a graph input to the exported model
- Fixes the token ids by adding them as attributes to
`WhisperBeamSearch`
- Fixes the timestamps test cases so they pass now
- Fixes a bug with invoking `torch.onnx.export`
- Cleans up the Whisper scripts and groups the arguments in
`convert_to_onnx.py`
- Adds a `requirements.txt` file to specify package dependencies
- Adds `whisper-large-v3` to list of pretrained models
- Fixes a bug with missing cross-attention KV cache inputs in the
decoder subgraph

### Motivation and Context

- This is a follow-up to [this
PR](https://github.com/microsoft/onnxruntime/pull/19188).
- The incorrect token ids in the timestamps processor were first noticed
during [this PR
review](https://github.com/microsoft/onnxruntime/pull/17500#discussion_r1333520007).
When they were originally added in [this
PR](https://github.com/microsoft/onnxruntime/pull/15853), the offsets
were constant across the Whisper model sizes. When comparing
the new `whisper-large-v3` variant, the English-only variants (e.g.
`whisper-tiny.en`), and the original variants (e.g. `whisper-tiny`),
both the values and the offsets differ. Therefore, it is easier to set
the token ids as attributes to `WhisperBeamSearch` when exporting to
ensure the right values are used in the timestamps processor.
- The Hugging Face API for returning timestamps and the expected outputs
from the PyTorch model have both changed.
- The fix for `torch.onnx.export` is a follow-up to [this PR
review](https://github.com/microsoft/onnxruntime/pull/17179#issuecomment-1683001470).
- The argument grouping is a follow-up to [this PR
review](https://github.com/microsoft/onnxruntime/pull/17500#discussion_r1333521721).
- Specific package versions are needed to run the Whisper scripts and
the `requirements.txt` file ensures that these versions are installed.
- The `whisper-large-v3` variant is released and should be in the list
of official pretrained models.
- After the changes from [this
PR](https://github.com/microsoft/onnxruntime/pull/17316), the exported
model is not loading in an ORT inference session because the
cross-attention KV cache inputs are missing in the decoder subgraph.
2024-02-16 15:21:43 -08:00
aciddelgado
cbb29d80ff
GQA Rotary and Packed QKV with Flash (#18906)
### Description
These changes add rotary embedding and packed QKV input to GQA. As of
now, the changes are only supported with Flash-Attention (SM >= 80) but
should soon be supported with Memory Efficient Attention as well.



### Motivation and Context
With the fusion of rotary embedding into this Attention op, we hope to
observe some perf gain. The packed QKV should also provide some perf
gain in the context of certain models, like Llama2, that would benefit
from running ops on the fused QKV matrix, rather than the separate Q, K,
and V.
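As a rough illustration of what a packed QKV input looks like, the sketch below splits one token's fused vector into its Q, K, and V slices. The layout (all Q heads, then all K heads, then all V heads) and the helper name `split_packed_qkv` are assumptions made for this example, not details taken from the PR.

```python
def split_packed_qkv(packed, num_heads, kv_num_heads, head_size):
    """Split one token's packed QKV vector into (q, k, v) lists.

    Assumed (hypothetical) layout: [Q | K | V], where
    len(Q) = num_heads * head_size and len(K) = len(V) = kv_num_heads * head_size.
    """
    q_size = num_heads * head_size
    kv_size = kv_num_heads * head_size
    if len(packed) != q_size + 2 * kv_size:
        raise ValueError("packed vector has unexpected length")
    q = packed[:q_size]
    k = packed[q_size:q_size + kv_size]
    v = packed[q_size + kv_size:]
    return q, k, v


# Example: 4 query heads sharing 2 KV heads, head_size 2 -> 8 + 4 + 4 values.
packed = list(range(16))
q, k, v = split_packed_qkv(packed, num_heads=4, kv_num_heads=2, head_size=2)
```

With a fused input, the op receives one tensor instead of three, which is where the expected gain for models such as Llama2 would come from.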

---------

Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>
2024-01-23 16:34:26 -08:00
petermcaughan
f53068446e
Add Temperature to WhisperBeamSearch input (#19188)
### Description
Add `temperature` as an input to WhisperBeamSearch op and initialize
correctly in parameter setup.


### Motivation and Context
Currently, temperature is included as an attribute of the BeamSearch op,
which prevents it from changing within a single inference session. With
temperature as an input, the value can be altered on any inference call
(important for 1P teams).
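To show why a per-call temperature matters, here is a minimal sketch of generic temperature scaling applied to logits before softmax. It is not the op's kernel; `softmax_with_temperature` is a hypothetical helper used only for illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Generic temperature scaling: softmax(logits / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, temperature=0.5)  # more peaked
flat = softmax_with_temperature(logits, temperature=2.0)   # more uniform
```

A lower temperature concentrates probability on the top token, and a higher one spreads it out, so being able to change the value between calls changes decoding behavior without re-exporting the model.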

---------

Co-authored-by: Peter McAughan <petermca@microsoft.com>
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
Co-authored-by: Kunal Vaishnavi <kvaishnavi@microsoft.com>
2024-01-23 13:44:34 -08:00
Ye Wang
21034a2c37
phi2 contrib ops changes (#19112)
### Description
1. support causal mask in MHA cpu
2. support custom rotary_dim in rotary_emb
3. add bf16 for rotary_emb
4. fix a bug in attention rotary


### Motivation and Context
2024-01-22 10:17:11 -08:00
Chi Lo
46dd0d3f52
[TensorRT EP] Load precompiled TRT engine file directly (#18217)
When the TRT engine cache (precompiled engine) is present, it doesn't
make sense to go over the processes of model verification, model
optimization, TRT EP's GetCapability(), TRT EP's model proto
reconstruction, calling TRT parser and engine compilation.
This PR makes TRT EP skip those processes and directly load the engine
to perform inference.

The feature request:
https://github.com/microsoft/onnxruntime/issues/18072

Features:

- Replace original model with TRT engine wrapped ONNX model. It can save
a lot of time as mentioned above.

- How to get the TRT engine wrapped ONNX model?
  1. Set the `trt_dump_ep_context_model` provider option to "true" and run
     inference. You will find "xxx_wrapper.onnx" at the engine cache
     path (the same logic as for generating the engine cache).
  2. Use `gen_trt_engine_wrapper_onnx_model.py`.

- Three provider options are added:
  - `trt_dump_ep_context_model`: Enables dumping the wrapped ONNX model by TRT EP.
  - `trt_ep_context_embed_mode`: Adds embed_mode as an attribute. 0 means engine
    cache path, 1 means engine binary data.
  - `trt_ep_context_compute_capability_enable`: Adds hardware_arch as an
    attribute. When running the model, TRT EP will check consistency between
    the model's hardware_arch and the GPU's compute capability.

- When the engine cache path is given in the wrapped model, TRT EP will
first search for the engine file using the path relative to the model
path. If it can't find it, it will use the path as it is
(depending on the user, this could be relative to the working dir or an absolute path).
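
The lookup order described in the last item can be sketched as follows. This is only an illustration of the described behavior, not the TRT EP source; `resolve_engine_path` and the file names are hypothetical.

```python
import tempfile
from pathlib import Path


def resolve_engine_path(model_path, engine_cache_path):
    """Illustrative lookup order for the engine file (hypothetical helper).

    1. Try engine_cache_path relative to the directory holding the model.
    2. Otherwise return the path as given (relative to the working
       directory, or absolute).
    """
    candidate = Path(model_path).parent / engine_cache_path
    if candidate.is_file():
        return candidate
    return Path(engine_cache_path)


with tempfile.TemporaryDirectory() as tmp:
    model = Path(tmp) / "model_wrapper.onnx"
    engine = Path(tmp) / "cache.engine"
    engine.write_bytes(b"")  # stand-in for a real engine file
    found = resolve_engine_path(model, "cache.engine")
    fallback = resolve_engine_path(model, "other.engine")
```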

Note: 

1. This PR includes the change of
https://github.com/microsoft/onnxruntime/pull/17751


Constraints:

1. The whole model should be fully supported by TRT.
2. Users need to make sure the engine is built with min/max/opt
optimization profiles that are large enough to cover the range of all
inputs. TRT EP will simply fail and won't rebuild the engine if the
input shape is out of range during runtime.
2024-01-11 22:20:54 -08:00
Ye Wang
b6d82834d4
add bfp16 to gqa (#19095)
### Description



### Motivation and Context
2024-01-11 20:53:31 -08:00
Ye Wang
1c2dca95d8
pass rotary embedding to attention op (#18846)
### Description



### Motivation and Context
2024-01-02 20:38:33 -08:00
luoyu-intel
5f00bc9931
Integrate high-performance x64 gemm library to MLAS (#17669)
### Description
Improve MLAS to support high-performance x64 INT4 kernels



### Motivation and Context
1. improve LLM inference performance on Intel CPUs.
2. support more 4bit quantization types: nf4, fp4
3. support dynamic block size: block size aligned with kernel's tiling
size(e.g. 4 for VNNI kernel), per channel on N dimension
4. support most Intel ISAs: avx2, avx_vnni, avx512f, avx512_vnni,
amx_bf16, amx_int8, avx512_fp16
5. support MatMulNBits' data format

### Tasks
- [x] support block_size: 32, 128, -1(per channel)
- [x] get weight pack size without memory allocation
- [x] use ort's thread pool for parallelism
- [x] support ISAs: avx2, avx512f, avx_vnni, avx512_vnni, amx_int8

### Benchmark
Ubuntu 20.22 + Intel(R) Xeon(R) Platinum 8480+ 56 cores

Benchmark | Time | CPU | Iterations
-- | -- | -- | --
Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:4096/Threads:56/real_time | 47613 | 47401 | 12970
Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:4096/Threads:56/real_time | 6347792 | 6317562 | 109
Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:4096/Threads:56/real_time | 11814014 | 11757847 | 59
Q4GEMM_Jblas/Q4G128SymInt8/M:1/N:4096/K:4096/Threads:56/real_time | 50222 | 50031 | 13759
Q4GEMM_Jblas/Q4G128SymInt8/M:1024/N:4096/K:4096/Threads:56/real_time | 2038222 | 2028743 | 341
Q4GEMM_Jblas/Q4G128SymInt8/M:2048/N:4096/K:4096/Threads:56/real_time | 3792832 | 3774485 | 191
Q4GEMM_Jblas/Q4GPerNSymInt8/M:1/N:4096/K:4096/Threads:56/real_time | 58717 | 58501 | 11467
Q4GEMM_Jblas/Q4GPerNSymInt8/M:1024/N:4096/K:4096/Threads:56/real_time | 1360846 | 1354598 | 543
Q4GEMM_Jblas/Q4GPerNSymInt8/M:2048/N:4096/K:4096/Threads:56/real_time | 2564232 | 2551365 | 266
Q4GEMM_Jblas/Q4G32SymFp32/M:1/N:4096/K:4096/Threads:56/real_time | 57929 | 57694 | 12047
Q4GEMM_Jblas/Q4G32SymFp32/M:1024/N:4096/K:4096/Threads:56/real_time | 5495330 | 5465810 | 126
Q4GEMM_Jblas/Q4G32SymFp32/M:2048/N:4096/K:4096/Threads:56/real_time | 10676240 | 10617817 | 66
Q4GEMM_Jblas/Q4G128SymFp32/M:1/N:4096/K:4096/Threads:56/real_time | 68305 | 68047 | 10026
Q4GEMM_Jblas/Q4G128SymFp32/M:1024/N:4096/K:4096/Threads:56/real_time | 5504862 | 5476215 | 126
Q4GEMM_Jblas/Q4G128SymFp32/M:2048/N:4096/K:4096/Threads:56/real_time | 11758623 | 11697337 | 66
Q4GEMM_Jblas/Q4GPerNSymFp32/M:1/N:4096/K:4096/Threads:56/real_time | 67713 | 67451 | 10298
Q4GEMM_Jblas/Q4GPerNSymFp32/M:1024/N:4096/K:4096/Threads:56/real_time | 5508325 | 5480237 | 126
Q4GEMM_Jblas/Q4GPerNSymFp32/M:2048/N:4096/K:4096/Threads:56/real_time | 10738528 | 10681656 | 64
Q4GEMM_Jblas/Q4G32AsymFp32/M:1/N:4096/K:4096/Threads:56/real_time | 60708 | 60486 | 11321
Q4GEMM_Jblas/Q4G32AsymFp32/M:1024/N:4096/K:4096/Threads:56/real_time | 5523784 | 5495736 | 126
Q4GEMM_Jblas/Q4G32AsymFp32/M:2048/N:4096/K:4096/Threads:56/real_time | 10829633 | 10772161 | 67


Reference:

Benchmark | Time | CPU | Iterations
-- | -- | -- | --
Q4GEMM/Q4Sym/M:1/N:4096/K:4096/Threads:56/real_time | 53088 | 52911 | 13364
Q4GEMM/Q4Sym/M:1024/N:4096/K:4096/Threads:56/real_time | 6268981 | 6230335 | 110
Q4GEMM/Q4Sym/M:2048/N:4096/K:4096/Threads:56/real_time | 11701237 | 11632339 | 59

Win11+12900K 8 cores:

Benchmark | Time | CPU | Iterations
-- | -- | -- | --
Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:4096/Threads:8/real_time | 215976 | 211295 | 2884
Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:4096/Threads:8/real_time | 60960590 | 60937500 | 10
Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:4096/Threads:8/real_time | 1.18E+08 | 1.19E+08 | 5
Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:11008/K:4096/Threads:8/real_time | 470377 | 453059 | 1414
Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:11008/K:4096/Threads:8/real_time | 1.54E+08 | 1.53E+08 | 5
Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:11008/K:4096/Threads:8/real_time | 3.18E+08 | 3.13E+08 | 2
Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:11008/Threads:8/real_time | 569072 | 559398 | 1229
Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:11008/Threads:8/real_time | 1.54E+08 | 1.52E+08 | 4
Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:11008/Threads:8/real_time | 3.22E+08 | 3.28E+08 | 2
Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:11008/K:11008/Threads:8/real_time | 1486055 | 1473325 | 403
Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:11008/K:11008/Threads:8/real_time | 4.14E+08 | 4.14E+08 | 2
Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:11008/K:11008/Threads:8/real_time | 8.88E+08 | 8.59E+08 | 1

---------

Signed-off-by: Mengni Wang <mengni.wang@intel.com>
Co-authored-by: Mengni Wang <mengni.wang@intel.com>
2023-12-19 09:36:31 -08:00
Hector Li
9768a727e1
[QNN EP] Fix a bug that can't create context binary if the model has inputs/outputs with different data type (#18722)
Fix a bug that can't create context binary if the model has inputs/outputs with different data type

### Description
Update EPContext op schema to unblock nodes with different data type among inputs & outputs
2023-12-06 13:07:09 -08:00
Jambay Kinley
1af0681554
Bfloat16 support for MatMulBnb4, Training support bitsandbytes>=0.41.2 (#18484)
### Description
Add bfloat16 support for `MatMulBnb4` contrib op. This is useful for
QLoRA fine-tuning.
- On GPUs with SM80+ (A100, etc), it uses the native cuda bfloat16
dtype, `nv_bfloat16`. On other GPUs, it uses the onnxruntime `BFloat16`
type which uses float for compute.
- I have validated the op in a llama2-7b training scenario. The losses
match pytorch training and the training throughput is better.
- Cannot add a bfloat16 case in the op unit test since casting BFloat16
to and from float multiple times during the test causes the required
tolerances to be unachievable.

The custom autograd function exporter in onnxruntime-training is updated
to support the latest version of bitsandbytes. They changed how the
`quant_state` is stored.

### Motivation and Context
Enable QLoRA fine-tuning with bfloat16.
2023-11-20 09:52:58 -08:00
kailums
1a29460919
rope support 4D input tensor (#18454)
### Description

Change the RotaryEmbeddings op implementation to add support for a 4D input
tensor with shape [batch, num_heads, seq_len, head_size].

### Motivation and Context
The current RotaryEmbedding op only supports a 3D input tensor with shape
[batch, seq_len, hidden_size].

For the llamav2 model, when using FusionRotaryEmbeddings to fuse only the
RotaryEmbeddings op, there is a transpose operation for query and
key, so the input tensor of RotaryEmbeddings becomes 4D [batch,
num_heads, seq_len, head_size].

The current RotaryEmbeddings implementation can't support this scenario,
so it needs to support a 4D input tensor.
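
The relationship between the two layouts can be sketched with plain nested lists. This only illustrates the shape change (split `hidden_size` into heads, then move the head axis ahead of the sequence axis); the helper name `to_4d` is hypothetical and this is not the kernel code.

```python
def to_4d(x, num_heads):
    """[batch, seq_len, hidden_size] -> [batch, num_heads, seq_len, head_size].

    Assumes hidden_size is divisible by num_heads.
    """
    head_size = len(x[0][0]) // num_heads
    return [
        [
            [token[h * head_size:(h + 1) * head_size] for token in sequence]
            for h in range(num_heads)
        ]
        for sequence in x
    ]


# batch=1, seq_len=2, hidden_size=4, split into 2 heads of size 2.
x3d = [[[0, 1, 2, 3], [4, 5, 6, 7]]]
x4d = to_4d(x3d, num_heads=2)
```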
2023-11-17 20:38:15 +08:00
aciddelgado
adb56df2e8
Aciddelgado/gqa local (#18375)
### Description
Implement preliminary version of local (sliding window) attention.
Currently only supported by Flash Attention (sm >= 80, Linux). Currently
only supports sliding attention with a large cached kv.
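
The local (sliding window) pattern can be sketched as a boolean mask. This is a generic illustration, not the Flash Attention kernel; the helper name `sliding_window_mask` and the convention that the window includes the current token are assumptions.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query position i may attend to key position j.

    Causal and local: j <= i and i - j < window (assumed convention).
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = sliding_window_mask(seq_len=4, window=2)
```

Each query sees at most `window` keys, so the attention cost per token stays bounded even when the cached KV grows large.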



### Motivation and Context
This change enables running Mistral and other models that use sliding
window attention.
2023-11-16 15:01:06 -08:00
Ye Wang
f9af94009b
onboard MoE (#18279)
### Description
1. Introduce MoE CUDA op to ORT based on FT implementation.
2. Upgrade cutlass to 3.1.0 to avoid some build failures on Windows.
Remove patch file for cutlass 3.0.0.
3. Sharded MoE implementation will come with another PR

Limitation: `__CUDA_ARCH__` >= 700


### Motivation and Context
2023-11-14 16:48:51 -08:00
aciddelgado
3dece27f51
GQA Flash Attention with Attention Mask (#18283)
### Description
GQA now only works with Flash Attention with Attention Mask input,
allowing for batched input. Note: This PR disables Memory Efficient
Attention, only allowing the Flash Attention kernel to be used.



### Motivation and Context
Allows GQA to work with batched input.

---------

Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>
2023-11-07 17:47:51 -08:00
pengwa
c8e1038eab
Optimize 4bit Qlora training (#18131)
### Optimize 4bit Qlora training

Extend the existing `MatmulBnb4bit` op for use in training scenarios.

The PR includes the following changes:
1. Add special `torch.autograd.Function` export logic for
`bitsandbytes.autograd._functions.MatMul4Bit` that is preferred over the
common PythonOp exporter.
2. Add a `training_mode` optional attribute for op `MatmulBnb4bit`, which
helps skip some inference-specific logic in the implementation.
3. Add a `transB` optional attribute, which defaults to 1; setting it
to 0 is needed for backward usage.

Changing from `PythonOp` to this `MatmulBnb4bit` brings roughly 2.9%
throughput gains. The reason is that
`bitsandbytes.autograd._functions.MatMul4Bit` calls
`ctx.save_for_backward`, which needs an additional copy in
PythonOp; otherwise, the tensor might be released by ORT while the backward
op still references it.

Removing the clones also reduces peak memory consumption because
`bitsandbytes.autograd._functions.MatMul4Bit` saved tensors that are not
needed in the backward compute.
2023-11-02 09:46:11 -07:00
aciddelgado
178f7caaeb
GQA Memory Efficient Kernel (#17920)
Implement Cutlass Memory Efficient Attention Kernel into Group Query
Attention Operator.

### Motivation and Context
Before this change, Group Query Attention Operator was supported only by
Flash-Attention. While this is the most efficient kernel for the
operation, it only supports sm >= 80. Cutlass Memory Efficient Attention
Kernel supports sm >= 53, allowing us to support a broader range of GPU
hardware.
2023-11-01 20:04:22 -07:00
Tianlei Wu
95f053c652
[CUDA] Update GroupNorm and Add SkipGroupNorm (#18091)
* Add a new operator SkipGroupNorm to support skip and bias inputs.
* Update GroupNorm kernel to support the number of channels used in SD XL Refiner.
* Add epsilon in kernel
* Add parity and performance test script
* Remove many limitations including max batch size, max number of groups, c % cPerBlock ==0 etc.

### Motivation and Context

Update GroupNorm to support SD XL Refiner and beyond.
2023-10-31 10:27:20 -07:00
Xavier Dupré
b5f242e978
GemmFloat8 as a contrib ops (#16051)
### Description
Add support for Gemm with float 8 as a contrib op.

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>
Co-authored-by: Xavier Dupre <xadupre@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2023-10-27 14:33:55 +02:00
Jambay Kinley
d30d4d372a
Add MatMul FP4 and NF4 Support (#18066)
### Description
Add a contrib op MatMulBnb4 (FP4 and NF4) and related toolchain to
support quantization on weight.

This PR adds:
- schema for contrib op MatMulBnb4 which can support FP4 (4-bit floating
point) and NF4 (4-bit NormalFloat) quantization on weight.
- a naive implementation for MatMulBnb4 on CPU and GPU, i.e.,
implemented like MatMul(A, Dequantize(B)).
- a special implementation for GemV for MatMulBnb4 and related benchmark
tool.
- tool to quantize model to FP4 or NF4.
2023-10-25 15:34:58 -07:00
kunal-vaishnavi
2a17d5cf32
LLaMA Model Optimization (#18021)
### Description
This PR contains fusion-level and kernel-level optimizations for [Meta's
LLaMA-2](https://blogs.microsoft.com/blog/2023/07/18/microsoft-and-meta-expand-their-ai-partnership-with-llama-2-on-azure-and-windows/).

Some of the added optimizations include:

- SimplifiedLayerNorm changes
  - Fusions for multiple variants
- SkipSimplifiedLayerNorm changes
  - Kernel support for CPU
- Rotary embeddings (previously did not exist)
  - Fusions for multiple variants
  - CPU and CUDA kernels
  - Supports interleaving and non-interleaving in the same kernels
  - Optimized cache that requires half of its originally exported sizes
    - Reduced from `(max_sequence_length, head_size)` to
      `(max_sequence_length, head_size / 2)`
- Multi-head attention
  - Support for 2D and 3D attention masks
- Group query attention (for FP16 CUDA and INT4 CUDA)
  - Integration with flash attention v2 and past-present buffer sharing
  - Removes need for `attention_mask` input as it is supported in the
    kernel
- 4 bit quantization
  - `block_size` parameter is available for customizing
- Support the new changes for [Microsoft
version](https://github.com/microsoft/Llama-2-Onnx)
- Support combinations of the below variants (ex: export ORT version and
run with Optimum)

Supported variants of LLaMA-2 include:
- [ORT
version](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/python/tools/transformers/models/llama)
  - Produces one ONNX file that is already optimized (and quantized if
    requested)
  - Integrates with Optimum
- [Another Microsoft version](https://github.com/microsoft/Llama-2-Onnx)
  - Already exported and available off-the-shelf
  - Faster versions of those models will be uploaded there soon
- [Hugging Face version](https://huggingface.co/meta-llama)
  - Models that end with `-hf`
  - Some older and current versions of
    [`transformers`](https://github.com/huggingface/transformers) and
    [`optimum`](https://github.com/huggingface/optimum) that export the
    model to ONNX differently
  - Note that while some older versions are supported, it is recommended
    to use the latest package versions.

### Usage

To use the optimizations, please see `README.md` for details. Please
note the various `requirements.txt` files for the package versions
recommended in order to use these changes.

To run the ORT transformer optimizer separately, run the script as
follows:
```
$ cd onnxruntime/onnxruntime/python/tools/transformers/
$ python3 optimizer.py --input <filename>.onnx --output <filename>.onnx --model_type gpt2 --num_heads <number of attention heads> --hidden_size <attention hidden size> --use_external_data_format --opt_level 0
```

### Motivation and Context
This PR helps the following issues:
- https://github.com/microsoft/onnxruntime/issues/14997
- https://github.com/microsoft/onnxruntime/issues/16254
- https://github.com/microsoft/onnxruntime/issues/17681
- https://github.com/microsoft/onnxruntime/issues/17925
- https://github.com/microsoft/onnxruntime-inference-examples/issues/320

This PR uses changes from the following PRs:
- https://github.com/pytorch/pytorch/pull/104468
- https://github.com/pytorch/pytorch/pull/109759
- https://github.com/microsoft/onnxruntime/pull/17020
- https://github.com/microsoft/onnxruntime/pull/17674
- https://github.com/microsoft/onnxruntime/pull/17890
- https://github.com/microsoft/onnxruntime/pull/17920
- https://github.com/huggingface/transformers/pull/26162
- https://github.com/huggingface/optimum/pull/1257
- https://github.com/huggingface/optimum/pull/1289
- https://github.com/huggingface/optimum/pull/1462

### New TorchDynamo Exporter (experimental stage)

This PR uses changes from the following issues and PRs to begin
supporting the [new TorchDynamo
exporter](https://pytorch.org/docs/stable/onnx.html#torchdynamo-based-onnx-exporter):
- https://github.com/huggingface/transformers/pull/26307
- https://github.com/pytorch/pytorch/issues/104903
- https://github.com/pytorch/pytorch/pull/105040
- https://github.com/microsoft/onnxscript/pull/847
- https://github.com/microsoft/onnxscript/pull/862
- https://github.com/microsoft/onnxscript/issues/493
2023-10-23 13:00:56 -07:00
Yufeng Li
11af34440a
Add MatMul 4bits support on GPU (#17890)
### Description
Add a contrib op MatMulNBits and related toolchain to support
quantization on weight. This PR only adds support for 4 bits. It adds:

- a schema for the contrib op MatMulNBits, which can support 1-7 bit
quantization on weight.
- a naive implementation for 4-bit MatMulNBits on CPU and GPU, i.e.,
implemented like MatMul(A, Dequantize(B)).
- a special implementation for GemV for 4-bit MatMulNBits and a related
benchmark tool.
- a tool to quantize a model to 4 bits.
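
The "MatMul(A, Dequantize(B))" idea can be sketched in plain Python for a single row of A and a single column of B. The per-block min/scale scheme and all helper names below are assumptions for illustration; they are not the operator's actual packing format or kernels.

```python
def quantize_block(values):
    """Quantize one block to 4-bit codes (0..15) with a per-block scale and minimum."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 if hi > lo else 1.0  # 4 bits -> 16 levels
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_block(codes, scale, lo):
    """Map 4-bit codes back to approximate float weights."""
    return [lo + c * scale for c in codes]


def quantize_column(column, block_size):
    """Quantize a weight column block by block."""
    return [
        quantize_block(column[i:i + block_size])
        for i in range(0, len(column), block_size)
    ]


def dot_dequant(a_row, quantized_column):
    """Naive path: dequantize every block, then take a plain dot product."""
    weights = [w for block in quantized_column for w in dequantize_block(*block)]
    return sum(x * w for x, w in zip(a_row, weights))


column = [0.0, 0.5, 1.0, 1.5, -2.0, -1.0, 0.0, 1.0]
a_row = [1.0] * 8
quantized = quantize_column(column, block_size=4)
approx = dot_dequant(a_row, quantized)
exact = sum(x * w for x, w in zip(a_row, column))
```

The naive path pays for a full dequantization before the multiply, which is why more efficient fused kernels are listed as follow-up work.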

Next:
- add general and more efficient kernels for 4bits MatMulNBits on CPU
and GPU
2023-10-13 16:55:30 -07:00
Zhang Lei
762703e037
Support output cross qk, dtw and more for whisper model (#17500)
Support cross qk in beam search for the Whisper model, and related features.
Make the Whisper exporting tools support cross qk and some related features:
* extra_decoding_ids
* no_speech_prob

Implement a DTW kernel and an unfold tensor kernel, with unit tests.

Several fixes related to multiple sessions running in parallel, such as:

* guard multihead_attention, fused_fp16_runner_
* some memory allocation with stream awareness
* add use_ep_level_unified_stream option
2023-10-13 11:47:15 -07:00
aciddelgado
406cd324e0
[CUDA] GroupQueryAttention operator using FlashAttention (#17674)
### Description
Added Group Query Attention op, supporting a number of Q heads that is an
integer multiple of the number of KV heads. As of now, this op can only use
the FlashAttention kernel, meaning it only supports sm>=80 on Linux.
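
The head-count constraint can be illustrated with a small mapping from query heads to the KV heads they share. The grouping of consecutive query heads and the helper name `kv_head_for_query_head` are assumptions for this sketch, not details confirmed by the PR.

```python
def kv_head_for_query_head(q_head, num_heads, kv_num_heads):
    """Return the KV head index shared by a given query head.

    Assumes consecutive query heads form one group (hypothetical convention).
    """
    if num_heads % kv_num_heads != 0:
        raise ValueError("num_heads must be an integer multiple of kv_num_heads")
    group_size = num_heads // kv_num_heads
    return q_head // group_size


# 8 query heads sharing 2 KV heads: groups of 4.
mapping = [kv_head_for_query_head(h, num_heads=8, kv_num_heads=2) for h in range(8)]
```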

Results from onnxruntime/test/python/transformers/benchmark_gqa.py show
an on-average ~37% speed-up over Decoder Masked Multi-Head Attention,
with even greater improvements for long past sequence lengths.

```
op      batch   s_kv    heads   h_dim   ms      TFLOPS
gqa     16      2048    8       32      0.34    0.10
dmmha   16      2048    8       32      0.39    0.09
---------
gqa     16      2048    8       64      0.45    0.15
dmmha   16      2048    8       64      0.61    0.11
---------
gqa     16      2048    8       128     0.54    0.25
dmmha   16      2048    8       128     0.83    0.16
---------
gqa     16      2048    16      32      0.45    0.15
dmmha   16      2048    16      32      0.69    0.10
---------
gqa     16      2048    16      64      0.69    0.19
dmmha   16      2048    16      64      0.83    0.16
---------
gqa     16      2048    16      128     0.71    0.38
dmmha   16      2048    16      128     1.28    0.21
---------
gqa     16      2048    32      32      0.58    0.23
dmmha   16      2048    32      32      0.77    0.17
---------
gqa     16      2048    32      64      0.58    0.46
dmmha   16      2048    32      64      1.25    0.21
---------
gqa     16      2048    32      128     0.76    0.71
dmmha   16      2048    32      128     2.15    0.25
---------
gqa     16      2048    64      32      0.68    0.39
dmmha   16      2048    64      32      1.23    0.22
---------
gqa     16      2048    64      64      0.77    0.70
dmmha   16      2048    64      64      2.11    0.25
---------
gqa     16      2048    64      128     1.10    0.97
dmmha   16      2048    64      128     4.06    0.26
---------
gqa     16      2048    128     32      1.00    0.54
dmmha   16      2048    128     32      2.09    0.26
---------
gqa     16      2048    128     64      1.10    0.97
dmmha   16      2048    128     64      4.08    0.26
```


### Motivation and Context
As of now, this op is targeted for use on LLaMA models, as it supports
KV caching and a different number of heads for Q and KV (Grouped Query
Attention). We plan to add support for more platforms, input formats,
etc. in the future.

---------

Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: tlwu@microsoft.com <tlwu@a100.crj0ad2y1kku1j4yxl4sj10o4e.gx.internal.cloudapp.net>
2023-10-09 12:43:12 -07:00
Hector Li
385fab5bae
[QNN EP] Qnn cache improvement (#17757)
### Description
Improve the QNN context binary cache feature to reduce the memory
overhead and initialization time overhead.
Instead of dumping a QNN context binary file with metadata as a header, we
dump an ONNX format file with the metadata inside an ONNX node.

### Motivation and Context
Reduce the memory overhead and initialization time overhead.
2023-10-06 15:56:33 -07:00
Adrian Lizarraga
dea425e7c1
[QNN/CPU EP] Add 16-bit Quantize/Dequantize contrib ops (#17015)
### Description
- Adds 16-bit integer support to:
  - Quantization kernel implementations: Intel, Neon, and Power intrinsics
  - DequantizeLinear and QuantizeLinear contrib ops
  - QNN EP Quantize and Dequantize operators
  - Python quantization scripts
- Disables QDQ fusions for most 16-bit QDQ node groups (need to add
16-bit support to QLinear* ops)
- Retains support for dropping QDQ nodes from Split, Gather, Reshape,
Transpose, Squeeze, and Unsqueeze node groups.

Sample python code to generate QDQ model with 16-bit activations and
8-bit weights:
```python
    quantize_static(
        input_model_path,
        output_model_path,
        data_reader,
        quant_format=args.quant_format,
        per_channel=args.per_channel,
        activation_type=QuantType.QUInt16,
        weight_type=QuantType.QUInt8,
        extra_options={"DedicatedQDQPair": True, "ForceQuantizeNoInputCheck": True, "UseQDQContribOps": True},
    )
``` 

Note that enabling the `UseQDQContribOps` extra option is not strictly
necessary. If the 16-bit types are used without enabling
`UseQDQContribOps`, the QDQ ops domains are overridden to
'com.microsoft', and a warning is printed to stdout.
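
As a rough illustration of what 16-bit quantization computes, the sketch below applies the standard ONNX QuantizeLinear formula, `saturate(round(x / scale) + zero_point)`, with the uint16 range. The function names are hypothetical and this is not the MLAS kernel from this PR, only a scalar reference under those assumptions.

```python
def quantize_linear_u16(x, scale, zero_point=0):
    """Quantize one float to uint16: saturate(round(x / scale) + zero_point)."""
    # Python's round() rounds halves to even, matching QuantizeLinear.
    q = round(x / scale) + zero_point
    # Saturate to the uint16 range [0, 65535].
    return max(0, min(65535, q))


def dequantize_linear_u16(q, scale, zero_point=0):
    """Map a uint16 value back to a float: (q - zero_point) * scale."""
    return (q - zero_point) * scale


# Hypothetical scale and zero point chosen for illustration only.
scale, zero_point = 0.125, 32768
q = quantize_linear_u16(0.25, scale, zero_point)
x_hat = dequantize_linear_u16(q, scale, zero_point)
```

With 16 bits there are 65536 representable levels instead of 256, which is why activations quantized this way lose less precision for the same value range.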

### Automated Tests
MLAS/CPU EP:
- [x] 16-bit QuantizeLinear computation
- [x] 16-bit DequantizeLinear computation

Optimizer:
- [x] Transpose QDQ fusion
- [x] Gather QDQ fusion
- [x] Reshape QDQ fusion
- [x] Squeeze QDQ fusion
- [x] Unsqueeze QDQ fusion
- [x] Split drop QDQ
- [x] DoubleQDQPairRemover 
- [x] Transpose optimization
- [x] EnsureUniqueDQForNodeUnit
- [x] Common subexpression elimination (DQ not removed)
- [x] Constant folding

QNN EP:
- [x] Conv 16-bit activations, 8-bit weights
- [x] MatMul 16-bit activations, 8-bit weights
- [x] Unary 16-bit QDQ ops
- [x] Binary 16-bit QDQ ops

Quantization tool:
- [x] Test creation of 16-bit QDQ model

### Motivation and Context
Support mixed-precision models (8-bit weights, 16-bit activations).

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-09-18 09:43:34 -07:00
Adrian Lizarraga
5a83a67f32
Support QDQ transformations with com.microsoft.Quantize/Dequantize ops (#17127)
### Description
- Enables int32 support for com.microsoft.DequantizeLinear (contrib op)
- Makes the `zero_point` input optional for Quantize/Dequantize contrib
ops
- Enables QDQ transformations with the Quantize/Dequantize contrib ops
- Update tests: EnsureUniqueDQForNodeUnitTests, QDQTransformerTests,
TransposeOptimizerTests

### Testing
List of tested graph transformations:
- [x] QDQSelectorActionTransformer
  - qdq_transformer_test.cc
- [x] QDQS8ToU8Transformer
  - qdq_transformer_test.cc
- [x] DoubleQDQPairsRemover
  - qdq_transformer_test.cc
- [x] IdenticalChildrenConsolidation
  - qdq_transformer_test.cc
- [x] QDQPropagation
  - qdq_transformer_test.cc
- [x] QDQFinalCleanup
  - qdq_transformer_test.cc
- [x] ClipQuantFusion
  - qdq_transformer_test.cc
- [x] ReluQuantFusion
  - qdq_transformer_test.cc
- [x] EnsureUniqueDQForNodeUnit 
  - ensure_unique_dq_for_node_unit_test.cc
- [x] TransposeOptimizer 
  - transpose_optimizer_test.cc
- [x] CommonSubexpressionElimination
  - graph_transform_test.cc
- [x] ConstantFolding
  - graph_transform_test.cc

### Motivation and Context
We need to [support mixed 16-bit/8-bit precision QDQ
models](https://github.com/microsoft/onnxruntime/pull/17015). This PR is
the first step in achieving this goal: we need to make QDQ contrib ops
work with our optimizations/transformations.

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
2023-08-25 09:57:51 -07:00
Chen Fu
3c10f027de
4b quantization for weights of LLMs (#16833)
### Description
Block-wise 4-bit quantization for LLMs.
1. Introduces 4-bit block-wise quantization for linear layer weights.
2. Implements a matrix multiplication kernel for fp32 x int4.
3. Implements a special operator, MatMulFpQ4.
4. Implements a quantization tool that converts a MatMul operator to
MatMulFpQ4 when the right-hand side is a 2D constant tensor.
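
To make the idea of block-wise quantization concrete, here is a minimal sketch of one possible symmetric scheme: each block of weights gets its own float scale, and values are rounded into the signed 4-bit range [-8, 7]. The block size of 32, the `max_abs / 7` scale rule, and the function names are assumptions for illustration; the actual MLAS layout and packing used by this PR may differ.

```python
def quantize_blockwise_4bit(weights, block_size=32):
    """Symmetric 4-bit quantization with one float scale per block."""
    blocks = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(w) for w in block)
        # Map the largest magnitude in the block to 7; avoid dividing by zero.
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        quantized = [max(-8, min(7, round(w / scale))) for w in block]
        blocks.append((scale, quantized))
    return blocks


def dequantize_blockwise_4bit(blocks):
    """Reconstruct approximate float weights from (scale, values) blocks."""
    return [scale * v for scale, values in blocks for v in values]


# Illustrative data only: 64 evenly spaced weights, split into two blocks.
weights = [i * 0.1 - 1.0 for i in range(64)]
blocks = quantize_blockwise_4bit(weights, block_size=32)
restored = dequantize_blockwise_4bit(blocks)
```

Keeping a separate scale per block limits the damage an outlier weight can do: it only coarsens the resolution of its own block rather than of the whole matrix.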


### Motivation and Context
Compress and accelerate LLMs

|Benchmark | Time(ns)|
|-------------|----------|
|Q4GEMM/Q4Sym/M:1/N:4096/K:4096/Threads:8| 218054|
|Q4GEMM/Q4Sym/M:1024/N:4096/K:4096/Threads:8| 35830155|
|Q4GEMM/Q4Sym/M:2048/N:4096/K:4096/Threads:8| 73479790|
|Q4GEMM/Q4Zp8/M:1/N:4096/K:4096/Threads:8| 270152|
|Q4GEMM/Q4Zp8/M:1024/N:4096/K:4096/Threads:8| 35826721|
|Q4GEMM/Q4Zp8/M:2048/N:4096/K:4096/Threads:8| 73021200|
|Q4GEMM/Q4Sym128/M:1/N:4096/K:4096/Threads:8| 213832|
|Q4GEMM/Q4Sym128/M:1024/N:4096/K:4096/Threads:8| 36749874|
|Q4GEMM/Q4Sym128/M:2048/N:4096/K:4096/Threads:8| 72618120|


|Benchmark | Time(ns)|
|-------------|----------|
|SGEMM/LLM/M:1/N:4096/K:4096/Threads:8|   522610|
|SGEMM/LLM/M:1024/N:4096/K:4096/Threads:8| 39237689|
|SGEMM/LLM/M:2048/N:4096/K:4096/Threads:8| 75983467|

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-08-07 12:23:55 -07:00
Khalia Spear
4e6ea730d6
Broadcasting for SLN for CPU and CUDA (#16510)
### Description
Enhanced SkipLayerNorm by implementing broadcasting for both CPU and
CUDA



### Motivation and Context
The input and skip tensors no longer have to be the same size. The skip
tensor can have the same shape as the input, a shape of
{1, sequence_length, hidden_size}, or a shape of
{sequence_length, hidden_size}.
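
The accepted skip shapes can be summarized with a small check. This is only a sketch of the rule as described above; the function name is hypothetical and the real kernels perform their own validation.

```python
def skip_shape_is_accepted(input_shape, skip_shape):
    """Return True if skip_shape is one of the three shapes described above."""
    batch_size, sequence_length, hidden_size = input_shape
    accepted = [
        (batch_size, sequence_length, hidden_size),  # same shape as input
        (1, sequence_length, hidden_size),           # broadcast over batch
        (sequence_length, hidden_size),              # 2D, broadcast over batch
    ]
    return tuple(skip_shape) in accepted


# Illustrative shapes only.
input_shape = (4, 128, 768)
same = skip_shape_is_accepted(input_shape, (4, 128, 768))
batch_one = skip_shape_is_accepted(input_shape, (1, 128, 768))
two_dim = skip_shape_is_accepted(input_shape, (128, 768))
mismatch = skip_shape_is_accepted(input_shape, (2, 128, 768))
```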

---------

Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
2023-08-07 09:55:42 -07:00
Tianlei Wu
50bf310dea
[CUDA] RelativePositionBias supports input with padding removed (#16923)
update RelativePositionBias to support input with padding removed.
- [x] add bias transpose kernel
- [x] add test
- [x] update operator document
2023-08-01 16:39:09 -07:00
Tianlei Wu
1fbd1ed179
[CUDA] PackedMultiHeadAttention support Bias and separated Q, K and V inputs (#16913)
### Description
Follow-up change for PackedMultiHeadAttention added in
https://github.com/microsoft/onnxruntime/pull/16779:
- [x] Add Bias input
- [x] Add CUDA kernels to support separated query, key and values
inputs.
- [x] Update operator documents
- [x] Add unit tests
2023-08-01 15:30:41 -07:00
Tianlei Wu
742edec5e8
[CUDA] Add PackedMultiHeadAttention operator (#16779)
### Description
Add a new operator for MultiHeadAttention whose inputs have padding
removed. It only supports the packed QKV format.
2023-07-28 16:35:38 -07:00
Alexey Kamenev
7c05f7bab1
Fix IRFFT contrib op output dimension calculation (#15662)
### Description
Fixes the issue with IRFFT output dimension calculation as described in
#13236

### Motivation and Context
Please refer to #13236 for detailed description.

Specifically, [this code](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/contrib_ops/cuda/math/fft_ops.cc#L103) computes the output dimension as:
```
out_dim = in_dim * 2 - 1
```
while it should be this instead:
```
out_dim = 2 * (in_dim - 1)
```
(assuming the original signal has an even number of samples, of course).

For example, if the original signal has 4 samples, then the round trip should look something like:
```
4 -> (one-sided RFFT) -> 3 (complex) -> (one-sided IRFFT) -> 4
```
with the current code the output will be a signal with 5 points.
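
The two formulas can be compared directly. The helper names below are hypothetical; the sketch assumes a one-sided RFFT, which keeps `n // 2 + 1` complex values for a real signal of length `n`.

```python
def rfft_output_dim(signal_len):
    """Number of complex values kept by a one-sided RFFT."""
    return signal_len // 2 + 1


def irfft_output_dim_old(in_dim):
    """Formula used before the fix."""
    return in_dim * 2 - 1


def irfft_output_dim_fixed(in_dim):
    """Corrected formula, assuming an even-length original signal."""
    return 2 * (in_dim - 1)


complex_len = rfft_output_dim(4)               # 4 real samples -> 3 complex
old_len = irfft_output_dim_old(complex_len)     # 5, does not round-trip
new_len = irfft_output_dim_fixed(complex_len)   # 4, round-trips
```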

---------

Co-authored-by: Alexey Kamenev <akamenev@nvidia.com>
Co-authored-by: Nick Geneva <nicholasgeneva@gmail.com>
2023-07-28 15:52:37 -07:00
Patrice Vignola
649930142f
[DML EP] Add NCHW and float16 gamma/beta support for GroupNorm (#16814)
This removes transposes that are not needed in the DML kernel. To keep
backward compatibility, the default behavior is to use NHWC when no
attribute is set.
2023-07-25 21:43:29 -07:00
Ye Wang
dd7d721f3c
support rotary embeddings in decoder masked self-attention (#16556)
### Description
<!-- Describe your changes. -->

This PR adds support for rotary embeddings in decoder masked
self-attention

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-07-12 13:48:48 -07:00
Zhang Lei
0f8e66d905
optimization for whisper model with decoder masked multihead attention (#15827)
* graph tools update
* cuda kernel update
* operator spec update and implementation update
* greedy search bug fix for a wrong assumption about cross/self attention
input length
* avoid use of the "" name in value info when loading a graph, which
historically appears in many models
2023-05-18 15:38:31 -07:00
stevenlix
270c09a37f
Add timestamp logits processor for whisper (#15853)
Enable timestamp estimation and logits processing for Whisper model.
2023-05-16 21:40:00 -07:00
kunal-vaishnavi
5b663d6797
Whisper Multitask and Multilingual (#15936)
### Description
This PR enables Whisper's multitask format and allows a user to use
Whisper for multiple tasks (e.g. transcription, translation) and for
multilingual purposes (e.g. English, Spanish). This PR also removes
`attention_mask` as a required input for Whisper with beam search.

### Usage
Here is an example of how you can use Whisper for English transcription.
```python
import numpy as np
import onnxruntime as ort

from datasets import load_dataset
from transformers import AutoConfig, AutoProcessor

model = "openai/whisper-tiny"
config = AutoConfig.from_pretrained(model)
processor = AutoProcessor.from_pretrained(model)

forced_decoder_ids = processor.get_decoder_prompt_ids(language="english", task="transcribe")
# forced_decoder_ids is of the format [(1, 50259), (2, 50359), (3, 50363)] and needs to be 
# of the format [50258, 50259, 50359, 50363] where 50258 is the start token id
forced_decoder_ids = [config.decoder_start_token_id] + list(map(lambda token: token[1], forced_decoder_ids))

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
input_features = processor(ds[0]["audio"]["array"], return_tensors="np").input_features

inputs = {
  "input_features": np.float32(input_features),
  "max_length": np.array([26], dtype=np.int32),
  "min_length": np.array([1], dtype=np.int32),
  "num_beams": np.array([2], dtype=np.int32),
  "num_return_sequences": np.array([1], dtype=np.int32),
  "length_penalty": np.array([1.0], dtype=np.float32),
  "repetition_penalty": np.array([1.0], dtype=np.float32),
  "decoder_input_ids": np.array([forced_decoder_ids], dtype=np.int32),
}
sess = ort.InferenceSession("whisper-tiny_beamsearch.onnx", providers=["CPUExecutionProvider"])
outputs = sess.run(None, inputs)

# Print tokens and decoded output
print(outputs[0][0][0])
print(processor.decode(outputs[0][0][0]))
```

If you don't want to provide specific decoder input ids or you want
Whisper to predict the output language and task, you can set
`forced_decoder_ids = [config.decoder_start_token_id]` instead.

### Motivation and Context

As seen in the figure below from the [OpenAI Whisper
paper](https://cdn.openai.com/papers/whisper.pdf), Whisper can be used
for multiple tasks and languages.

![Screenshot 2023-05-12
165215](https://github.com/microsoft/onnxruntime/assets/115581922/49335e39-a79c-4f78-92e9-89b034405f65)
2023-05-15 14:36:33 -07:00
Ye Wang
3418ca28a8
pack qkv in t5 decoder (#15801)
### Description
<!-- Describe your changes. -->

V100, b_4_s_128, max_output_len=64, beam=4

before:
t5_small: 101.28ms
t5_base:  200.07ms

after:
t5_small: 87.65ms
t5_base: 174.44ms



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-05-15 13:45:39 -07:00
kunal-vaishnavi
39d6d7050d
Change EmbedLayerNormalization mask index output to optional (#15526)
### Description
This PR changes an EmbedLayerNormalization node's mask index output to
be an optional output if a mask input is not provided.



### Motivation and Context
The documentation for EmbedLayerNormalization states 
```
The last input mask is optional. If mask is provided, mask index (that is position of first 0 in mask, or number of words) will be calculated.
```
However, if the mask input is not provided, the mask index output is
still calculated and required.
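
For reference, the mask index described in the quoted documentation can be sketched as follows. The function name is hypothetical, and the sketch simply follows the wording above (position of the first 0, or the number of words when there is no 0); it is not the operator's implementation.

```python
def mask_index(mask_row):
    """Return the position of the first 0, or the row length if none exists."""
    for position, value in enumerate(mask_row):
        if value == 0:
            return position
    return len(mask_row)


# Illustrative masks only: one padded row and one unpadded row.
padded = mask_index([1, 1, 1, 0, 0])
unpadded = mask_index([1, 1, 1, 1])
```

When no mask input is given there is nothing to compute this from, which is why making the output optional matches the documented behavior.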
2023-04-27 16:32:42 -07:00
Patrice Vignola
3be5bfe363
[DML EP] Add MatMul + SoftMax fusion (#15240) 2023-04-11 08:31:04 -07:00
stevenlix
6d126f8996
Add FP16 support for Whisper model (#15427)
Current ORT can only run inference for Whisper FP32 model. This PR adds
FP16 support.
2023-04-08 21:36:10 -07:00
Chen Fu
8dce83a818
Fuse 'Add' operator into FP16 Conv (#15213)
### Description
Adds 'Add' functionality to the FP16 Conv operator. It takes a tensor
that has the same shape as the output tensor and adds it to the result
tensor.


### Motivation and Context
Needed to run Resnet 50
2023-04-07 09:51:03 -07:00
petermcaughan
1251964f96
Petermca/beamsearch whisper (#15339)
### Description
Adjust various code paths to allow Whisper model to function with
BeamSearch op.

Approach: Add a new kModelType enum value in IGenerationParameters as
so:
#### Old: 0 = GPT2, 1 = T5
#### New: 0 = GPT2, 1 = T5, 2 = Whisper

When the user assigns this attribute value to 2, various shape and type
checks are changed to accommodate Whisper inputs.


### Motivation and Context
BeamSearch is currently designed to function with BERT-based models with
inputs as vocab tokens, and needs changes to function with Whisper
inputs (3-D float values processed from audio data).

---------

Co-authored-by: Peter McAughan <petermca@microsoft.com>
2023-04-04 09:09:10 -07:00
Ye Wang
fbfe92f66a
DecoderMaskedMultiHeadAttention enhancement (#15292) 2023-04-02 21:53:03 -07:00
Yufeng Li
c08d6b42e8
Add tool to support packing mode for BERT model (#15283)
### Description
<!-- Describe your changes. -->
Add a tool to convert a fused BERT-like model to packing mode


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2023-03-31 08:46:47 -07:00
Ye Wang
44ba23e0f5
Rename DecoderMaskedMHA to DecoderMaskedSelfAttn (#15166)
### Description
<!-- Describe your changes. -->

As synced offline, rename this op. Another op will be created for MHA
that supports both self and cross attention.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-03-23 12:31:38 -07:00
Ye Wang
2ee822d483
Extend memory efficient attention coverage in Attention/MHA cuda op (#15064)
### Description
<!-- Describe your changes. -->

1. Upgrade cutlass to 3.0, which contains attn_bias support.
2. Extend Attention/MHA to use memory efficient attention when
rel_pos_bias with shape [1, num_head, s, s*] and a 1D mask with shape
[2 * batch_size + 1] are present.

New mask format introduced:
MASK_1D_KEY_SEQ_LEN_START,  
[3 * batch_size + 2] with [key_len[0], ..., key_len[batch_size - 1],
query_start[0], ..., query_start[batch_size - 1], query_end[batch_size -
1], key_start[0], ..., key_start[batch_size - 1], key_end[batch_size -
1]]

For example, the 2D mask [[1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 0]]
converts to the 1D mask [3, 5, 0, 6, 12, 0, 6, 12].
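
The conversion in this example can be sketched in a few lines. The function name is hypothetical, and the sketch assumes right-side padding and self-attention, so query and key share the same padded sequence length and therefore the same start and end offsets.

```python
def to_mask_1d_key_seq_len_start(mask_2d):
    """Convert a right-padded 2D mask to the [3 * batch_size + 2] 1D layout."""
    batch_size = len(mask_2d)
    padded_len = len(mask_2d[0])
    # Number of valid (unpadded) key tokens in each batch entry.
    key_len = [sum(row) for row in mask_2d]
    # Offset of each batch entry in the flattened padded buffer.
    starts = [i * padded_len for i in range(batch_size)]
    # End offset of the last batch entry.
    end = batch_size * padded_len
    # Layout: key_len, query_start..., query_end, key_start..., key_end.
    return key_len + starts + [end] + starts + [end]


mask_1d = to_mask_1d_key_seq_len_start(
    [[1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 0]]
)
# mask_1d == [3, 5, 0, 6, 12, 0, 6, 12]
```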


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

It potentially benefits tnlrv6 and t5(encoder)

---------

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
Co-authored-by: Kunal Vaishnavi <kvaishnavi@microsoft.com>
Co-authored-by: Kunal Vaishnavi <kvaishnavi@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2023-03-23 11:05:17 -07:00
Hariharan Seshadri
7033346605 Support mask_filter_value attribute in DecoderMaskedMultiheadAttention (#15158) 2023-03-23 11:00:09 -07:00