onnxruntime/cmake/external
Tianlei Wu fbc3927231
[CUDA] cuDNN Flash Attention (#21629)
### Description
- [x] Add cuDNN flash attention using cudnn frontend, and enable it in
MultiHeadAttention operator.
- [x] Support attention mask.
- [x] Support attention bias.
- [x] Update tests and benchmark script.

The cuDNN SDPA is disabled by default. To enable it, need the following:
(1) Requires cuDNN 9.3 or newer version installed.
(2) Set an environment variable `ORT_ENABLE_CUDNN_FLASH_ATTENTION=1` or
set `sdpa_kernel=8` cuda provider option to enable it.
(3) Only works on devices with compute capability >= 8.0.

Note that some combinations of parameters might be rejected due to
limited support of head dimension or sequence lengths.

Future Works:
(1) FP8 and BF16 APIs.  Currently, only API for FP16 are exposed.
(2) Add API to support ragged batching (padding removed in inputs).
(3) Support other input formats (like QKV_BS3NH).
(4) Currently, q are converted to BSNH, k/v are converted to either BSNH
or BNSH format. May do some experiment to see whether converting q to
BNSH could be better in some case.

### Example Benchmark Results on H100

The following tests are on FP16 MultiHeadAttention operator without
attention mask and attention bias.

#### Test Setting 1
batch_size | sequence_length | past_sequence_length | num_heads |
head_size
-- | -- | -- | -- | --
16 | 256 | 0 | 32 | 128

format | average_latency | tflops | kernel
-- | -- | -- | --
Q,K,V (BNSH) | 0.000075 | 229.5 | torch:flash
Q,K,V (BNSH) | 0.000119 | 144.8 | torch:efficient
Q,K,V (BNSH) | 0.000224 | 76.5 | torch:math
Q,K,V (BSNH) | 0.000075 | 227.8 | ort:cudnn
Q,K,V (BSNH) | 0.000094 | 182.8 | ort:flash
Q,K,V (BSNH) | 0.000138 | 124.7 | ort:efficient
Q,K,V (BSNH) | 0.000438 | 39.3 | ort:math
Q,KV | 0.000129 | 133.0 | ort:cudnn
Q,KV | 0.000151 | 114.1 | ort:flash
Q,KV | 0.000194 | 88.5 | ort:efficient
QKV | 0.000154 | 111.8 | ort:cudnn
QKV | 0.000175 | 98.0 | ort:flash
QKV | 0.000217 | 79.0 | ort:efficient

#### Test Setting 2

batch_size | sequence_length | past_sequence_length | num_heads |
head_size
-- | -- | -- | -- | --
16 | 512 | 0 | 16 | 64

format | average_latency | tflops | kernel
-- | -- | -- | --
Q,K,V (BNSH) | 0.000069 | 249.2 | torch:flash
Q,K,V (BNSH) | 0.000141 | 121.7 | torch:efficient
Q,K,V (BNSH) | 0.000294 | 58.5 | torch:math
Q,K,V (BSNH) | 0.000077 | 221.7 | ort:cudnn
Q,K,V (BSNH)  | 0.000087 | 196.6 | ort:flash
Q,K,V (BSNH)  | 0.000163 | 105.6 | ort:efficient
Q,K,V (BSNH)  | 0.000651 | 26.4 | ort:math
Q,KV | 0.000103 | 167.1 | ort:cudnn
Q,KV | 0.000117 | 146.3 | ort:flash
Q,KV | 0.000192 | 89.6 | ort:efficient
QKV | 0.000113 | 151.5 | ort:cudnn
QKV | 0.000128 | 134.7 | ort:flash
QKV | 0.000201 | 85.3 | ort:efficient
2024-08-20 08:50:22 -07:00
..
emsdk@0fde048800 Upgrade emsdk from 3.1.59 to 3.1.62 (#21421) 2024-08-14 12:38:52 -07:00
git.Win32.2.41.03.patch Fix ability to use patch on Windows CI machines (#18356) 2023-11-11 07:32:14 +10:00
libprotobuf-mutator@7a2ed51a6b
onnx@595228d99e Update to onnx 1.16.1 (#20702) 2024-06-04 11:06:28 -07:00
abseil-cpp.cmake Update C++ dependencies (#21410) 2024-07-23 10:00:36 -07:00
abseil-cpp.natvis Update abseil to a release tag and register neural_speed (#19255) 2024-01-24 14:37:39 -08:00
composable_kernel.cmake [ROCm] Update ck to use ck_tile (#21030) 2024-06-19 14:06:10 +08:00
cuDNN.cmake [CUDA] cuDNN Flash Attention (#21629) 2024-08-20 08:50:22 -07:00
cudnn_frontend.cmake Adding CUDNN Frontend and use for CUDA NN Convolution (#19470) 2024-08-02 15:16:42 -07:00
cutlass.cmake [CUDA] upgrade cutlass to 3.5.0 (#20940) 2024-06-11 13:32:15 -07:00
dml.cmake [DML EP] Update DML to 1.15.1 (#21695) 2024-08-12 14:16:43 -07:00
dnnl.cmake Update oneDNN to v3.0.1 in order to support gcc 13 (#19344) 2024-02-01 15:39:03 -08:00
eigen.cmake Fix ability to use patch on Windows CI machines (#18356) 2023-11-11 07:32:14 +10:00
extensions.cmake Update C/C++ dependencies: abseil, date, nsync, googletest, wil, mp11, cpuinfo and safeint (#15470) 2023-09-08 13:35:04 -07:00
find_snpe.cmake Improve dependency management (#13523) 2022-12-01 09:51:59 -08:00
FindNumPy.cmake
helper_functions.cmake Put all external project targets under the 'External' folder in VS (#21765) 2024-08-16 15:51:50 +10:00
ipp-crypto.cmake [TVM EP] Hot fix of build on Windows of TVM EP with ipp-crypto (#12381) 2022-07-31 14:36:54 +02:00
mimalloc.cmake Improve dependency management (#13523) 2022-12-01 09:51:59 -08:00
neural_speed.cmake turn on neural_speed by default (#19627) 2024-03-20 12:49:58 -07:00
onnx_minimal.cmake Fix some build issues on MacOS with Xcode 14.3. (#15878) 2023-06-07 12:07:11 -07:00
onnx_protobuf.natvis Fix visualization issues with Attribute/Tensor protos (#17188) 2023-08-16 13:56:51 -07:00
onnxruntime_external_deps.cmake Adding CUDNN Frontend and use for CUDA NN Convolution (#19470) 2024-08-02 15:16:42 -07:00
protobuf_function.cmake Fix some build issues on MacOS with Xcode 14.3. (#15878) 2023-06-07 12:07:11 -07:00
pybind11.cmake Improve dependency management (#13523) 2022-12-01 09:51:59 -08:00
pyxir.cmake Check for Python_EXECUTABLE in pyxir.cmake to fix Vitis AI EP build (#8631) 2021-08-24 08:39:50 -07:00
tvm.cmake [TVM EP] Support zero copying TVM EP output tensor to ONNX Runtime output tensor (#12593) 2023-02-08 10:02:20 -08:00
wil.cmake Rework WIL dependency retrieval/usage (#17130) 2023-08-15 09:11:46 -07:00
xnnpack.cmake Enable RISC-V 64-bit Cross-Compiling Support for ONNX Runtime on Linux (#19238) 2024-01-24 16:27:05 -08:00