onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-05-16 21:00:14 +00:00

Tianlei Wu fbc3927231 [CUDA] cuDNN Flash Attention (#21629)	2024-08-20 08:50:22 -07:00

### Description

- [x] Add cuDNN flash attention using the cuDNN frontend, and enable it in the MultiHeadAttention operator.
- [x] Support attention mask.
- [x] Support attention bias.
- [x] Update tests and benchmark script.

The cuDNN SDPA is disabled by default. Enabling it requires the following:

(1) cuDNN 9.3 or newer must be installed.
(2) Set the environment variable `ORT_ENABLE_CUDNN_FLASH_ATTENTION=1`, or set the CUDA provider option `sdpa_kernel=8`.
(3) The device must have compute capability >= 8.0.

Note that some combinations of parameters might be rejected due to limited support for head dimensions or sequence lengths.

Future Works:

(1) FP8 and BF16 APIs. Currently, only the FP16 API is exposed.
(2) Add an API to support ragged batching (padding removed from inputs).
(3) Support other input formats (like QKV_BS3NH).
(4) Currently, q is converted to BSNH, and k/v are converted to either BSNH or BNSH format. Experiments may show whether converting q to BNSH is better in some cases.

### Example Benchmark Results on H100

The following tests run the FP16 MultiHeadAttention operator without attention mask or attention bias.

#### Test Setting 1

batch_size | sequence_length | past_sequence_length | num_heads | head_size
-- | -- | -- | -- | --
16 | 256 | 0 | 32 | 128

format | average_latency | tflops | kernel
-- | -- | -- | --
Q,K,V (BNSH) | 0.000075 | 229.5 | torch:flash
Q,K,V (BNSH) | 0.000119 | 144.8 | torch:efficient
Q,K,V (BNSH) | 0.000224 | 76.5 | torch:math
Q,K,V (BSNH) | 0.000075 | 227.8 | ort:cudnn
Q,K,V (BSNH) | 0.000094 | 182.8 | ort:flash
Q,K,V (BSNH) | 0.000138 | 124.7 | ort:efficient
Q,K,V (BSNH) | 0.000438 | 39.3 | ort:math
Q,KV | 0.000129 | 133.0 | ort:cudnn
Q,KV | 0.000151 | 114.1 | ort:flash
Q,KV | 0.000194 | 88.5 | ort:efficient
QKV | 0.000154 | 111.8 | ort:cudnn
QKV | 0.000175 | 98.0 | ort:flash
QKV | 0.000217 | 79.0 | ort:efficient

#### Test Setting 2

batch_size | sequence_length | past_sequence_length | num_heads | head_size
-- | -- | -- | -- | --
16 | 512 | 0 | 16 | 64

format | average_latency | tflops | kernel
-- | -- | -- | --
Q,K,V (BNSH) | 0.000069 | 249.2 | torch:flash
Q,K,V (BNSH) | 0.000141 | 121.7 | torch:efficient
Q,K,V (BNSH) | 0.000294 | 58.5 | torch:math
Q,K,V (BSNH) | 0.000077 | 221.7 | ort:cudnn
Q,K,V (BSNH) | 0.000087 | 196.6 | ort:flash
Q,K,V (BSNH) | 0.000163 | 105.6 | ort:efficient
Q,K,V (BSNH) | 0.000651 | 26.4 | ort:math
Q,KV | 0.000103 | 167.1 | ort:cudnn
Q,KV | 0.000117 | 146.3 | ort:flash
Q,KV | 0.000192 | 89.6 | ort:efficient
QKV | 0.000113 | 151.5 | ort:cudnn
QKV | 0.000128 | 134.7 | ort:flash
QKV | 0.000201 | 85.3 | ort:efficient
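
The two ways of enabling cuDNN SDPA described in the commit message could be set up from Python roughly as follows. This is a minimal sketch, not code from the repository: the helper name `cudnn_sdpa_providers` is hypothetical, the model path in the usage comment is a placeholder, and the sketch assumes the environment variable has to be set before the session is created.

```python
import os

# Option 1: the environment variable named in the commit message.
# Assumption: it needs to be set before the CUDA provider is initialized.
os.environ["ORT_ENABLE_CUDNN_FLASH_ATTENTION"] = "1"


def cudnn_sdpa_providers(device_id: int = 0):
    """Hypothetical helper: build a providers list that requests cuDNN SDPA.

    Option 2 from the commit message: the CUDA provider option sdpa_kernel=8.
    Option values are passed as strings here.
    """
    cuda_options = {"device_id": str(device_id), "sdpa_kernel": "8"}
    # Fall back to the CPU provider if the CUDA provider is unavailable.
    return [("CUDAExecutionProvider", cuda_options), "CPUExecutionProvider"]


providers = cudnn_sdpa_providers()

# Usage (needs a GPU build of onnxruntime, cuDNN 9.3 or newer, and a device
# with compute capability >= 8.0):
#   import onnxruntime as ort
#   session = ort.InferenceSession("model.onnx", providers=providers)
```

Either option alone should be enough according to the commit message; both appear here only to show the two forms side by side.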
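
The `tflops` column can be related to `average_latency` with a short calculation. The sketch below rests on assumptions the commit message does not state: latency is in seconds, and the FLOP count is the common non-causal estimate of `4 * batch_size * num_heads * sequence_length^2 * head_size` (two matrix multiplications, with `past_sequence_length` equal to 0). The function name `attention_tflops` is illustrative only.

```python
def attention_tflops(batch_size, sequence_length, num_heads, head_size,
                     latency_seconds):
    """Estimate attention throughput in TFLOPS (assumed formula, see lead-in)."""
    # Q @ K^T and P @ V each cost 2 * S * S * H FLOPs per head and batch item.
    flops = (4 * batch_size * num_heads
             * sequence_length * sequence_length * head_size)
    return flops / latency_seconds / 1e12


# torch:flash rows from the two test settings.
setting1 = attention_tflops(16, 256, 32, 128, 0.000075)
setting2 = attention_tflops(16, 512, 16, 64, 0.000069)
print(f"{setting1:.1f} {setting2:.1f}")
```

Under these assumptions the estimates come out at roughly 229 and 249 TFLOPS, close to the reported 229.5 and 249.2. The small gap is consistent with the latencies being rounded to six decimal places.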
emsdk@0fde048800	Upgrade emsdk from 3.1.59 to 3.1.62 (#21421)	2024-08-14 12:38:52 -07:00
git.Win32.2.41.03.patch	Fix ability to use patch on Windows CI machines (#18356)	2023-11-11 07:32:14 +10:00
libprotobuf-mutator@7a2ed51a6b
onnx@595228d99e	Update to onnx 1.16.1 (#20702)	2024-06-04 11:06:28 -07:00
abseil-cpp.cmake	Update C++ dependencies (#21410)	2024-07-23 10:00:36 -07:00
abseil-cpp.natvis	Update abseil to a release tag and register neural_speed (#19255)	2024-01-24 14:37:39 -08:00
composable_kernel.cmake	[ROCm] Update ck to use ck_tile (#21030)	2024-06-19 14:06:10 +08:00
cuDNN.cmake	[CUDA] cuDNN Flash Attention (#21629)	2024-08-20 08:50:22 -07:00
cudnn_frontend.cmake	Adding CUDNN Frontend and use for CUDA NN Convolution (#19470)	2024-08-02 15:16:42 -07:00
cutlass.cmake	[CUDA] upgrade cutlass to 3.5.0 (#20940)	2024-06-11 13:32:15 -07:00
dml.cmake	[DML EP] Update DML to 1.15.1 (#21695)	2024-08-12 14:16:43 -07:00
dnnl.cmake	Update oneDNN to v3.0.1 in order to support gcc 13 (#19344)	2024-02-01 15:39:03 -08:00
eigen.cmake	Fix ability to use patch on Windows CI machines (#18356)	2023-11-11 07:32:14 +10:00
extensions.cmake	Update C/C++ dependencies: abseil, date, nsync, googletest, wil, mp11, cpuinfo and safeint (#15470)	2023-09-08 13:35:04 -07:00
find_snpe.cmake	Improve dependency management (#13523)	2022-12-01 09:51:59 -08:00
FindNumPy.cmake
helper_functions.cmake	Put all external project targets under the 'External' folder in VS (#21765)	2024-08-16 15:51:50 +10:00
ipp-crypto.cmake	[TVM EP] Hot fix of build on Windows of TVM EP with ipp-crypto (#12381)	2022-07-31 14:36:54 +02:00
mimalloc.cmake	Improve dependency management (#13523)	2022-12-01 09:51:59 -08:00
neural_speed.cmake	turn on neural_speed by default (#19627)	2024-03-20 12:49:58 -07:00
onnx_minimal.cmake	Fix some build issues on MacOS with Xcode 14.3. (#15878)	2023-06-07 12:07:11 -07:00
onnx_protobuf.natvis	Fix visualization issues with Attribute/Tensor protos (#17188)	2023-08-16 13:56:51 -07:00
onnxruntime_external_deps.cmake	Adding CUDNN Frontend and use for CUDA NN Convolution (#19470)	2024-08-02 15:16:42 -07:00
protobuf_function.cmake	Fix some build issues on MacOS with Xcode 14.3. (#15878)	2023-06-07 12:07:11 -07:00
pybind11.cmake	Improve dependency management (#13523)	2022-12-01 09:51:59 -08:00
pyxir.cmake	Check for Python_EXECUTABLE in pyxir.cmake to fix Vitis AI EP build (#8631)	2021-08-24 08:39:50 -07:00
tvm.cmake	[TVM EP] Support zero copying TVM EP output tensor to ONNX Runtime output tensor (#12593)	2023-02-08 10:02:20 -08:00
wil.cmake	Rework WIL dependency retrieval/usage (#17130)	2023-08-15 09:11:46 -07:00
xnnpack.cmake	Enable RISC-V 64-bit Cross-Compiling Support for ONNX Runtime on Linux (#19238)	2024-01-24 16:27:05 -08:00