onnxruntime/cmake
Tianlei Wu 6a9dc6c993
[CUDA] Update fused MHA to support flash attention and causal mask (#13953)
### Description
Update fused attention kernels to support flash attention and causal
mask (GPT-2 initial decoder run).

Note: Causal kernels are from FasterTransformer 5.2. Flash attention
kernels that is not causal are from TensorRT 8.5.1.

#### Performance Test of bert-base model

Test like the following:
```
 python -m onnxruntime.transformers.benchmark -m bert-base-cased -b 1 4 8 16 32 64 -s 512 -t 1000 -o by_script -g -p fp16 -i 3 --use_mask_index
```

Original Flash Attention is from
https://github.com/HazyResearch/flash-attention. RemovePadding and
RestorePadding is added before/after the original flash attention but
not for this PR, so the result is not apple-to-apple comparison. It is
added for reference only.

Average latency (ms) of float16 bert-base-cased model:

* A100

Kernel  | b1_s512 | b4_s512 | b8_s512 | b16_s512 | b32_s512 | b64_s512 |
b128_s512
-- | -- | -- | -- | -- | -- | -- | --
Unfused | 1.83 | 5.00 | 9.31 | 17.76 | 34.47 | 67.43 | 133.38
TRT Fused | 2.05 | 3.58 | 5.70 | 10.96 | 21.22 | 41.23 | 80.56
Flash Attention (from FT) | 1.43 | 3.20 | 5.71 | 10.95 | 22.19 | 42.96 |
84.54
Flash Attention (from TRT) | 1.44 | 3.28 | 5.70 | 10.86 | 21.00 | 40.56
| 79.53
Original Flash Attention | 1.81 | 4.04 | 6.82 | 13.06 | 24.62 | 46.58 |
91.10

* T4

  | b1_s512 | b4_s512 | b8_s512 | b16_s512 | b32_s512 | b64_s512
-- | -- | -- | -- | -- | -- | --
Unfused | 8.17 | 29.86 | 59.56 | 115.77 | 236.66 | 461.43
Flash Attention (from FT) | 5.65 | 21.12 | 44.94 | 86.83 | 174.16 |
351.38
Flash Attention (from TRT) | 5.73| 21.49| 45.49 | 89.15 | 174.37 |
352.08
Original Flash Attention | 6.22 | 22.16 | 43.39 | 83.8 | 168.77 | 337.04

* V100

Kernel | b1_s512 | b4_512 | b8_s512 | b16_s512 | b32_s512 | b64_s512
-- | -- | -- | -- | -- | -- | --
Unfused | 3.77 | 10.48 | 19.53 | 37.63 | 73.68 | 145.58
Flash Attention (from FT) | 3.21 | 8.25 | 14.95 | 28.83 | 56.28 | 111.15

#### Performance Test of GPT-2 model
Test like the following:
`
python benchmark_gpt2.py -m distilgpt2 -o --stage 1 --use_gpu -p fp16 -b
1 4 8 16 32 64 128 -s 0 --sequence_lengths 8 16 32 64 128 256 512
`
* A100

Note that flash attention is used as fused attention when
sequence_length > 128.

batch_size | sequence_length | with Fused Attention | without Fused
Attention | A100 Gain
-- | -- | -- | -- | --
1 | 8 | 0.93 | 1 | 7.0%
4 | 8 | 0.82 | 0.88 | 6.8%
8 | 8 | 0.84 | 0.88 | 4.5%
16 | 8 | 0.92 | 0.97 | 5.2%
32 | 8 | 1.15 | 1.17 | 1.7%
64 | 8 | 1.68 | 1.72 | 2.3%
128 | 8 | 2.76 | 2.78 | 0.7%
1 | 16 | 0.95 | 0.95 | 0.0%
4 | 16 | 0.83 | 0.88 | 5.7%
8 | 16 | 0.91 | 0.97 | 6.2%
16 | 16 | 1.12 | 1.17 | 4.3%
32 | 16 | 1.67 | 1.72 | 2.9%
64 | 16 | 2.73 | 2.76 | 1.1%
128 | 16 | 4.96 | 4.95 | -0.2%
1 | 32 | 0.94 | 0.88 | -6.8%
4 | 32 | 0.91 | 0.97 | 6.2%
8 | 32 | 1.12 | 1.17 | 4.3%
16 | 32 | 1.65 | 1.71 | 3.5%
32 | 32 | 2.69 | 2.76 | 2.5%
64 | 32 | 4.86 | 4.94 | 1.6%
128 | 32 | 9.35 | 9.38 | 0.3%
1 | 64 | 0.84 | 0.88 | 4.5%
4 | 64 | 1.1 | 1.17 | 6.0%
8 | 64 | 1.64 | 1.73 | 5.2%
16 | 64 | 2.66 | 2.77 | 4.0%
32 | 64 | 4.82 | 4.97 | 3.0%
64 | 64 | 9.23 | 9.4 | 1.8%
128 | 64 | 18.54 | 19.12 | 3.0%
1 | 128 | 0.91 | 0.98 | 7.1%
4 | 128 | 1.68 | 1.74 | 3.4%
8 | 128 | 2.71 | 2.83 | 4.2%
16 | 128 | 4.85 | 5.09 | 4.7%
32 | 128 | 9.32 | 9.69 | 3.8%
64 | 128 | 18.54 | 19.44 | 4.6%
128 | 128 | 36.86 | 38.47 | 4.2%
1 | 256 | 1.15 | 1.23 | 6.5%
4 | 256 | 2.71 | 2.95 | 8.1%
8 | 256 | 4.87 | 5.3 | 8.1%
16 | 256 | 9.32 | 10.23 | 8.9%
32 | 256 | 18.6 | 20.53 | 9.4%
64 | 256 | 36.93 | 40.41 | 8.6%
128 | 256 | 72.84 | 80.14 | 9.1%
1 | 512 | 1.68 | 1.96 | 14.3%
4 | 512 | 4.9 | 6.02 | 18.6%
8 | 512 | 9.4 | 11.59 | 18.9%
16 | 512 | 18.71 | 23.05 | 18.8%
32 | 512 | 37.13 | 45.46 | 18.3%
64 | 512 | 74.04 | 89.88 | 17.6%
128 | 512 | NA | NA | NA

* T4:

batch_size | sequence_length | with Fused Attention | with Unfused
Attention | T4 Gain
-- | -- | -- | -- | --
1 | 8 | 1.97 | 2.11 | 6.6%
4 | 8 | 2.2 | 2.25 | 2.2%
8 | 8 | 2.77 | 3.1 | 10.6%
16 | 8 | 4.17 | 4.2 | 0.7%
32 | 8 | 6.86 | 6.82 | -0.6%
64 | 8 | 14.88 | 14.92 | 0.3%
128 | 8 | 31.4 | 31.29 | -0.4%
1 | 16 | 1.61 | 1.71 | 5.8%
4 | 16 | 2.13 | 2.31 | 7.8%
8 | 16 | 3.38 | 3.67 | 7.9%
16 | 16 | 6.16 | 6.54 | 5.8%
32 | 16 | 14.16 | 14.76 | 4.1%
64 | 16 | 30.36 | 30.57 | 0.7%
128 | 16 | 63.14 | 63.57 | 0.7%
1 | 32 | 1.53 | 1.69 | 9.5%
4 | 32 | 3.34 | 3.66 | 8.7%
8 | 32 | 6.25 | 6.64 | 5.9%
16 | 32 | 14.12 | 14.9 | 5.2%
32 | 32 | 28.96 | 29.82 | 2.9%
64 | 32 | 61.07 | 61.77 | 1.1%
128 | 32 | 116.38 | 117.98 | 1.4%
1 | 64 | 2.01 | 2.21 | 9.0%
4 | 64 | 6.18 | 6.67 | 7.3%
8 | 64 | 13.72 | 14.49 | 5.3%
16 | 64 | 28.71 | 29.83 | 3.8%
32 | 64 | 58.65 | 60.68 | 3.3%
64 | 64 | 113.09 | 113.17 | 0.1%
128 | 64 | 205.21 | 209.4 | 2.0%
1 | 128 | 3.37 | 3.76 | 10.4%
4 | 128 | 13.54 | 14.85 | 8.8%
8 | 128 | 28.32 | 30.22 | 6.3%
16 | 128 | 58.16 | 62.09 | 6.3%
32 | 128 | 109.17 | 113.99 | 4.2%
64 | 128 | 198.9 | 207.1 | 4.0%
128 | 128 | 413.25 | 421.82 | 2.0%
1 | 256 | 6.33 | 7.05 | 10.2%
4 | 256 | 28.09 | 31.49 | 10.8%
8 | 256 | 57.47 | 62.76 | 8.4%
16 | 256 | 106.77 | 117.95 | 9.5%
32 | 256 | 197.02 | 208.58 | 5.5%
64 | 256 | 406.81 | 431.36 | 5.7%
128 | 256 | NA | NA | NA
1 | 512 | 13.84 | 16.32 | 15.2%
4 | 512 | NA | NA | NA
8 | 512 | NA | NA | NA
16 | 512 | NA | NA | NA
32 | 512 | NA | NA | NA
64 | 512 | NA | NA | NA
128 | 512 | NA | NA | NA

* V100:

batch_size | sequence_length | with Fused Attention | with Unfused
Attention | V100 Gain
-- | -- | -- | -- | --
1 | 8 | 1.31 | 1.6 | 18.1%
4 | 8 | 1.17 | 1.26 | 7.1%
8 | 8 | 1.43 | 1.79 | 20.1%
16 | 8 | 2.14 | 1.96 | -9.2%
32 | 8 | 2.91 | 3.08 | 5.5%
64 | 8 | 5.32 | 5.27 | -0.9%
128 | 8 | 9.34 | 8.97 | -4.1%
1 | 16 | 1.41 | 1.58 | 10.8%
4 | 16 | 1.38 | 1.49 | 7.4%
8 | 16 | 1.81 | 2.2 | 17.7%
16 | 16 | 2.8 | 2.83 | 1.1%
32 | 16 | 4.94 | 4.99 | 1.0%
64 | 16 | 8.88 | 8.84 | -0.5%
128 | 16 | 17.35 | 17.2 | -0.9%
1 | 32 | 1.38 | 1.77 | 22.0%
4 | 32 | 1.77 | 1.93 | 8.3%
8 | 32 | 2.71 | 2.86 | 5.2%
16 | 32 | 5.03 | 4.92 | -2.2%
32 | 32 | 8.8 | 8.79 | -0.1%
64 | 32 | 17.29 | 17.23 | -0.3%
128 | 32 | 33.27 | 33.1 | -0.5%
1 | 64 | 1.67 | 1.87 | 10.7%
4 | 64 | 2.69 | 2.76 | 2.5%
8 | 64 | 4.87 | 4.94 | 1.4%
16 | 64 | 8.73 | 8.81 | 0.9%
32 | 64 | 16.92 | 17.24 | 1.9%
64 | 64 | 33 | 33.38 | 1.1%
128 | 64 | 65.33 | 65.86 | 0.8%
1 | 128 | 2.03 | 2.22 | 8.6%
4 | 128 | 4.9 | 5.04 | 2.8%
8 | 128 | 8.76 | 8.81 | 0.6%
16 | 128 | 17.06 | 17.29 | 1.3%
32 | 128 | 33.25 | 33.56 | 0.9%
64 | 128 | 65.54 | 66.5 | 1.4%
128 | 128 | 130.44 | 131.44 | 0.8%
1 | 256 | 2.78 | 2.86 | 2.8%
4 | 256 | 8.75 | 9.04 | 3.2%
8 | 256 | 17 | 17.68 | 3.8%
16 | 256 | 33.19 | 34.32 | 3.3%
32 | 256 | 65.43 | 67.86 | 3.6%
64 | 256 | 129.92 | 134.68 | 3.5%
128 | 256 | NA | NA | NA
1 | 512 | 4.95 | 5.32 | 7.0%
4 | 512 | NA | NA | NA
8 | 512 | NA | NA | NA
16 | 512 | NA | NA | NA
32 | 512 | NA | NA | NA
64 | 512 | NA | NA | NA
128 | 512 | NA | NA | NA
2022-12-31 10:33:54 -08:00
..
external Let Cmake decide where to place abseil (#14057) 2022-12-23 12:08:13 -08:00
patches Update absl to the latest release (#13990) 2022-12-19 14:25:13 -08:00
tensorboard Improve dependency management (#13523) 2022-12-01 09:51:59 -08:00
adjust_global_compile_flags.cmake Multi-stream execution support (#13495) 2022-12-15 07:39:29 -08:00
CMakeLists.txt Fix deprecated-builtins (#14001) 2022-12-17 18:17:05 +08:00
CMakeSettings.json
codeconv.runsettings
deps.txt Update absl to the latest release (#13990) 2022-12-19 14:25:13 -08:00
EnableVisualStudioCodeAnalysis.props Fix SDL warnings in CPU EP (#9975) 2021-12-19 20:54:29 -08:00
gdk_toolchain.cmake Enable building with a GDK (#11126) 2022-04-07 15:06:31 -07:00
Info.plist.in
libonnxruntime.pc.cmake.in
nuget_helpers.cmake
onnxruntime.cmake Remove miscellaneous nuphar configs (#13070) 2022-09-26 13:41:28 -07:00
onnxruntime_codegen_tvm.cmake Use target name for flatbuffers (#13991) 2022-12-20 11:44:02 -08:00
onnxruntime_common.cmake Enabling thread pool to be numa-aware (#13778) 2022-12-12 10:33:55 -08:00
onnxruntime_config.h.in [wasm] update emscripten v2.0.34 (#10391) 2022-01-26 14:46:02 -08:00
onnxruntime_csharp.cmake Enable nuget packages for on device training (#13637) 2022-12-05 14:54:09 -08:00
onnxruntime_eager.cmake Use target name for flatbuffers (#13991) 2022-12-20 11:44:02 -08:00
onnxruntime_flatbuffers.cmake Use target name for flatbuffers (#13991) 2022-12-20 11:44:02 -08:00
onnxruntime_framework.cmake Use target name for flatbuffers (#13991) 2022-12-20 11:44:02 -08:00
onnxruntime_fuzz_test.cmake
onnxruntime_graph.cmake Use target name for flatbuffers (#13991) 2022-12-20 11:44:02 -08:00
onnxruntime_ios.toolchain.cmake
onnxruntime_java.cmake Add linux and macos arm64 java aritifacts (#10981) 2022-03-25 16:23:17 -07:00
onnxruntime_java_unittests.cmake
onnxruntime_kernel_explorer.cmake Share TunableOp between CUDA and ROCM EP (#13560) 2022-11-11 13:56:44 +08:00
onnxruntime_language_interop_ops.cmake Use target name for flatbuffers (#13991) 2022-12-20 11:44:02 -08:00
onnxruntime_mlas.cmake Switch GSL to MS GSL 4.0.0 (#13416) 2022-10-29 04:15:20 -07:00
onnxruntime_nodejs.cmake Add Node.js binding support to packaging pipeline (#9577) 2021-11-05 15:29:40 -07:00
onnxruntime_objectivec.cmake Remove SafeInt dependency from Objective-C API. (#13698) 2022-11-18 17:06:12 -08:00
onnxruntime_opschema_lib.cmake Use target name for flatbuffers (#13991) 2022-12-20 11:44:02 -08:00
onnxruntime_optimizer.cmake Optimize computation orders (#13672) 2022-12-22 15:12:52 +08:00
onnxruntime_providers.cmake Use target name for flatbuffers (#13991) 2022-12-20 11:44:02 -08:00
onnxruntime_pyop.cmake Use target name for flatbuffers (#13991) 2022-12-20 11:44:02 -08:00
onnxruntime_python.cmake Improve dependency management (#13523) 2022-12-01 09:51:59 -08:00
onnxruntime_rocm_hipify.cmake Sampling op (#13426) 2022-12-22 17:34:12 -08:00
onnxruntime_session.cmake Use target name for flatbuffers (#13991) 2022-12-20 11:44:02 -08:00
onnxruntime_snpe_provider.cmake Use target name for flatbuffers (#13991) 2022-12-20 11:44:02 -08:00
onnxruntime_training.cmake Use target name for flatbuffers (#13991) 2022-12-20 11:44:02 -08:00
onnxruntime_unittests.cmake [CUDA] Update fused MHA to support flash attention and causal mask (#13953) 2022-12-31 10:33:54 -08:00
onnxruntime_util.cmake Improve dependency management (#13523) 2022-12-01 09:51:59 -08:00
onnxruntime_webassembly.cmake Fix usage of enable_training_ops and reduce ifdef complexity for training builds (#13888) 2022-12-14 08:32:46 -08:00
precompiled_header.cmake Fix Windows Store build (#8753) 2021-08-23 11:19:03 -07:00
Sdl.ruleset Update Sdl.ruleset to remove C26812 from the rules (#12695) 2022-09-01 20:05:20 -07:00
set_winapi_family_desktop.h
target_delayload.cmake Remove Windows Store specific code 2022-03-17 23:38:14 -07:00
uwp_stubs.h Fix Windows Store build (#8753) 2021-08-23 11:19:03 -07:00
wcos_rules_override.cmake
winml.cmake Use target name for flatbuffers (#13991) 2022-12-20 11:44:02 -08:00
winml_cppwinrt.cmake Fix Windows Store build (#8753) 2021-08-23 11:19:03 -07:00
winml_sdk_helpers.cmake
winml_unittests.cmake Use target name for flatbuffers (#13991) 2022-12-20 11:44:02 -08:00