onnxruntime/tools
aciddelgado cbb29d80ff
GQA Rotary and Packed QKV with Flash (#18906)
### Description
These changes add rotary embedding and packed QKV input to GQA. As of
now, the changes are only supported with Flash Attention (SM >= 80), but
they should soon be supported with Memory Efficient Attention as well.
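For context, the rotary embedding being fused here is the standard RoPE transform: each query/key head dimension is split into pairs, and each pair is rotated by a position-dependent angle. A minimal NumPy sketch of the half-split (NeoX-style) convention follows; this is an illustration of the math, not the ORT CUDA kernel, and the shapes and `base` value are illustrative:

```python
import numpy as np

def rotary_embed(x, base=10000.0):
    # x: (seq_len, num_heads, head_dim); head_dim must be even.
    seq_len, num_heads, head_dim = x.shape
    half = head_dim // 2
    # Per-pair inverse frequencies, as in the usual RoPE formulation.
    inv_freq = 1.0 / (base ** (np.arange(half) / half))
    angles = np.outer(np.arange(seq_len), inv_freq)   # (seq_len, half)
    cos = np.cos(angles)[:, None, :]                  # broadcast over heads
    sin = np.sin(angles)[:, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    # 2-D rotation of each (x1, x2) pair by its position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)
```

Note that position 0 is left unchanged (all angles are zero), which is a quick sanity check for any rotary implementation.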

### Motivation and Context
With the fusion of rotary embedding into this Attention op, we hope to
observe some perf gain. The packed QKV input should also provide a perf
gain for certain models, such as Llama2, that benefit from running ops
on the fused QKV matrix rather than on the separate Q, K, and V
matrices.
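The packed-QKV perf argument can be sketched concretely: concatenating the Q, K, and V projection weights lets one wide GEMM replace three smaller ones, after which the result is split back out. The shapes below are hypothetical GQA shapes (fewer KV heads than Q heads) chosen for illustration:

```python
import numpy as np

hidden = 64
num_heads, kv_heads, head_dim = 4, 2, 16   # illustrative GQA head counts
rng = np.random.default_rng(0)
Wq = rng.standard_normal((hidden, num_heads * head_dim))
Wk = rng.standard_normal((hidden, kv_heads * head_dim))
Wv = rng.standard_normal((hidden, kv_heads * head_dim))
x = rng.standard_normal((3, hidden))       # (tokens, hidden)

# Separate projections: three GEMMs.
q, k, v = x @ Wq, x @ Wk, x @ Wv

# Packed projection: one wider GEMM, then a split.
W_qkv = np.concatenate([Wq, Wk, Wv], axis=1)
qkv = x @ W_qkv
q2, k2, v2 = np.split(
    qkv,
    [num_heads * head_dim, (num_heads + kv_heads) * head_dim],
    axis=1,
)

assert np.allclose(q, q2) and np.allclose(k, k2) and np.allclose(v, v2)
```

The fused GEMM produces bitwise-equivalent results while giving the GPU one large, better-utilized matrix multiply instead of three launches.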

---------

Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>
2024-01-23 16:34:26 -08:00
| Directory | Last commit | Date |
| --- | --- | --- |
| android_custom_build | Update NDK version to 26.1.10909125 (#18493) | 2023-11-17 14:14:01 -08:00 |
| ci_build | GQA Rotary and Packed QKV with Flash (#18906) | 2024-01-23 16:34:26 -08:00 |
| doc | Disable PERF* rules in ruff to allow better readability (#16834) | 2023-07-25 15:38:22 -07:00 |
| nuget | Update DirectML nuget version to 1.13.1 (#19122) | 2024-01-15 19:04:41 -08:00 |
| perf_view | fixed #16873 (#16932) | 2023-09-26 09:57:01 -07:00 |
| python | Update to allow large models to be checked for mobile support. (#18357) | 2023-11-17 07:20:16 +10:00 |
| scripts | Remove dnf update from docker build scripts (#17551) | 2023-09-21 07:33:29 -07:00 |