onnxruntime/tools
aciddelgado cbb29d80ff
GQA Rotary and Packed QKV with Flash (#18906)
### Description
These changes add rotary embedding and packed QKV input to GQA. As of
now, the changes are only supported with Flash Attention (SM >= 80), but
they should soon be supported with Memory Efficient Attention as well.
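For context, the rotary embedding being fused here is the standard RoPE transform: each query/key head dimension is split into pairs, and each pair is rotated by a position-dependent angle. A minimal NumPy sketch of the half-split (NeoX-style) convention follows; this is an illustration of the math, not the ORT CUDA kernel, and the shapes and `base` value are illustrative:

```python
import numpy as np

def rotary_embed(x, base=10000.0):
    # x: (seq_len, num_heads, head_dim); head_dim must be even.
    seq_len, num_heads, head_dim = x.shape
    half = head_dim // 2
    # Per-pair inverse frequencies, as in the usual RoPE formulation.
    inv_freq = 1.0 / (base ** (np.arange(half) / half))
    angles = np.outer(np.arange(seq_len), inv_freq)   # (seq_len, half)
    cos = np.cos(angles)[:, None, :]                  # broadcast over heads
    sin = np.sin(angles)[:, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    # 2-D rotation of each (x1, x2) pair by its position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)
```

Note that position 0 is left unchanged (all angles are zero), which is a quick sanity check for any rotary implementation.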

### Motivation and Context
With the fusion of rotary embedding into this Attention op, we hope to
observe some perf gain. The packed QKV input should also provide a perf
gain for certain models, such as Llama2, that benefit from running ops
on the fused QKV matrix rather than on the separate Q, K, and V
matrices.
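The packed-QKV perf argument can be sketched concretely: concatenating the Q, K, and V projection weights lets one wide GEMM replace three smaller ones, after which the result is split back out. The shapes below are hypothetical GQA shapes (fewer KV heads than Q heads) chosen for illustration:

```python
import numpy as np

hidden = 64
num_heads, kv_heads, head_dim = 4, 2, 16   # illustrative GQA head counts
rng = np.random.default_rng(0)
Wq = rng.standard_normal((hidden, num_heads * head_dim))
Wk = rng.standard_normal((hidden, kv_heads * head_dim))
Wv = rng.standard_normal((hidden, kv_heads * head_dim))
x = rng.standard_normal((3, hidden))       # (tokens, hidden)

# Separate projections: three GEMMs.
q, k, v = x @ Wq, x @ Wk, x @ Wv

# Packed projection: one wider GEMM, then a split.
W_qkv = np.concatenate([Wq, Wk, Wv], axis=1)
qkv = x @ W_qkv
q2, k2, v2 = np.split(
    qkv,
    [num_heads * head_dim, (num_heads + kv_heads) * head_dim],
    axis=1,
)

assert np.allclose(q, q2) and np.allclose(k, k2) and np.allclose(v, v2)
```

The fused GEMM produces bitwise-equivalent results while giving the GPU one large, better-utilized matrix multiply instead of three launches.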

---------

Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>
2024-01-23 16:34:26 -08:00
| Directory | Last commit | Date |
| --- | --- | --- |
| android_custom_build | Update NDK version to 26.1.10909125 (#18493) | 2023-11-17 14:14:01 -08:00 |
| ci_build | GQA Rotary and Packed QKV with Flash (#18906) | 2024-01-23 16:34:26 -08:00 |
| doc | Disable PERF* rules in ruff to allow better readability (#16834) | 2023-07-25 15:38:22 -07:00 |
| nuget | Update DirectML nuget version to 1.13.1 (#19122) | 2024-01-15 19:04:41 -08:00 |
| perf_view | fixed #16873 (#16932) | 2023-09-26 09:57:01 -07:00 |
| python | Update to allow large models to be checked for mobile support. (#18357) | 2023-11-17 07:20:16 +10:00 |
| scripts | Remove dnf update from docker build scripts (#17551) | 2023-09-21 07:33:29 -07:00 |