onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-04 04:07:22 +00:00

Ye Wang 2ee822d483 Extend memory efficient attention coverage in Attention/MHA cuda op (#15064)

### Description

1. Upgrade cutlass to 3.0, which contains attn_bias support.
2. Extend Attention/MHA to use memory efficient attention when rel_pos_bias with shape [1, num_head, s, s] and a 1D mask with [2 batch_size + 1] are present.

New mask format: MASK_1D_KEY_SEQ_LEN_START, of length [3 * batch_size + 2], laid out as
[key_len[0], ..., key_len[batch_size - 1], query_start[0], ..., query_start[batch_size - 1], query_end[batch_size - 1], key_start[0], ..., key_start[batch_size - 1], key_end[batch_size - 1]]

E.g. the 2D mask [[1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 0]] converts to the 1D mask [3, 5, 0, 6, 12, 0, 6, 12].

### Motivation and Context

It potentially benefits tnlrv6 and t5 (encoder).

---------

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
Co-authored-by: Kunal Vaishnavi <kvaishnavi@microsoft.com>
Co-authored-by: Kunal Vaishnavi <kvaishnavi@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>

2023-03-23 11:05:17 -07:00
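The mask layout described in the commit message can be illustrated with a small sketch. This is not code from the repository: the function name `mask_2d_to_key_seq_len_start` is hypothetical, and the sketch assumes a right-padded 2D mask where the query and key sequences share the same padded length, so that the start offsets are multiples of that length (as in the commit's own example).

```python
from typing import List


def mask_2d_to_key_seq_len_start(mask_2d: List[List[int]]) -> List[int]:
    """Convert a right-padded 2D attention mask to the 1D layout
    [key_len..., query_start..., query_end, key_start..., key_end].

    Hypothetical helper for illustration only. Assumes the query and key
    sequences have the same padded length (self-attention case).
    """
    batch_size = len(mask_2d)
    if batch_size == 0:
        raise ValueError("mask_2d must contain at least one row")
    seq_len = len(mask_2d[0])

    key_len = []
    for row in mask_2d:
        if len(row) != seq_len:
            raise ValueError("all rows must have the same padded length")
        valid = sum(row)
        # Only right-padded masks (ones followed by zeros) map to this format.
        if row != [1] * valid + [0] * (seq_len - valid):
            raise ValueError("mask rows must be right-padded")
        key_len.append(valid)

    # Offsets of each sequence in the flattened (batch * seq_len) buffer.
    query_start = [i * seq_len for i in range(batch_size)]
    query_end = [batch_size * seq_len]
    # Assumption: keys use the same padded length as queries.
    key_start = list(query_start)
    key_end = list(query_end)

    return key_len + query_start + query_end + key_start + key_end


example = mask_2d_to_key_seq_len_start(
    [[1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 0]]
)
print(example)  # [3, 5, 0, 6, 12, 0, 6, 12]
```

The result has 3 * batch_size + 2 entries and reproduces the example given in the commit message for batch_size = 2.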
..
android_custom_build	Update Gradle version (#14862)	2023-03-08 12:22:06 -08:00
ci_build	Extend memory efficient attention coverage in Attention/MHA cuda op (#15064)	2023-03-23 11:05:17 -07:00
doc	Format all python files under onnxruntime with black and isort (#11324)	2022-04-26 09:35:16 -07:00
nuget	[DML EP] Upgrade DML to 1.10.1 (#14433)	2023-01-25 21:07:10 -08:00
perf_view	fix json format (#11046)	2022-03-30 16:15:33 -07:00
python	Move offline_tuning.py, so that the utility will be package with whl distribution (#15124)	2023-03-23 15:24:41 +08:00