onnxruntime/onnxruntime/contrib_ops
petermcaughan febd5facce
Change head_size parameter dependent on qkv_hidden_size (#12933)
**Description**: Add `qkv_hidden_sizes` support to the CUDA implementation of the Attention layer.
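The commit title says the per-head size now depends on the Q, K and V hidden sizes. The sketch below illustrates that relationship under a few assumptions:

- `qkv_hidden_sizes` holds the Q, K and V hidden sizes in that order.
- Each hidden size is split evenly across `num_heads`.
- Q and K must share a hidden size.

`ComputeQkvHeadSizes` is a hypothetical helper written for illustration. It is not a function from ONNX Runtime.

```cpp
#include <array>
#include <cstdint>
#include <stdexcept>

// Hypothetical helper (not an ONNX Runtime API): derives per-head sizes from
// qkv_hidden_sizes = {q_hidden, k_hidden, v_hidden}, assuming an even split
// of each hidden size across num_heads.
std::array<int64_t, 3> ComputeQkvHeadSizes(const std::array<int64_t, 3>& qkv_hidden_sizes,
                                           int64_t num_heads) {
  if (num_heads <= 0) {
    throw std::invalid_argument("num_heads must be positive");
  }
  // Q and K are multiplied together (Q * K^T), so they must share a hidden size.
  if (qkv_hidden_sizes[0] != qkv_hidden_sizes[1]) {
    throw std::invalid_argument("Q and K hidden sizes must be equal");
  }
  std::array<int64_t, 3> head_sizes{};
  for (int i = 0; i < 3; ++i) {
    // Each of Q, K, V must split evenly into num_heads heads.
    if (qkv_hidden_sizes[i] % num_heads != 0) {
      throw std::invalid_argument("hidden size must be divisible by num_heads");
    }
    head_sizes[i] = qkv_hidden_sizes[i] / num_heads;
  }
  return head_sizes;
}
```

For example, `qkv_hidden_sizes = {768, 768, 1536}` with 12 heads gives head sizes `{64, 64, 128}`. Q and K keep the same head size, while V is free to differ.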

Changes include:

- Modify unit tests to cover both the GPU and CPU implementations
- Add an overload of the CUDA kernel `AddBiasTransposeQKV` to support the case
where V_HIDDEN_SIZE != QK_HIDDEN_SIZE
- Update variable names from `head_size` to `qkv_head_sizes[0]` or
`qkv_head_sizes[2]`
- Modify function signatures so that `qkv_hidden_sizes` or `qkv_head_sizes`
can be passed through
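The sketch below is a CPU reference for what an add-bias-and-transpose step has to do when the V head size differs from the Q/K head size. It is meant to show the semantics only. It is not the CUDA `AddBiasTransposeQKV` overload from this change, whose signature is not shown here.

The layouts are assumptions:

- The packed input is `B x S x (2*N*H_qk + N*H_v)`.
- The bias holds one value per packed column.
- Each of Q, K and V is written out as `B x N x S x H`.

`AddBiasTransposeQkvReference` and `QkvOutputs` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical output holder: Q and K are B x N x S x H_qk, V is B x N x S x H_v.
struct QkvOutputs {
  std::vector<float> q, k, v;
};

// Hypothetical CPU reference (not the ONNX Runtime kernel): adds bias to a packed
// Q|K|V projection and transposes each part from B x S x (N*H) to B x N x S x H.
QkvOutputs AddBiasTransposeQkvReference(const std::vector<float>& packed,  // B x S x row
                                        const std::vector<float>& bias,    // row
                                        int batch, int seq_len, int num_heads,
                                        int qk_head_size, int v_head_size) {
  const int qk_hidden = num_heads * qk_head_size;
  const int v_hidden = num_heads * v_head_size;
  const int row = 2 * qk_hidden + v_hidden;  // packed width of one token: Q|K|V
  assert(static_cast<int>(bias.size()) == row);
  assert(packed.size() == static_cast<size_t>(batch) * seq_len * row);

  QkvOutputs out;
  out.q.resize(static_cast<size_t>(batch) * seq_len * qk_hidden);
  out.k.resize(static_cast<size_t>(batch) * seq_len * qk_hidden);
  out.v.resize(static_cast<size_t>(batch) * seq_len * v_hidden);

  // Copies the section starting at column `offset` (head size `h`) into dst as B x N x S x H.
  auto scatter = [&](std::vector<float>& dst, int offset, int h) {
    for (int b = 0; b < batch; ++b)
      for (int s = 0; s < seq_len; ++s)
        for (int n = 0; n < num_heads; ++n)
          for (int d = 0; d < h; ++d) {
            const int col = offset + n * h + d;
            const size_t src = (static_cast<size_t>(b) * seq_len + s) * row + col;
            const size_t dst_idx =
                ((static_cast<size_t>(b) * num_heads + n) * seq_len + s) * h + d;
            dst[dst_idx] = packed[src] + bias[col];
          }
  };

  scatter(out.q, 0, qk_head_size);              // Q section
  scatter(out.k, qk_hidden, qk_head_size);      // K section
  scatter(out.v, 2 * qk_hidden, v_head_size);   // V section uses its own head size
  return out;
}
```

The key point is that V is indexed with `v_head_size` throughout. A single shared `head_size`, as before this change, would compute the wrong offsets and strides for the V section once V_HIDDEN_SIZE != QK_HIDDEN_SIZE.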

Note that this feature is not yet supported in the ROCm EP or in quantized
attention.

**Motivation and Context**
- The current CUDA implementation of the Attention layer does not support the
`qkv_hidden_sizes` attribute, which was added to the CPU implementation in PR
[8039](https://github.com/microsoft/onnxruntime/pull/8039).

Co-authored-by: Peter Mcaughan <petermca@microsoft.com>
2022-10-11 00:25:47 -07:00
| Directory | Last commit | Date |
|---|---|---|
| cpu | Decouple use_sequence_as_input_ids from has_hidden_states (#13130) | 2022-09-29 22:45:52 -07:00 |
| cuda | Change head_size parameter dependent on qkv_hidden_size (#12933) | 2022-10-11 00:25:47 -07:00 |
| rocm | Change head_size parameter dependent on qkv_hidden_size (#12933) | 2022-10-11 00:25:47 -07:00 |