### Description
Based on https://github.com/microsoft/onnxruntime/pull/9700, and extends the change to ArgMin as well.
This pull request introduces several enhancements and fixes related to
the `ArgMax` and `ArgMin` operators in the CUDA execution provider. The
changes ensure proper handling of these operators across different
versions and improve kernel registration and fallback mechanisms.
Key changes include:
#### Enhancements to `ArgMax` and `ArgMin` Operators:
* Added new kernel class registrations for `ArgMax` and `ArgMin` for
different data types and versions in
`onnxruntime/core/providers/cuda/cuda_execution_provider.cc`.
* Introduced the `ArgMaxOrArgMinNeedFallbackToCPU` function to fall back to CPU when the `select_last_index` attribute is set to 1, since CUDA does not support this attribute (see the sketch below).
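A minimal sketch (the model and values below are made up) of the case the fallback guards: with `select_last_index=1`, the node is expected to be assigned to the CPU EP even when CUDA is requested.
```python
import numpy as np
import onnxruntime as ort
from onnx import TensorProto, helper

# Tiny model with ArgMax(select_last_index=1); the attribute exists since opset 12.
node = helper.make_node("ArgMax", ["x"], ["y"], axis=1, keepdims=1, select_last_index=1)
graph = helper.make_graph(
    [node], "argmax_select_last",
    [helper.make_tensor_value_info("x", TensorProto.FLOAT, [2, 4])],
    [helper.make_tensor_value_info("y", TensorProto.INT64, [2, 1])])
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 13)])

# The CUDA kernel does not implement select_last_index=1, so the new
# ArgMaxOrArgMinNeedFallbackToCPU check should place this node on the CPU EP.
sess = ort.InferenceSession(model.SerializeToString(),
                            providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
x = np.array([[1, 3, 3, 2], [0, 0, 5, 5]], dtype=np.float32)
print(sess.run(None, {"x": x})[0])  # [[2], [3]] -> last index of the max per row
```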
#### Macro and Kernel Registration Improvements:
* Replaced `REGISTER_KERNEL_UNTIL_VERSIONED_TYPED` with
`REGISTER_KERNEL_VERSIONED_RANGE_TYPED` and
`REGISTER_KERNEL_VERSIONED_SINCE_TYPED` macros for better version
handling.
* Updated kernel registration for `ArgMax` and `ArgMin` to use the new
macros, ensuring proper version handling and support for different data
types.
#### Safety Checks:
* Added safety checks in the `ArgMax` and `ArgMin` classes to ensure
`select_last_index` is not set to 1, as it is not supported on CUDA.
#### Testing Enhancements:
* Added new tests for `ArgMax` and `ArgMin` operators to verify behavior
when `select_last_index` is set to 0, ensuring compatibility with both
CPU and CUDA execution providers.
### Motivation and Context
Improve CUDA kernel coverage for the Stable Diffusion model and hence improve its performance on CUDA.
### Description
Add an I/O binding example using ONNX data types to the Python API summary. The API is available since the 1.20 release.
### Motivation and Context
Follow up of https://github.com/microsoft/onnxruntime/pull/22306 to add
some documentation.
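A minimal sketch of the kind of example being added. It assumes `OrtValue.ortvalue_from_shape_and_type` accepts an ONNX element type (available since 1.20 per this PR); the model path and the input/output names are hypothetical.
```python
import onnx
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
binding = sess.io_binding()

# Allocate device tensors using an ONNX element type instead of a numpy dtype,
# which is useful for types numpy does not have natively (e.g. bfloat16).
x = ort.OrtValue.ortvalue_from_shape_and_type([1, 4], onnx.TensorProto.BFLOAT16, "cuda", 0)
y = ort.OrtValue.ortvalue_from_shape_and_type([1, 4], onnx.TensorProto.BFLOAT16, "cuda", 0)

binding.bind_ortvalue_input("x", x)   # hypothetical input name
binding.bind_ortvalue_output("y", y)  # hypothetical output name
sess.run_with_iobinding(binding)
```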
### Description
This PR fixes an equation in the MatMulNBits op spec. The old formula is
stated as
```
[CeilDiv((N * n_blocks_per_col + 1) * bits, 8)]
```
but it should be stated as
```
[N * CeilDiv(n_blocks_per_col * bits, 8)]
```
or as
```
[N * FloorDiv((n_blocks_per_col + 1) * bits, 8)]
```
### Motivation and Context
For models such as ChatGLM where the column size is odd, the division
math can be off. For example:

With the old equation, the projections are calculated as follows.
```
# Down projection
B = 4,096 x 107 x 64
zero_points = 221,184
N = 4,096
n_blocks_per_col = 107
4,096 * CeilDiv((107 + 1) * 4, 8) = 4,096 * CeilDiv(108 * 4, 8) = 4,096 * 54 = 221,184
# Up projection
B = 13,696 x 32 x 64
zero_points = 219,136
N = 13,696
n_blocks_per_col = 32
13,696 * CeilDiv((32 + 1) * 4, 8) = 13,696 * CeilDiv(33 * 4, 8) = 13,696 * 17 = 232,832
```
With the new equation, the projections are calculated as follows.
```
# Down projection
B = 4,096 x 107 x 64
zero_points = 221,184
N = 4,096
n_blocks_per_col = 107
4,096 * CeilDiv(107 * 4, 8) = 4,096 * 54 = 221,184
# Up projection
B = 13,696 x 32 x 64
zero_points = 219,136
N = 13,696
n_blocks_per_col = 32
13,696 * CeilDiv(32 * 4, 8) = 13,696 * 16 = 219,136
```
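A small sketch of the corrected arithmetic, using the numbers above:
```python
import math

def zero_points_size(N, n_blocks_per_col, bits=4):
    # New formula: each column's zero points are packed independently,
    # so the ceiling division is applied per column.
    return N * math.ceil(n_blocks_per_col * bits / 8)

assert zero_points_size(4096, 107) == 221_184   # down projection
assert zero_points_size(13696, 32) == 219_136   # up projection
```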
`If` nodes can have sequence outputs. Those nodes are mapped to the DML EP so that the outputs can stay on the GPU, but they actually execute on the CPU by selecting either the `then` subgraph or the `else` subgraph.
Add `MLFloat16` support for:
- `LayerNormalization`
- `SimplifiedLayerNormalization`
- `SkipLayerNormalization`
- `SkipSimplifiedLayerNormalization`
There are existing `LayerNormTest` unit tests that cover the `MLFloat16`
functionality for `LayerNormalization` once `MLFloat16` is registered
(for example
[`LayerNormTest.LayerNorm_Scale_Float16Input`](91c916f9c6/onnxruntime/test/contrib_ops/layer_norm_op_test.cc (L112))).
Similarly, there are unit tests such as
[`SkipLayerNormTest.SkipLayerNormBatch1_Float16`](91c916f9c6/onnxruntime/test/contrib_ops/skiplayernorm_op_test.cc (L255))
that cover MLFloat16 inputs for `SkipLayerNormalization`.
### Description
* Add lintrunner to requirements-lintrunner.txt
* Lock lintrunner and lintrunner-adapter version
* Update documentation
### Motivation and Context
The document is not up to date.
### Description
This PR adds support for continuous decoding for batch_size = 1 input. GQA can now take arbitrary-length input by using seqlens_k as total_sequence_length - 1 and the sequence length of q/k/v as the new sequence length.
**This change will not affect the default behavior of GQA**
### Motivation and Context
Prior to this change it was impossible to support sequence_length > 1
inputs when past context was given. This use case is essential to making
continuous decoding work, which is one of our current efforts in
ORT-GenAI.
### Description
Implement softcap for GQA.
### Motivation and Context
Fixes certain models like Gemma-2 which need softcap to work so they don't output NaNs.
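For reference, a minimal NumPy sketch of how softcap is typically applied to attention logits (as in Gemma-2); the exact placement inside the GQA kernel follows the operator spec.
```python
import numpy as np

def apply_softcap(scores, softcap):
    # Squash logits smoothly into (-softcap, softcap) before softmax so that
    # large scores cannot blow up into inf/NaN during exponentiation.
    return softcap * np.tanh(scores / softcap)
```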
### Description
1. Added CUDA EP support for blocked quantization in QuantizeLinear and
DequantizeLinear ops.
2. Currently CUDA EP blocked quantization only supports int4/uint4
quantized types and float32/float16 unquantized types.
3. Added CUDA EP support in the QDQ selector/action transformer. CUDA EP is only added to the DQ + MatMul -> MatMulNBits rule; other rules' EP support is unchanged.
### Motivation and Context
ONNX opset 21 introduced blocked quantization for Q/DQ ops. ORT originally only supported blocked quantization on the CPU EP.
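A rough NumPy sketch of the blocked DequantizeLinear semantics the CUDA EP now supports (int4 values shown already unpacked to int8; assumes the quantized axis length is a multiple of block_size):
```python
import numpy as np

def blocked_dequantize(q, scale, zero_point, block_size, axis=0):
    # q: quantized values, e.g. int4 unpacked to int8, shape [K, N]
    # scale, zero_point: one value per block along `axis`, shape [K // block_size, N]
    s = np.repeat(scale, block_size, axis=axis)
    z = np.repeat(zero_point, block_size, axis=axis)
    return (q.astype(np.float32) - z.astype(np.float32)) * s
```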
### Description
Softmax (formula 1) is like the following:
```math
y_{i} = \frac{exp(x_{i})}{\sum_{i} exp(x_{i})}
```
After applying softmax, each element will be in the range of $(0, 1)$,
and the elements will add up to 1, so that they can be interpreted as
probabilities.
However, in language models, softmax has two issues:
* When all elements are -inf (for example, a whole row is masked when a query token is padding), the result is not defined, since exp(-inf)=0 and the formula above divides by zero.
* Why normalize in a way that treats every query word as equally important (each row sums to 1)?
**Smooth Softmax** (formula 2) is a modified version that introduces a
smooth factor like the following:
```math
s_{i} = \frac{exp(x_{i})}{1+ \sum_{i} exp(x_{i})}
```
This formula could tackle the above two issues:
* It handles the special case that all elements are -inf: the result $s_{i}$ is 0 for every element in that case.
* Sum of all elements $\sum_{i}{s_{i}} = \frac{\sum_{i}{exp(x_{i})}}{1+
\sum_{i} exp(x_{i})}$ is in the range of (0, 1), so that we can train
the model to assign different importance to different query words.
Since the exponential is prone to overflow or underflow, formula 3 can be used to get a stable result:
```math
s_{i} = \frac{exp(x_{i} + c)}{exp(c)+ \sum_{i} exp(x_{i} +c)}
```
In theory, c can be any value. In practice, the constant c should be chosen so that $exp(c)$ and $exp(x_{i} +c)$ do not overflow (or underflow) at the same time. A reasonable choice is formula 4:
```math
c=-\max_{i} \{ x_i \}
```
or apply the constraint c <= 0 as in formula 5:
```math
c=-\max(0, \max_{i} \{ x_i \})
```
The latter (formula 5) ensures that $s_{i}$ falls back to formula 2 when all elements are negative.
For the CPU provider, smooth softmax is implemented in MLAS. The CPU implementation uses formula 5.
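A NumPy sketch of smooth softmax using formula 5:
```python
import numpy as np

def smooth_softmax(x):
    # c = -max(0, max_i x_i); note the extra exp(c) term in the denominator.
    c = -np.maximum(0.0, np.max(x, axis=-1, keepdims=True))
    e = np.exp(x + c)
    return e / (np.exp(c) + np.sum(e, axis=-1, keepdims=True))

# When a whole row is -inf (fully masked query), the result is all zeros
# instead of NaN: c = 0, every exp(x_i) = 0, and the denominator is 1.
print(smooth_softmax(np.array([-np.inf, -np.inf, -np.inf])))  # [0. 0. 0.]
```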
@wangyems implemented the smooth softmax in flash attention for CUDA,
which requires Ampere or newer GPU. The implementation of smooth softmax
in flash attention uses formula 4.
---------
Co-authored-by: Ye Wang
### Description
### Motivation and Context
1. The Python API doc needs to be merged from a fork, but the 1ES self-hosted pool is restricted to one GitHub repo.
2. ubuntu-latest installs numpy 2.0 or later by default, and the current Python API doc generation doesn't support it, so numpy is pinned to < 2.0.0.
---------
### Description
Previously, MultiHeadAttention supported a relative position bias of shape [1, N, S, T] or [B, N, S, T], and DecoderMaskedMultiHeadAttention supported [1, N, S, T]. This extends the support to allow [1, N, S, T], [B, N, S, T], [B, 1, S, T] and [1, 1, S, T] for the CUDA and CPU EPs.
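A NumPy sketch of the broadcast semantics: the attention bias is added to the QK^T scores and may broadcast over the batch and/or head dimension.
```python
import numpy as np

B, N, S, T = 2, 4, 3, 5
scores = np.random.randn(B, N, S, T).astype(np.float32)   # QK^T / sqrt(d)

# Any of [B, N, S, T], [1, N, S, T], [B, 1, S, T], [1, 1, S, T] is accepted;
# plain NumPy broadcasting expresses the same rule.
attn_bias = np.random.randn(B, 1, S, T).astype(np.float32)  # broadcast over heads
assert (scores + attn_bias).shape == (B, N, S, T)
```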
- [x] Rename the input of "relative position bias" to "attention bias"
because it can also be used for other types of bias, like ALiBi
(Attention with Linear Biases) or attention mask.
- [x] Update unfused kernel to support broadcasting 2nd dimension of
attention bias.
- [x] Update efficient attention to support broadcasting 2nd dimension
of attention bias.
- [x] Update operators (MultiHeadAttention,
DecoderMaskedMultiHeadAttention, Attention, PackedAttention,
PackedMultiHeadAttention) to support broadcast attention bias on CUDA
and CPU EPs.
- [x] Update ROCm, DML and WebGPU naming to be consistent. (Note that
those EPs do not support broadcasting attention_bias for now).
- [x] Add attention bias tests for MultiHeadAttention.
- [x] Update operator documents
- [x] Update benchmark script
Other changes:
* Fix some checks in multihead-attention.ts
* Add helper functions to dump tensors given dimensions.
### Description
Add a gather that supports block-quantized input data.
### Motivation and Context
To support Web inference scenario with quantized vocabulary embeddings.
### Description
This PR registers the ReduceMin-20 operator to the DML EP.
### Description
Add OVEP features for 1.19
The PR includes:
- Support for EpCtx with ORT session options for optimized performance.
- Bug fixes.
- Support for OV 2024.3.
---------
Co-authored-by: ubuntu <ubuntu@ubuntu-mtlp-118727.iind.intel.com>
Co-authored-by: vthaniel <vishnudas.thaniel.s@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel.com>
Co-authored-by: saurabhkale17 <saurabh1.kale@intel.com>
Co-authored-by: Maheshkar <ankit.maheshkar@intel.com>
### Description
Introduces an ATen fallback for
`torch.nn.functional.scaled_dot_product_attention`. This operator was
introduced in torch 2.0 and, since then, has had many updates including
the implementation of memory efficient attention for V100 machines. The
current torchscript exporter exports a subgraph for attention which does
not provide the same memory savings that PyTorch's memory efficient
attention kernel provides. Allowing fallback to PyTorch ATen op for
attention helps mitigate memory spike issues for models leveraging
memory efficient attention.
### Motivation and Context
Memory issues arose when integrating ONNX Runtime Training with AML
Stable Diffusion.
---------
Co-authored-by: root <prathikrao@microsoft.com>
Add SparseAttention CPU implementation.
- [x] Refactoring GQAAttentionBase
- [x] Add SparseAttention implementation
- [x] Add test cases
This is the unfused version. A flash attention version will be added later.
### Description
Length checking is even more strict for packed batching input.
There are two cases for a batch of input_ids.
- padded sequences with inputs of equal length
```
|----********|
|------------|
|--------****|
|-***********|
```
- packed sequences with input_ids of different lengths
`|----|---------|----|-|`
The max_seq_length comes either from graph_inputs or from position_ids. In most cases, we cache the max_seq_length of rotary_cache in the model and share it among all layers.
---------
Co-authored-by: kailums <kalu@microsoft.com>
### Description
`Trilu<bool>` is used by Phi-3 when exported with torch.onnx.export.
### Description
- 4-bit QuantizeLinear(21). **Blocked quantization is still missing (i.e., the new `block_size` attribute is not supported).**
- 4-bit DequantizeLinear(21). **Blocked dequantization is still missing (i.e., the new `block_size` attribute is not supported).**
- 4-bit Transpose(21).
- Update quantization tool with int4 types.
- Disable QDQ fusions for 4-bit types. See:
https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selector_action_transformer.cc
- MLAS 4-bit quantization kernels for intel, neon, powerpc.
##### Notes
To calculate a tensor's storage size, we normally get the number of elements from the shape (i.e., `tensor_shape.Size()`) and multiply by the size of a single element. This does not directly work for sub-byte elements like int4, because each element of a `Tensor<Int4x2>` stores **two** packed int4 values in one byte. `Tensor::CalculateTensorStorageSize` should be called to perform the correct calculation for any tensor element type.
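An illustrative sketch of the packing rule that `Tensor::CalculateTensorStorageSize` accounts for:
```python
def int4_storage_bytes(shape):
    # Two int4 elements are packed into one byte, so the storage size is
    # ceil(num_elements / 2) rather than num_elements * sizeof(element).
    num_elements = 1
    for dim in shape:
        num_elements *= dim
    return (num_elements + 1) // 2

assert int4_storage_bytes([3, 5]) == 8  # 15 int4 values fit in 8 bytes
```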
### Motivation and Context
ONNX 1.16 added the int4 and uint4 types. This initial PR adds the int4
type to ORT and adds int4 implementations for the Quant, Dequant, and
Transpose ops on CPU EP. We still need to add int4 support for many ops
and execution providers. See the ONNX 1.16 release notes:
https://github.com/onnx/onnx/releases.
This PR includes the weight-stripped engine feature (thanks @moraxu for #20214), which is the major feature for the TRT 10 integration.
Two TRT EP options are added:
- `trt_weight_stripped_engine_enable`: Enable weight-stripped engine
build and refit.
- `trt_onnx_model_folder_path`: In the quick load case using embedded
engine model / EPContext mode, the original onnx filename is in the
node's attribute, and this option specifies the directory of that onnx
file if needed.
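A hedged sketch of setting the new options through Python provider options (the model path and folder path below are placeholders):
```python
import onnxruntime as ort

trt_options = {
    "trt_weight_stripped_engine_enable": True,
    # Directory of the original ONNX file, used to refit the weight-stripped
    # engine in the embedded-engine / EPContext quick-load case.
    "trt_onnx_model_folder_path": "/path/to/original/onnx/dir",
}
sess = ort.InferenceSession(
    "model_ctx.onnx",  # placeholder EPContext model
    providers=[("TensorrtExecutionProvider", trt_options), "CUDAExecutionProvider"],
)
```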
Normal weight-stripped engine workflow:

Weight-stripped engine and quick load workflow:

See the doc [here](https://onnxruntime.ai/docs/execution-providers/TensorRT-ExecutionProvider.html#tensorrt-ep-caches) for more information about the EPContext model.
---------
Co-authored-by: yf711 <yifanl@microsoft.com>
Co-authored-by: Ye Wang <52801275+wangyems@users.noreply.github.com>
Co-authored-by: Michal Guzek <moraxu@users.noreply.github.com>
Co-authored-by: pengwa <pengwa@microsoft.com>
Co-authored-by: wejoncy <wejoncy@163.com>
Co-authored-by: Yi Zhang <zhanyi@microsoft.com>
Co-authored-by: Yi Zhang <your@email.com>
Co-authored-by: Pranav Sharma <prs@microsoft.com>
Co-authored-by: Adam Pocock <adam.pocock@oracle.com>
Co-authored-by: cao lei <jslhcl@gmail.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: inisis <46103969+inisis@users.noreply.github.com>
Co-authored-by: Jeff Bloomfield <38966965+jeffbloo@users.noreply.github.com>
Co-authored-by: mo-ja <60505697+mo-ja@users.noreply.github.com>
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
Co-authored-by: Sumit Agarwal <sumitagarwal330@gmail.com>
Co-authored-by: Atanas Dimitrov <70822030+neNasko1@users.noreply.github.com>
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>
Co-authored-by: Dhruv Matani <dhruvbird@gmail.com>
Co-authored-by: Dhruv Matani <dhruv.matani@grammarly.com>
Co-authored-by: wangshuai09 <391746016@qq.com>
Co-authored-by: Xiaoyu <85524621+xiaoyu-work@users.noreply.github.com>
Co-authored-by: Xu Xing <xing.xu@intel.com>
Co-authored-by: Dmitri Smirnov <yuslepukhin@users.noreply.github.com>
Co-authored-by: Rachel Guo <35738743+YUNQIUGUO@users.noreply.github.com>
Co-authored-by: Sai Kishan Pampana <sai.kishan.pampana@intel.com>
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: Jian Chen <cjian@microsoft.com>
Co-authored-by: Shubham Bhokare <32080845+shubhambhokare1@users.noreply.github.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Andrew Fantino <15876180+afantino951@users.noreply.github.com>
Co-authored-by: Thomas Boby <thomas@boby.uk>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Michal Guzek <mguzek@nvidia.com>
Co-authored-by: George Wu <jywu@microsoft.com>
### Flash attn recompute
1. Allow PythonOp(FlashAttn) to be recomputed correctly.
45879ff5c2
2. Use JSON to pass the selected-to-recompute subgraphs.
3c374da678
#### Better Memory Efficiency
The customer model can run with both PyTorch SDPA and Flash Attention; this PR makes it possible for the Flash Attention path to work with ORTModule layerwise recompute. Peak memory drops from 45.x GB to 32.x GB when comparing only the layers (other pieces are not included; a few more optimizations targeting those pieces will follow later).
#### Better Perf
Using Flash Attention brings an additional 16% end-to-end time reduction, with a highly aligned loss curve.

#### Use JSON File to pass Recompute Plans
This overcomes the maximum-length limitation on strings defined in session options.
### Description
The InsertGatherBeforeSceLoss optimization is enabled when the label padding density is less than 90%, so we need to check the label padding density to decide whether to enable the optimization.
Before this PR, we checked the graph inputs and correlated one of them with the SCE node by iterating the graph from the SCE node back to a graph input. This is hard to generalize because there may be complicated patterns between the graph input and the SCE node.
This PR checks the padding density using the direct input of the SCE module, rather than the graph input, at the first graph execution when exporting the ONNX graph. If the density is < 90%, a flag PythonOp is inserted after the SCE node:
```
SoftmaxCrossEntropy
|
PythonOp (func_name: FlagAndPrintDensity) (insert if density < 90%)
|
Following graph
```
When InsertGatherBeforeSceLoss is invoked, it checks whether the flag PythonOp (func_name: FlagAndPrintDensity) follows the SCE node; if so, it removes the flag and performs the padding elimination optimization.
If the environment variable ORTMODULE_PRINT_INPUT_DENSITY is 1, the PythonOp (func_name: FlagAndPrintDensity) prints the input density each step; in this case the PythonOp is not removed.
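A rough sketch of the density computation described above (the actual check runs inside the exported graph via the FlagAndPrintDensity PythonOp; the ignore index value is an assumption):
```python
import torch

IGNORE_INDEX = -100  # assumed padding label value

def label_density(labels: torch.Tensor) -> float:
    # Percentage of valid (non-padding) labels; InsertGatherBeforeSceLoss is
    # only enabled when this density is below 90% (i.e., >10% padding).
    valid = (labels != IGNORE_INDEX).sum().item()
    return 100.0 * valid / labels.numel()
```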
### Description
Make the operator more flexible:
(1) Decouple the max sequence length of rotary cache, kv cache and block
mask. They are allowed to have different values.
(2) Replace the dense block_mask with CSR format (block_row_indices and block_col_indices) to improve performance (see the sketch below).
(3) Mark past_key and past_value as required inputs since we need them
to compute the shape of present_key and present_value.
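A NumPy sketch of converting a dense block mask to the CSR inputs (block_row_indices, block_col_indices); padding of the column buffer is an assumption of this sketch.
```python
import numpy as np

def block_mask_to_csr(block_mask):
    # block_mask: [num_layout, num_blocks, num_blocks] with 0/1 entries.
    num_layout, rows, _ = block_mask.shape
    row_indices = np.zeros((num_layout, rows + 1), dtype=np.int32)
    cols_per_layout = []
    for h in range(num_layout):
        cols = []
        for i in range(rows):
            cols.extend(np.nonzero(block_mask[h, i])[0].tolist())
            row_indices[h, i + 1] = len(cols)
        cols_per_layout.append(cols)
    max_nnz = max(len(c) for c in cols_per_layout)
    col_indices = np.zeros((num_layout, max_nnz), dtype=np.int32)
    for h, cols in enumerate(cols_per_layout):
        col_indices[h, :len(cols)] = cols
    return row_indices, col_indices
```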
### Motivation and Context
(1) LongRoPE has short and long rotary caches, which have different lengths.
(2) Most users do not have enough GPU memory to run the maximum sequence length of 128K. This change allows users to use a smaller kv cache length for testing without running out of memory.
### Description
There was a bug in GQA on CPU where, in the token case with batch_size > 1 and past_present_share_buffer off, the output would occasionally contain NaNs. This PR fixes that. It also updates documentation and fixes position id generation for rotary embedding on CUDA in the prompt case.
### Motivation and Context
This PR solves the GQA CPU bug, updates the documentation, and makes seqlens_k irrelevant for the prompt case, which is useful to prevent user error.
### Description
This PR registers DFT-20 to the DML EP.
### Description
Add CUDA implementation for block sparse attention for Phi-3-small.
Block sparse attention was proposed in [Sparse
Transformers](https://arxiv.org/pdf/1904.10509) by OpenAI, and also
adopted in [BigBird](https://arxiv.org/pdf/2007.14062) with different
sparse layout.
In Phi-3-small, the sparse layout is static, and works with
unidirectional (causal) attention.
Compared to dense attention, block sparse attention speeds up both training and inference. It can also save memory and thus support longer context lengths.
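A rough illustration of the kind of static block layout described (causal, with a window of `local_blocks` recent blocks plus periodic "vertical" blocks every `vert_stride`); the exact per-head layout used by the kernels may differ.
```python
import numpy as np

def make_block_layout(num_blocks, local_blocks, vert_stride):
    mask = np.zeros((num_blocks, num_blocks), dtype=np.int32)
    for i in range(num_blocks):
        for j in range(i + 1):                       # causal: only j <= i
            is_local = (i - j) < local_blocks        # recent block window
            is_strided = (j + 1) % vert_stride == 0  # periodic global blocks
            if is_local or is_strided:
                mask[i, j] = 1
    return mask
```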
- [x] Add operator spec and shape inference
- [x] Symbolic shape inference
- [x] Refactor GroupQueryAttention to expose common kernels for kv cache
concatenation, q/k/v transpose etc.
- [x] Add cuda kernel to convert block mask to CSR format
- [x] Add cuda kernel to generate position ids
- [x] Add compile script and template files to convert triton kernel to
cubin and dispatcher.
- [x] Add triton kernel v1 for prompt
- [x] Add triton kernel v2 for token generation and support padding
- [x] Update IO Binding Helper to allow buffer sharing.
- [x] Test relevance
- [x] Test performance
### Performance
Tested on A100-SXM4-80GB with `batch_size=4, num_heads=32, max_seq_len=8192, head_size=128, sparse_block_size=64, local_blocks=16, vert_stride=8, num_layout=8`.
We compare sparse attention to the corresponding GQA with a local attention window of size 1024, and to GQA with dense causal attention.
Average latency in milliseconds (for fused attention kernel used in
prompt prefilling):
seq_len | GQA-Dense | GQA-Local | SparseAttention
-- | -- | -- | --
64 | 0.0465 | 0.0722 | 0.0641
128 | 0.0618 | 0.0787 | 0.0672
256 | 0.1086 | 0.1076 | 0.0943
512 | 0.2535 | 0.2487 | 0.1676
1024 | 0.7042 | 0.7050 | 0.3800
2048 | 2.4125 | 1.9316 | 0.8966
4096 | 8.9346 | 4.5699 | 2.1129
8192 | 40.5401 | 10.3508 | 5.1748
Average latency in milliseconds (for the fused attention kernel used in token generation):
past_seq_len | GQA-Dense | GQA-Local | SparseAttention
-- | -- | -- | --
64 | 0.0186 | 0.0186 | 0.0870
128 | 0.0408 | 0.0466 | 0.1165
256 | 0.0530 | 0.0592 | 0.0988
512 | 0.0445 | 0.0447 | 0.1150
1024 | 0.0634 | 0.0640 | 0.1454
2048 | 0.1027 | 0.0637 | 0.1589
4096 | 0.1789 | 0.0631 | 0.1806
8192 | 0.3288 | 0.0655 | 0.2146
We can see that the kernel for token generation still has room for improvement.
#### Limitations
Only right-side padding and unidirectional attention are supported.
The following are not supported in the first version:
(1) Packed mode like PackedMultiHeadAttention, where padding has been removed from the input.
(2) Paged attention.
(3) Bidirectional attention.
(4) GPU compute capability other than 8.0, 8.6, and 8.9.
(5) Left-side padding.
Some of these limitations will be removed in the future (may be in a new
operator).