### Description
DML Implementation for
[com.microsoft.MatMulIntegerToFloat](https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md#com.microsoft.MatMulIntegerToFloat)
```
.\onnxruntime_test_all.exe --gtest_filter="*MatMulIntegerToFloat.*"
Note: Google Test filter = *MatMulIntegerToFloat.*
[==========] Running 22 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 22 tests from MatMulIntegerToFloat
[ RUN ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_S8S8
[ OK ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_S8S8 (620 ms)
[ RUN ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_S8S8
[ OK ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_S8S8 (497 ms)
[ RUN ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_S8S8
[ OK ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_S8S8 (488 ms)
[ RUN ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_S8S8
[ OK ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_S8S8 (503 ms)
[ RUN ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_U8U8
[ OK ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_U8U8 (495 ms)
[ RUN ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_U8U8
[ OK ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_U8U8 (488 ms)
[ RUN ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_U8U8
[ OK ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_U8U8 (492 ms)
[ RUN ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_U8X8
[ OK ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_U8X8 (502 ms)
[ RUN ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_S8U8
[ OK ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_S8U8 (452 ms)
[ RUN ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_S8U8
[ OK ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_S8U8 (454 ms)
[ RUN ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_S8U8
[ OK ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_S8U8 (446 ms)
[ RUN ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_S8U8
[ OK ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_S8U8 (508 ms)
[ RUN ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_U8S8
[ OK ] MatMulIntegerToFloat.HasZeroPoint_NoBias_test_U8S8 (456 ms)
[ RUN ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_U8S8
[ OK ] MatMulIntegerToFloat.NoZeroPoint_HasBias_test_U8S8 (455 ms)
[ RUN ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_U8S8
[ OK ] MatMulIntegerToFloat.NoZeroPoint_NoBias_test_U8S8 (447 ms)
[ RUN ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_U8S8
[ OK ] MatMulIntegerToFloat.HasZeroPoint_HasBias_test_U8S8 (465 ms)
[ RUN ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_U8U8
[ OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_U8U8 (111 ms)
[ RUN ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_U8S8
[ OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_U8S8 (115 ms)
[ RUN ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_S8S8
[ OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_S8S8 (114 ms)
[ RUN ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_S8U8
[ OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16_S8U8 (110 ms)
[ RUN ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16
[ OK ] MatMulIntegerToFloat.MatMulIntegerToFloat_FP16 (112 ms)
[ RUN ] MatMulIntegerToFloat.MatMulInteger_With_ZeroPoint
[ OK ] MatMulIntegerToFloat.MatMulInteger_With_ZeroPoint (337 ms)
[----------] 22 tests from MatMulIntegerToFloat (8679 ms total)
[----------] Global test environment tear-down
[==========] 22 tests from 1 test suite ran. (8680 ms total)
[ PASSED ] 22 tests.
memleakdbg:
----- No memory leaks detected -----
```
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
* Added `CalculateMatMulIntegerToFloat` to replace the CPU EP run as the test
reference (sketched below).
* Added more FP32 test cases to isolate all input data type combinations.
* Added fixed inputs to the `MatMulIntegerToFloat_FP16*` FP16 test cases.
* `onnxruntime/test/testdata/matmul_integer_to_float.py` is capable of
generating FP16 models, but we do not produce any for now.
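For reference, a minimal NumPy sketch (not the actual test helper) of the computation a reference such as `CalculateMatMulIntegerToFloat` is expected to reproduce: dequantize the integer inputs with their scales/zero points, matmul in float, then add the bias. Names, shapes, and the per-tensor-scale assumption are illustrative; the contrib op also allows 1-D scales/zero points.

```python
import numpy as np

def matmul_integer_to_float_ref(A, B, a_scale, b_scale, a_zp=0, b_zp=0, bias=None):
    """Reference: dequantize the integer inputs, matmul in float, then add bias."""
    a = (A.astype(np.int32) - np.int32(a_zp)).astype(np.float32) * a_scale
    b = (B.astype(np.int32) - np.int32(b_zp)).astype(np.float32) * b_scale
    y = a @ b
    if bias is not None:
        y = y + bias
    return y

A = np.random.randint(-128, 127, size=(2, 4), dtype=np.int8)
B = np.random.randint(0, 255, size=(4, 3), dtype=np.uint8)
y = matmul_integer_to_float_ref(A, B, a_scale=0.05, b_scale=0.1, a_zp=1, b_zp=128)
```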
I've added NHWC GridSample support to the CUDA EP to reduce the number
of layout transforms. I've also enabled the full set of GridSample tests
for all EPs, and added the opset 16 GridSample to the registered
kernels.
### Motivation and Context
This is the first PR in a series of enhancements to the CUDA EP that
improve NHWC support, avoiding costly layout transforms between NHWC
and NCHW nodes, which are layout sensitive. Testing was also quite
rudimentary for the CUDA EP, while it was thorough for the CPU path. I've
regenerated grid_sample_test.cc, enabling the tests for other platforms as
well. Those tests resurfaced #10607, which is fixed as well.
### ONNX Gelu Op in Opset 20
Refactor code to support the MS-domain Gelu and the ONNX opset-20 Gelu op.
1. Move the CPU Gelu implementation from
`onnxruntime/contrib_ops/cpu/activations.h/cc` to
`onnxruntime/core/providers/cpu/tensor/gelu.h/cc`, as the implementation
for the `approximate` attribute value 'none'.
2. Duplicate some logic from
`onnxruntime/contrib_ops/cpu/bert/bias_gelu.cc` to
`onnxruntime/core/providers/cpu/tensor/gelu.h/cc`, as the implementation
for the `approximate` attribute value 'tanh'.
3. Register the ONNX-domain Gelu CPU kernel from opset 20 in
`onnxruntime/core/providers/cpu/cpu_execution_provider.cc`.
4. Move `onnxruntime/contrib_ops/cuda/bert/fast_gelu_impl.h/cu` to
`onnxruntime/core/providers/cuda/tensor/gelu_impl.h` and
`onnxruntime/core/providers/cuda/tensor/gelu_approximate_impl.cu`
respectively, as the implementation for the `approximate` attribute value
'tanh'.
5. Implement the logic for the `approximate` attribute value 'none' in
`onnxruntime/core/providers/cuda/tensor/gelu_impl.cu`.
6. Register the ONNX-domain Gelu CUDA kernel from opset 20 in
`onnxruntime/core/providers/cuda/cuda_execution_provider.cc`.
7. ROCm EP-related changes.
8. Enrich the tests for the ONNX-domain Gelu in
`onnxruntime/test/providers/cpu/activation/activation_op_test.cc`.
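For context, a minimal NumPy sketch of the two `approximate` modes the refactored kernels implement, per the ONNX Gelu-20 definition; this is illustrative only, not the ORT code.

```python
import math
import numpy as np

def gelu_none(x):
    # approximate == 'none': exact Gelu, 0.5 * x * (1 + erf(x / sqrt(2)))
    return 0.5 * x * (1.0 + np.vectorize(math.erf)(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # approximate == 'tanh': 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    return 0.5 * x * (1.0 + np.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * np.power(x, 3))))

x = np.linspace(-3.0, 3.0, 7, dtype=np.float32)
print(gelu_none(x))
print(gelu_tanh(x))  # close to gelu_none, cheaper to evaluate
```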
### Description
Sqrt does not have BF16 support yet. This PR adds it.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
Adds bfloat16 as a supported dtype for SimplifiedLayerNormFusion, which
provides a speedup for Llama-v2 on A100 when using the bfloat16 numerical
format.
_layernorm_optimized_training.onnx exported in bfloat16 vs. float16:_

### Repro Instructions
```python
import torch
from torch import nn
from onnxruntime.training.ortmodule import ORTModule, DebugOptions, LogLevel

dtype = torch.bfloat16
# dtype = torch.float16

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(784, 10, dtype=dtype)
        self.layernorm = nn.LayerNorm([784], dtype=dtype)

    def forward(self, x):
        x = x.view(x.shape[0], -1)
        x = self.layernorm(x)
        x = self.fc(x)
        return x

model = Net()
model = ORTModule(model, DebugOptions(save_onnx=True, onnx_prefix='layernorm', log_level=LogLevel.INFO))
model.to("cuda")

images = torch.randn((8, 28, 28), dtype=dtype).to("cuda")
output = model(images)
```
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
ONNX Runtime integration with Llama-v2 family of LLMs.
---------
Co-authored-by: Prathik Rao <prathikrao@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
`ScatterElements` in opset 18 has been around for a while. However, the
highest opset supporting `ScatterElements` in ORT is 13. This PR
implements the op in the CUDA EP by replacing `assignment` in the current
CUDA kernel with `atomic reduction` (e.g., atomic add, atomic max); the
reduction semantics are sketched below. A series of fundamental atomic
functions (e.g., atomic max for int8_t and half) are implemented in
`common.cuh`; the implementation is general enough to cover both old and
new CUDA versions.
- The core changes are in `cuda/atomic/common.cuh`, with very detailed
documentation including a visualization of the bit-wise operations. They
are also copied to `rocm/atomic/common.cuh` to support AMD GPUs.
- `/cuda/tensor/gather_elements_impl.cu` contains small changes to call
the new atomic functions to support the new `reduction` behavior of the
new `ScatterElements`.
- The new `ScatterElements` versions are defined in
`rocm_execution_provider.cc` and `cuda_execution_provider.cc`.
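For reference, a minimal NumPy sketch of the opset-18 `ScatterElements` reduction semantics the kernel now has to honor (1-D case only; names are illustrative). With overlapping indices, plain assignment is order-dependent, which is why the CUDA kernel switches to atomic reductions for the `add`/`max`-style modes.

```python
import numpy as np

def scatter_elements_1d(data, indices, updates, reduction="none"):
    out = data.copy()
    for i, idx in enumerate(indices):
        if reduction == "none":
            out[idx] = updates[i]                    # plain assignment (order-dependent on overlap)
        elif reduction == "add":
            out[idx] += updates[i]                   # atomic add on GPU
        elif reduction == "max":
            out[idx] = max(out[idx], updates[i])     # atomic max on GPU
    return out

data = np.zeros(4, dtype=np.float32)
print(scatter_elements_1d(data, [1, 1, 3], [2.0, 5.0, 7.0], reduction="add"))  # [0. 7. 0. 7.]
```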
### Description
Implement Pad-18 for CUDA.
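For context, a minimal NumPy sketch of the main opset-18 addition: an optional `axes` input, so `pads` only needs begin/end entries for the listed axes. Names are illustrative and only constant mode is shown.

```python
import numpy as np

def pad_18_constant(data, pads, axes=None, value=0.0):
    # pads holds [begin..., end...] for the listed axes only (all axes if axes is None).
    axes = list(range(data.ndim)) if axes is None else [a % data.ndim for a in axes]
    full = [(0, 0)] * data.ndim
    n = len(axes)
    for i, a in enumerate(axes):
        full[a] = (pads[i], pads[i + n])
    return np.pad(data, full, mode="constant", constant_values=value)

x = np.ones((2, 3), dtype=np.float32)
y = pad_18_constant(x, pads=[1, 2], axes=[1])  # pad only axis 1: 1 before, 2 after -> shape (2, 6)
```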
### Motivation and Context
The latest models converted by Dynamo fall back to the CPU for Pad, with
performance degradation.
This contributes to
https://github.com/microsoft/onnx-rewriter/issues/126
### Description
These changes add rotary embedding and a packed QKV input to GQA. As of
now, the changes are only supported with Flash Attention (SM >= 80), but
they should soon be supported with Memory Efficient Attention as well.
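For reference, a minimal NumPy sketch of a rotary embedding applied to one query/key head (non-interleaved, rotate-half style); the layout and names are illustrative, not the fused kernel.

```python
import numpy as np

def rotary_embedding(x, positions, base=10000.0):
    # x: (seq_len, head_dim) slice of Q or K for one head; rotate (x1, x2) pairs by position-dependent angles.
    head_dim = x.shape[-1]
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    angles = np.outer(positions, inv_freq)                # (seq_len, head_dim / 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., : head_dim // 2], x[..., head_dim // 2 :]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

q = np.random.randn(8, 64).astype(np.float32)             # (seq_len, head_dim)
q_rot = rotary_embedding(q, positions=np.arange(8))
```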
### Motivation and Context
With the fusion of rotary embedding into this attention op, we hope to
observe some perf gain. The packed QKV should also provide some perf
gain for certain models, like Llama2, that benefit from running ops on
the fused QKV matrix rather than on the separate Q, K, and V.
---------
Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>
### Description
<!-- Describe your changes. -->
Add `temperature` as an input to the WhisperBeamSearch op and initialize
it correctly in parameter setup.
### Motivation and Context
Currently, temperature is included as an attribute of the BeamSearch op,
which doesn't let the model act dynamically within a single inference
session. By including this variable as an input, the temperature value
can be altered in any inference call (important for 1P teams).
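For context, a minimal NumPy sketch of how a temperature value changes the token distribution per call (illustrative only, not the WhisperBeamSearch implementation).

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    # Higher temperature flattens the distribution; lower temperature sharpens it.
    scaled = logits / temperature
    scaled -= scaled.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum(axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.1])
print(softmax_with_temperature(logits, temperature=1.0))
print(softmax_with_temperature(logits, temperature=2.0))  # flatter; can now differ per inference call
```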
---------
Co-authored-by: Peter McAughan <petermca@microsoft.com>
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
Co-authored-by: Kunal Vaishnavi <kvaishnavi@microsoft.com>
### Description
<!-- Describe your changes. -->
Register DML operators for opset 19.
- Cast19
- CastLike19
- Constant19
- Equal19
- Identity19
- QuantizeLinear19
- DequantizeLinear19
- Reshape19
- Shape19
- Size
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: linnealovespie <linneamay@microsoft.com>
### Description
<!-- Describe your changes. -->
1. Support causal mask in the MHA CPU kernel.
2. Support custom rotary_dim in rotary_emb.
3. Add bf16 support for rotary_emb.
4. Fix a bug in attention rotary.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
Implements LabelEncoder as per `ai.onnx.ml` opset 4 for the upcoming
ONNX 1.15 release. ~~This currently depends on a new ONNX release
candidate and so is marked as draft in the meantime.~~
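For reference, a hedged sketch of the LabelEncoder behavior: a key/value mapping with a default for unmatched keys (the `ai.onnx.ml` opset-4 revision broadens the supported key/value types, e.g., via tensor-valued attributes). Names below are illustrative, not the ORT kernel.

```python
import numpy as np

def label_encoder(x, keys, values, default):
    # Map each element of x via keys -> values, falling back to default.
    mapping = dict(zip(keys, values))
    return np.array([mapping.get(v, default) for v in x])

x = np.array(["cat", "dog", "fish"])
print(label_encoder(x, keys=["cat", "dog"], values=[0, 1], default=-1))  # [ 0  1 -1]
```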
### Motivation and Context
Closes https://github.com/microsoft/onnxruntime/issues/17602
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
ReduceMax/ReduceMin have been updated in ONNX opset 20. Implement them in ORT.
### Motivation and Context
This is for the ORT 1.17.0 release.
---------
Signed-off-by: Liqun Fu <liqfu@microsoft.com>
### Description
DFT is updated in opset 20. Implement it in ORT.
### Motivation and Context
This is for the ORT 1.17.0 release.
Fixes #17723
---------
Signed-off-by: Liqun Fu <liqfu@microsoft.com>
### Description
<!-- Describe your changes. -->
Add bfloat16 support for `MatMulBnb4` contrib op. This is useful for
QLoRA fine-tuning.
- On GPUs with SM80+ (A100, etc.), it uses the native CUDA bfloat16
dtype, `nv_bfloat16`. On other GPUs, it uses the onnxruntime `BFloat16`
type, which uses float for compute.
- I have validated the op in a llama2-7b training scenario. The losses
match pytorch training and the training throughput is better.
- Cannot add a bfloat16 case in the op unit test since casting BFloat16
to and from float multiple times during the test causes the required
tolerances to be unachievable.
The custom autograd function exporter in onnxruntime-training is updated
to support the latest version of bitsandbytes. They changed how the
`quant_state` is stored.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Enable QLoRA fine-tuning with bfloat16.
### Description
<!-- Describe your changes. -->
1. Introduce the MoE CUDA op to ORT, based on the FT implementation.
2. Upgrade cutlass to 3.1.0 to avoid some build failures on Windows.
Remove the patch file for cutlass 3.0.0.
3. The sharded MoE implementation will come in another PR.

Limitation: `__CUDA_ARCH__` >= 700
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
Registers the BFloat16 datatype as a valid input type for the CUDA Neg kernel.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Enabling `meta-llama/Llama-2-70b` to be finetuned with ONNX Runtime
training.
---------
Co-authored-by: Prathik Rao <prathikrao@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
### Description
GQA now works only with Flash Attention, with an attention mask input,
allowing for batched input. Note: this PR disables Memory Efficient
Attention, allowing only the Flash Attention kernel to be used.
### Motivation and Context
Allows GQA to work with batched input.
---------
Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>
This is a graph implementation of RotaryEmbedding since there's no time
to add it to DML before 1.16.2, but it eventually should move into
DirectML since we're bandwidth-bound.
### Description
<!-- Describe your changes. -->
Adds bfloat16 as a valid input parameter type for the Where node for ONNX
opset 16+.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Enabling `meta-llama/Llama-2-70b` to be finetuned with ONNX Runtime
training.
---------
Co-authored-by: Prathik Rao <prathikrao@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
* Add a new operator, SkipGroupNorm, to support skip and bias inputs (its computation is sketched after this section).
* Update the GroupNorm kernel to support the number of channels used in the SD XL Refiner.
* Add epsilon in the kernel.
* Add parity and performance test scripts.
* Remove many limitations, including max batch size, max number of groups, `c % cPerBlock == 0`, etc.
### Motivation and Context
Update GroupNorm to support SD XL Refiner and beyond.
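As mentioned above, a minimal NumPy sketch of the assumed SkipGroupNorm computation: group normalization applied to `x + skip + bias`, with the sum optionally returned as a second output. The NHWC layout, names, and the optional second output are illustrative assumptions, not the exact contrib-op spec.

```python
import numpy as np

def skip_group_norm(x, skip, bias, gamma, beta, num_groups, epsilon=1e-5):
    # x, skip: (N, H, W, C); bias, gamma, beta: (C,). Normalize x + skip + bias per channel group.
    s = x + skip + bias                                    # assumed optional second output
    n, h, w, c = s.shape
    g = s.reshape(n, h, w, num_groups, c // num_groups)
    mean = g.mean(axis=(1, 2, 4), keepdims=True)
    var = g.var(axis=(1, 2, 4), keepdims=True)
    y = ((g - mean) / np.sqrt(var + epsilon)).reshape(n, h, w, c)
    return y * gamma + beta, s

x = np.random.randn(1, 8, 8, 32).astype(np.float32)
y, s = skip_group_norm(x, np.zeros_like(x), np.zeros(32, np.float32),
                       np.ones(32, np.float32), np.zeros(32, np.float32), num_groups=4)
```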
### Description
Add support for Gemm with float8 as a contrib op.
---------
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>
Co-authored-by: Xavier Dupre <xadupre@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
### Description
Opset 18 applies the "axes as input" change from ReduceSum to all the
other reduce ops. Our CUDA kernels actually support it, but we didn't
enable it for opset 18. This PR updates the reduce ops' kernel
registration to enable the "axes as input" behavior for opset 18.
As part of the fix, I also simplified the reduce op kernel registration.
ORT doesn't require the kernel definition to match the ONNX op definition
exactly. In our case, where we share the same kernel for all the reduce op
versions (from version 1 to version 18), we don't need to maintain
different versions of the kernel definition; we can simplify it by using a
single kernel definition for multiple versions. Although in some cases we
might register more types for legacy versions, it is harmless: the
framework uses the schema to validate the graph, not the kernel
definition.
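For reference, a small sketch of the "axes as input" change using `onnx.helper` (node construction only; names are illustrative). Before opset 18, the non-ReduceSum reduce ops took `axes` as an attribute; from opset 18 they take it as an optional second input, as ReduceSum has since opset 13.

```python
from onnx import helper

# Opset 13 style: axes is an attribute.
node_13 = helper.make_node("ReduceMax", inputs=["data"], outputs=["reduced"],
                           axes=[1], keepdims=1)

# Opset 18 style: axes is an optional second input.
node_18 = helper.make_node("ReduceMax", inputs=["data", "axes"], outputs=["reduced"],
                           keepdims=1)
```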
---------
Co-authored-by: Cheng Tang <chenta@a100.crj0ad2y1kku1j4yxl4sj10o4e.gx.internal.cloudapp.net>
Co-authored-by: Cheng Tang <chenta@microsoft.com>
### Description
Add a contrib op MatMulBnb4 (FP4 and NF4) and a related toolchain to
support weight quantization.
This PR adds:
- a schema for the contrib op MatMulBnb4, which can support FP4 (4-bit
floating point) and NF4 (4-bit NormalFloat) weight quantization.
- a naive implementation of MatMulBnb4 on CPU and GPU, i.e.,
implemented like MatMul(A, Dequantize(B)) (sketched below).
- a special GemV implementation for MatMulBnb4 and a related benchmark
tool.
- a tool to quantize models to FP4 or NF4.
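For reference, a hedged NumPy sketch of the naive MatMul(A, Dequantize(B)) path: 4-bit codes are mapped through a codebook (FP4 or NF4) and scaled by a per-block absmax scale before a regular matmul. The codebook values and block layout below are illustrative placeholders, not the actual bitsandbytes tables.

```python
import numpy as np

def dequantize_bnb4(codes, absmax, codebook, block_size, shape):
    # codes: flattened 4-bit indices; absmax: one scale per block of `block_size` elements.
    values = codebook[codes].astype(np.float32)            # look up normalized values
    values = values.reshape(-1, block_size) * absmax[:, None]
    return values.reshape(shape)

def matmul_bnb4_naive(A, codes, absmax, codebook, block_size, b_shape):
    return A @ dequantize_bnb4(codes, absmax, codebook, block_size, b_shape)

# Illustrative 16-entry codebook standing in for the FP4/NF4 tables.
codebook = np.linspace(-1.0, 1.0, 16, dtype=np.float32)
codes = np.random.randint(0, 16, size=(4 * 8,))
absmax = np.random.rand(2).astype(np.float32)              # 32 elements / block_size 16 = 2 blocks
A = np.random.randn(3, 4).astype(np.float32)
y = matmul_bnb4_naive(A, codes, absmax, codebook, block_size=16, b_shape=(4, 8))
```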
### Description
<!-- Describe your changes. -->
Add a contrib op MatMulNBits and a related toolchain to support weight
quantization. This PR only adds support for 4 bits. It:
- adds a schema for the contrib op MatMulNBits, which can support 1-7 bit
weight quantization.
- adds a naive implementation of 4-bit MatMulNBits on CPU and GPU, i.e.,
implemented like MatMul(A, Dequantize(B)).
- adds a special GemV implementation for 4-bit MatMulNBits and a related
benchmark tool.
- adds a tool to quantize models to 4 bits.
Next:
- add general and more efficient kernels for 4-bit MatMulNBits on CPU
and GPU.
Support cross QK in beam search for the Whisper model and related features.
Make the Whisper exporting tools support cross QK and some related features:
* extra_decoding_ids
* no_speech_prob

Implement a DTW kernel and an unfold-tensor kernel with unit tests. Several
fixes related to multiple sessions running in parallel, like:
* guard multihead_attention, fused_fp16_runner_
* some memory allocations with stream awareness
* add the use_ep_level_unified_stream option
### Description
This is for ORT 1.17.0: make ORT use the ONNX release 1.15.0 branch. We will eventually update to the release tag once ONNX 1.15.0 is released.
### Motivation and Context
Prepare for the ORT 1.17.0 release. People can start working on new and updated ONNX ops in ORT.
---------
Signed-off-by: Liqun Fu <liqfu@microsoft.com>
Being able to leverage I/O binding for DML and registering `If` for the
DML EP allows us to avoid copying the past/present key/values back and
forth between the CPU and the GPU after every token.
This gives us a 25% performance increase for Dolly V2 with 128 tokens on
an RTX 4090.
### Description
The [ONNX
standard](https://github.com/onnx/onnx/blob/main/docs/Operators.md#type-constraints-181)
permits the `Unique` operator to have a `double` input tensor element
type; however, this was not supported in onnxruntime. This PR enables
this kernel.
### Motivation and Context
The lack of support for `float64` currently forces users to cast to
`float32` instead. This loss of precision can be severely problematic in
feature engineering pipelines downstream of the `Unique` operator. It
would be good to prevent this by updating ORT to reflect the standard
and support `double` input tensors.
---------
Signed-off-by: Aditya Goel <agoel4512@gmail.com>
This fixes the type lists used to register DML kernels for the Microsoft-domain
QuantizeLinear and DequantizeLinear. These previously did not
include FP16 and incorrectly used the same type list for both operators.
The new type lists match those of the opset 19 ONNX operators, which
aren't implemented yet in the DML EP.