## Supported Operators and Data Types
*This file is automatically generated from the registered kernels by [this script](https://github.com/microsoft/onnxruntime/blob/main/tools/python/gen_opkernel_doc.py).
Do not modify directly.*
## Execution Providers
- [CPUExecutionProvider](#cpuexecutionprovider)
- [CUDAExecutionProvider](#cudaexecutionprovider)
- [DmlExecutionProvider](#dmlexecutionprovider)
---------------
<a name="cpuexecutionprovider"/>
## Operators implemented by CPUExecutionProvider
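The OpSet Version column below uses three notations. `13+` means that opset and every later one. `[6, 12]` means a range of opsets. A bare number such as `10` means exactly that opset. The Types Supported column lists one type constraint per line-break tag. The following is a minimal Python sketch for checking rows programmatically. It assumes bracketed ranges are inclusive at both ends. The helper names `covers_opset`, `pick_version`, and `parse_type_constraints` are hypothetical and are not part of onnxruntime or of the generator script. Running the file prints `[11, 12]` for the ArgMax rows at opset 11, followed by the parsed AffineGrid constraints.

```python
from __future__ import annotations

import re

# Hypothetical helpers for reading this table; they are not part of
# onnxruntime or of the gen_opkernel_doc.py generator script.


def covers_opset(spec: str, opset: int) -> bool:
    """Return True if an OpSet Version cell ('13+', '[6, 12]', '10') covers opset."""
    spec = spec.strip()
    if spec.endswith("+"):
        # "13+": registered from opset 13 onward (open-ended).
        return opset >= int(spec[:-1])
    match = re.fullmatch(r"\[\s*(\d+)\s*,\s*(\d+)\s*\]", spec)
    if match:
        # "[6, 12]": assumed to be an inclusive range of opset versions.
        first, last = int(match.group(1)), int(match.group(2))
        return first <= opset <= last
    # A bare number such as "10": exactly that one opset version.
    return opset == int(spec)


def pick_version(specs: list[str], opset: int) -> str | None:
    """Return the first OpSet Version cell among an operator's rows that covers opset."""
    for spec in specs:
        if covers_opset(spec, opset):
            return spec
    return None


def parse_type_constraints(cell: str) -> dict[str, list[str]]:
    """Split a Types Supported cell into {constraint name: [allowed types]}."""
    constraints: dict[str, list[str]] = {}
    # Constraints are separated by line-break tags; tolerate stray spaces in them.
    for part in re.split(r"<\s*br\s*/?\s*>", cell):
        match = re.match(r"\s*\*\*(\w+)\*\*\s*=\s*(.*)", part)
        if match:
            types = [t.strip() for t in match.group(2).split(",") if t.strip()]
            constraints[match.group(1)] = types
    return constraints


if __name__ == "__main__":
    # OpSet Version cells of the ArgMax rows in the CPU table.
    argmax_specs = ["13+", "[11, 12]", "[1, 10]"]
    print(pick_version(argmax_specs, 11))
    # Types Supported cell of the AffineGrid row.
    print(parse_type_constraints(
        "**T1** = tensor(double), tensor(float)<br/> **T2** = tensor(int64)"))
```
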
| Op Name | Parameters | OpSet Version | Types Supported |
|---------|------------|---------------|-----------------|
|**Operator Domain:** *ai.onnx* ||||
|Abs|*in* X:**T**<br> *out* Y:**T**|13+|**T** = tensor(double), tensor(float), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[6, 12]|**T** = tensor(double), tensor(float), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Acos|*in* input:**T**<br> *out* output:**T**|7+|**T** = tensor(float)|
|Acosh|*in* input:**T**<br> *out* output:**T**|9+|**T** = tensor(float)|
|Add|*in* A:**T**<br> *in* B:**T**<br> *out* C:**T**|14+|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)|
|||13|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)|
|||[7, 12]|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)|
|Affine|*in* X:**T**<br> *out* Y:**T**|1+|**T** = tensor(float)|
|AffineGrid|*in* theta:**T1**<br> *in* size:**T2**<br> *out* grid:**T1**|20+|**T1** = tensor(double), tensor(float)<br/> **T2** = tensor(int64)|
|And|*in* A:**T**<br> *in* B:**T**<br> *out* C:**T1**|7+|**T** = tensor(bool)<br/> **T1** = tensor(bool)|
|ArgMax|*in* data:**T**<br> *out* reduced:**tensor(int64)**|13+|**T** = tensor(double), tensor(float), tensor(int32), tensor(int8), tensor(uint8)|
|||[11, 12]|**T** = tensor(double), tensor(float), tensor(int32), tensor(int8), tensor(uint8)|
|||[1, 10]|**T** = tensor(float), tensor(int32), tensor(int8), tensor(uint8)|
|ArgMin|*in* data:**T**<br> *out* reduced:**tensor(int64)**|13+|**T** = tensor(double), tensor(float), tensor(int32)|
|||[11, 12]|**T** = tensor(double), tensor(float), tensor(int32)|
|||[1, 10]|**T** = tensor(float), tensor(int32)|
|Asin|*in* input:**T**<br> *out* output:**T**|7+|**T** = tensor(float)|
|Asinh|*in* input:**T**<br> *out* output:**T**|9+|**T** = tensor(float)|
|Atan|*in* input:**T**<br> *out* output:**T**|7+|**T** = tensor(float)|
|Atanh|*in* input:**T**<br> *out* output:**T**|9+|**T** = tensor(float)|
|AveragePool|*in* X:**T**<br> *out* Y:**T**|19+|**T** = tensor(float)|
|||[11, 18]|**T** = tensor(float)|
|||10|**T** = tensor(float)|
|||[7, 9]|**T** = tensor(float)|
|BatchNormalization|*in* X:**T**<br> *in* scale:**T**<br> *in* B:**T**<br> *in* input_mean:**U**<br> *in* input_var:**U**<br> *out* Y:**T**<br> *out* running_mean:**U**<br> *out* running_var:**U**<br><br> or<br><br> *in* X:**T**<br> *in* scale:**T**<br> *in* B:**T**<br> *in* mean:**T**<br> *in* var:**T**<br> *out* Y:**T**<br> *out* mean:**T**<br> *out* var:**T**<br> *out* saved_mean:**T**<br> *out* saved_var:**T**<br><br> or<br><br> *in* X:**T**<br> *in* scale:**T1**<br> *in* B:**T1**<br> *in* input_mean:**T2**<br> *in* input_var:**T2**<br> *out* Y:**T**<br> *out* running_mean:**T2**<br> *out* running_var:**T2**|15+|**T** = tensor(double), tensor(float)<br/> **T1** = tensor(double), tensor(float)<br/> **T2** = tensor(double), tensor(float)|
|||14|**T** = tensor(double), tensor(float)<br/> **U** = tensor(double), tensor(float)|
|||[9, 13]|**T** = tensor(double), tensor(float)|
|||[7, 8]|**T** = tensor(double), tensor(float)|
|BitShift|*in* X:**T**<br> *in* Y:**T**<br> *out* Z:**T**|11+|**T** = tensor(uint32), tensor(uint64), tensor(uint8)|
|BitwiseAnd|*in* A:**T**<br> *in* B:**T**<br> *out* C:**T**|18+|**T** = tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|BitwiseNot|*in* X:**T**<br> *out* Y:**T**|18+|**T** = tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|BitwiseOr|*in* A:**T**<br> *in* B:**T**<br> *out* C:**T**|18+|**T** = tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|BitwiseXor|*in* A:**T**<br> *in* B:**T**<br> *out* C:**T**|18+|**T** = tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|BlackmanWindow|*in* size:**T1**<br> *out* output:**T2**|17+|**T1** = tensor(int32), tensor(int64)<br/> **T2** = tensor(double), tensor(float), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Cast|*in* input:**T1**<br> *out* output:**T2**|21+|**T1** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T2** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[19, 20]|**T1** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T2** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[13, 18]|**T1** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T2** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[6, 12]|**T1** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T2** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Ceil|*in* X:**T**<br> *out* Y:**T**|13+|**T** = tensor(double), tensor(float)|
|||[6, 12]|**T** = tensor(double), tensor(float)|
|Celu|*in* X:**T**<br> *out* Y:**T**|12+|**T** = tensor(float)|
|Clip|*in* input:**T**<br> *in* min:**T**<br> *in* max:**T**<br> *out* output:**T**<br><br> or<br><br> *in* input:**T**<br> *out* output:**T**|13+|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64), tensor(int8), tensor(uint32), tensor(uint64), tensor(uint8)|
|||12|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64), tensor(int8), tensor(uint32), tensor(uint64), tensor(uint8)|
|||11|**T** = tensor(float)|
|||[6, 10]|**T** = tensor(float)|
|Col2Im|*in* input:**T**<br> *in* image_shape:**tensor(int64)**<br> *in* block_shape:**tensor(int64)**<br> *out* output:**T**|18+|**T** = tensor(float)|
|Compress|*in* input:**T**<br> *in* condition:**T1**<br> *out* output:**T**|11+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T1** = tensor(bool)|
|||[9, 10]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T1** = tensor(bool)|
|Concat|*in* inputs:**T**<br> *out* concat_result:**T**|13+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[11, 12]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[4, 10]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|ConcatFromSequence|*in* input_sequence:**S**<br> *out* concat_result:**T**|11+|**S** = seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(string)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8))|
|ConstantOfShape|*in* input:**T1**<br> *out* output:**T2**|21+|**T1** = tensor(int64)<br/> **T2** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||20|**T1** = tensor(int64)<br/> **T2** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[9, 19]|**T1** = tensor(int64)<br/> **T2** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Conv|*in* X:**T**<br> *in* W:**T**<br> *in* B:**T**<br> *out* Y:**T**|11+|**T** = tensor(float)|
|||[1, 10]|**T** = tensor(float)|
|ConvInteger|*in* x:**T1**<br> *in* w:**T2**<br> *in* x_zero_point:**T1**<br> *in* w_zero_point:**T2**<br> *out* y:**T3**|10+|**T1** = tensor(uint8)<br/> **T2** = tensor(uint8)<br/> **T3** = tensor(int32)|
|ConvTranspose|*in* X:**T**<br> *in* W:**T**<br> *in* B:**T**<br> *out* Y:**T**|11+|**T** = tensor(float)|
|||[1, 10]|**T** = tensor(float)|
|Cos|*in* input:**T**<br> *out* output:**T**|7+|**T** = tensor(float)|
|Cosh|*in* input:**T**<br> *out* output:**T**|9+|**T** = tensor(float)|
|Crop|*in* input:**T**<br> *out* output:**T**|1+|**T** = tensor(float)|
|CumSum|*in* x:**T**<br> *in* axis:**T2**<br> *out* y:**T**|14+|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)<br/> **T2** = tensor(int32), tensor(int64)|
|||[11, 13]|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)<br/> **T2** = tensor(int32), tensor(int64)|
|DFT|*in* input:**T1**<br> *in* dft_length:**T2**<br> *in* axis:**tensor(int64)**<br> *out* output:**T1**<br><br> or<br><br> *in* input:**T1**<br> *in* dft_length:**T2**<br> *out* output:**T1**|20+|**T1** = tensor(double), tensor(float)<br/> **T2** = tensor(int32), tensor(int64)|
|||[17, 19]|**T1** = tensor(double), tensor(float)<br/> **T2** = tensor(int32), tensor(int64)|
|DepthToSpace|*in* input:**T**<br> *out* output:**T**|13+|**T** = tensor(double), tensor(float)|
|||[11, 12]|**T** = tensor(double), tensor(float)|
|||[1, 10]|**T** = tensor(double), tensor(float)|
|DequantizeLinear|*in* x:**T**<br> *in* x_scale:**tensor(float)**<br> *in* x_zero_point:**T**<br> *out* y:**tensor(float)**<br><br> or<br><br> *in* x:**T1**<br> *in* x_scale:**T2**<br> *in* x_zero_point:**T1**<br> *out* y:**T2**|21+|**T1** = tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int16), tensor(int32), tensor(int8), tensor(uint16), tensor(uint8)<br/> **T2** = tensor(float), tensor(float16)|
|||[19, 20]|**T1** = tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int32), tensor(int8), tensor(uint8)<br/> **T2** = tensor(float), tensor(float16)|
|||[13, 18]|**T** = tensor(int32), tensor(int8), tensor(uint8)|
|||[10, 12]|**T** = tensor(int32), tensor(int8), tensor(uint8)|
|Det|*in* X:**T**<br> *out* Y:**T**|11+|**T** = tensor(float)|
|Div|*in* A:**T**<br> *in* B:**T**<br> *out* C:**T**|14+|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)|
|||13|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)|
|||[7, 12]|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)|
|Dropout|*in* data:**T**<br> *in* ratio:**T1**<br> *in* training_mode:**T2**<br> *out* output:**T**<br> *out* mask:**T2**<br><br> or<br><br> *in* data:**T**<br> *out* output:**T**<br> *out* mask:**T**<br><br> or<br><br> *in* data:**T**<br> *out* output:**T**<br> *out* mask:**T1**|13+|**T** = tensor(double), tensor(float)<br/> **T1** = tensor(double), tensor(float)<br/> **T2** = tensor(bool)|
|||12|**T** = tensor(double), tensor(float)<br/> **T1** = tensor(double), tensor(float)<br/> **T2** = tensor(bool)|
|||[10, 11]|**T** = tensor(double), tensor(float), tensor(float16)<br/> **T1** = tensor(bool)|
|||[7, 9]|**T** = tensor(double), tensor(float), tensor(float16)|
|DynamicQuantizeLinear|*in* x:**T1**<br> *out* y:**T2**<br> *out* y_scale:**tensor(float)**<br> *out* y_zero_point:**T2**|11+|**T2** = tensor(uint8)|
|DynamicSlice|*in* data:**T**<br> *in* starts:**Tind**<br> *in* ends:**Tind**<br> *in* axes:**Tind**<br> *out* output:**T**|1+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **Tind** = tensor(int32), tensor(int64)|
|Einsum|*in* Inputs:**T**<br> *out* Output:**T**|12+|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)|
|Elu|*in* X:**T**<br> *out* Y:**T**|6+|**T** = tensor(float)|
|Equal|*in* A:**T**<br> *in* B:**T**<br> *out* C:**T1**|19+|**T** = tensor(bool), tensor(double), tensor(float), tensor(int32), tensor(int64), tensor(string)<br/> **T1** = tensor(bool)|
|||[13, 18]|**T** = tensor(bool), tensor(double), tensor(float), tensor(int32), tensor(int64)<br/> **T1** = tensor(bool)|
|||[11, 12]|**T** = tensor(bool), tensor(double), tensor(float), tensor(int32), tensor(int64)<br/> **T1** = tensor(bool)|
|||[7, 10]|**T** = tensor(bool), tensor(double), tensor(float), tensor(int32), tensor(int64)<br/> **T1** = tensor(bool)|
|Erf|*in* input:**T**<br> *out* output:**T**|13+|**T** = tensor(float)|
|||[9, 12]|**T** = tensor(float)|
|Exp|*in* input:**T**<br> *out* output:**T**|13+|**T** = tensor(double), tensor(float)|
|||[6, 12]|**T** = tensor(double), tensor(float)|
|Expand|*in* input:**T**<br> *in* shape:**tensor(int64)**<br> *out* output:**T**|13+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[8, 12]|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|EyeLike|*in* input:**T1**<br> *out* output:**T2**|9+|**T1** = tensor(double), tensor(float), tensor(int32), tensor(int64), tensor(uint64)<br/> **T2** = tensor(double), tensor(float), tensor(int32), tensor(int64), tensor(uint64)|
|Flatten|*in* input:**T**<br> *out* output:**T**|21+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[13, 20]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[11, 12]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[9, 10]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[1, 8]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Floor|*in* X:**T**<br> *out* Y:**T**|13+|**T** = tensor(double), tensor(float)|
|||[6, 12]|**T** = tensor(double), tensor(float)|
|GRU|*in* X:**T**<br> *in* W:**T**<br> *in* R:**T**<br> *in* B:**T**<br> *in* sequence_lens:**T1**<br> *in* initial_h:**T**<br> *out* Y:**T**<br> *out* Y_h:**T**|14+|**T** = tensor(double), tensor(float)<br/> **T1** = tensor(int32)|
|||[7, 13]|**T** = tensor(double), tensor(float)<br/> **T1** = tensor(int32)|
|Gather|*in* data:**T**<br> *in* indices:**Tind**<br> *out* output:**T**|13+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **Tind** = tensor(int32), tensor(int64)|
|||[11, 12]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **Tind** = tensor(int32), tensor(int64)|
|||[1, 10]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **Tind** = tensor(int32), tensor(int64)|
|GatherElements|*in* data:**T**<br> *in* indices:**Tind**<br> *out* output:**T**|13+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **Tind** = tensor(int32), tensor(int64)|
|||[11, 12]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **Tind** = tensor(int32), tensor(int64)|
|GatherND|*in* data:**T**<br> *in* indices:**tensor(int64)**<br> *out* output:**T**|13+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **indices** = tensor(int64)|
|||12|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **indices** = tensor(int64)|
|||11|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **indices** = tensor(int64)|
|Gelu|*in* X:**T**<br> *out* Y:**T**|20+|**T** = tensor(float)|
|Gemm|*in* A:**T**<br> *in* B:**T**<br> *in* C:**T**<br> *out* Y:**T**|13+|**T** = tensor(double), tensor(float)|
|||[11, 12]|**T** = tensor(double), tensor(float)|
|||[9, 10]|**T** = tensor(double), tensor(float)|
|||[7, 8]|**T** = tensor(double), tensor(float)|
|GlobalAveragePool|*in* X:**T**<br> *out* Y:**T**|1+|**T** = tensor(float)|
|GlobalLpPool|*in* X:**T**<br> *out* Y:**T**|2+|**T** = tensor(float)|
|GlobalMaxPool|*in* X:**T**<br> *out* Y:**T**|1+|**T** = tensor(float)|
|Greater|*in* A:**T**<br> *in* B:**T**<br> *out* C:**T1**|13+|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)<br/> **T1** = tensor(bool)|
|||[9, 12]|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)<br/> **T1** = tensor(bool)|
|||[7, 8]|**T** = tensor(double), tensor(float)<br/> **T1** = tensor(bool)|
|GreaterOrEqual|*in* A:**T**<br> *in* B:**T**<br> *out* C:**T1**|16+|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)<br/> **T1** = tensor(bool)|
|||[12, 15]|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)<br/> **T1** = tensor(bool)|
|GridSample|*in* X:**T1**<br> *in* grid:**T2**<br> *out* Y:**T1**|20+|**T1** = tensor(double), tensor(float)<br/> **T2** = tensor(double), tensor(float)|
|||[16, 19]|**T1** = tensor(float)<br/> **T2** = tensor(float)|
|HammingWindow|*in* size:**T1**<br> *out* output:**T2**|17+|**T1** = tensor(int32), tensor(int64)<br/> **T2** = tensor(double), tensor(float), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|HannWindow|*in* size:**T1**<br> *out* output:**T2**|17+|**T1** = tensor(int32), tensor(int64)<br/> **T2** = tensor(double), tensor(float), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|HardSigmoid|*in* X:**T**<br> *out* Y:**T**|6+|**T** = tensor(float)|
|Hardmax|*in* input:**T**<br> *out* output:**T**|13+|**T** = tensor(float)|
|||[11, 12]|**T** = tensor(float)|
|||[1, 10]|**T** = tensor(float)|
|Identity|*in* input:**T**<br> *out* output:**T**<br><br> or<br><br> *in* input:**V**<br> *out* output:**V**|21+|**V** = optional(seq(tensor(bfloat16))), optional(seq(tensor(bool))), optional(seq(tensor(double))), optional(seq(tensor(float))), optional(seq(tensor(float16))), optional(seq(tensor(int16))), optional(seq(tensor(int32))), optional(seq(tensor(int64))), optional(seq(tensor(int8))), optional(seq(tensor(string))), optional(seq(tensor(uint16))), optional(seq(tensor(uint32))), optional(seq(tensor(uint64))), optional(seq(tensor(uint8))), optional(tensor(bfloat16)), optional(tensor(bool)), optional(tensor(double)), optional(tensor(float)), optional(tensor(float16)), optional(tensor(int16)), optional(tensor(int32)), optional(tensor(int64)), optional(tensor(int8)), optional(tensor(string)), optional(tensor(uint16)), optional(tensor(uint32)), optional(tensor(uint64)), optional(tensor(uint8)), seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(float8e4m3fn)), seq(tensor(float8e4m3fnuz)), seq(tensor(float8e5m2)), seq(tensor(float8e5m2fnuz)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(string)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8)), tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[19, 20]|**V** = optional(seq(tensor(bfloat16))), optional(seq(tensor(bool))), optional(seq(tensor(double))), optional(seq(tensor(float))), optional(seq(tensor(float16))), optional(seq(tensor(int16))), optional(seq(tensor(int32))), optional(seq(tensor(int64))), optional(seq(tensor(int8))), optional(seq(tensor(string))), optional(seq(tensor(uint16))), optional(seq(tensor(uint32))), optional(seq(tensor(uint64))), optional(seq(tensor(uint8))), optional(tensor(bfloat16)), optional(tensor(bool)), optional(tensor(double)), optional(tensor(float)), optional(tensor(float16)), optional(tensor(int16)), optional(tensor(int32)), optional(tensor(int64)), optional(tensor(int8)), optional(tensor(string)), optional(tensor(uint16)), optional(tensor(uint32)), optional(tensor(uint64)), optional(tensor(uint8)), seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(float8e4m3fn)), seq(tensor(float8e4m3fnuz)), seq(tensor(float8e5m2)), seq(tensor(float8e5m2fnuz)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(string)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8)), tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[16, 18]|**V** = optional(seq(tensor(bfloat16))), optional(seq(tensor(bool))), optional(seq(tensor(double))), optional(seq(tensor(float))), optional(seq(tensor(float16))), optional(seq(tensor(int16))), optional(seq(tensor(int32))), optional(seq(tensor(int64))), optional(seq(tensor(int8))), optional(seq(tensor(string))), optional(seq(tensor(uint16))), optional(seq(tensor(uint32))), optional(seq(tensor(uint64))), optional(seq(tensor(uint8))), optional(tensor(bfloat16)), optional(tensor(bool)), optional(tensor(double)), optional(tensor(float)), optional(tensor(float16)), optional(tensor(int16)), optional(tensor(int32)), optional(tensor(int64)), optional(tensor(int8)), optional(tensor(string)), optional(tensor(uint16)), optional(tensor(uint32)), optional(tensor(uint64)), optional(tensor(uint8)), seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(string)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8)), tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[14, 15]|**V** = seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(string)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8)), tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||13|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[1, 12]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|If|*in* cond:**B**<br> *out* outputs:**V**|21+|**B** = tensor(bool)<br/> **V** = optional(seq(tensor(bfloat16))), optional(seq(tensor(bool))), optional(seq(tensor(double))), optional(seq(tensor(float))), optional(seq(tensor(float16))), optional(seq(tensor(int16))), optional(seq(tensor(int32))), optional(seq(tensor(int64))), optional(seq(tensor(int8))), optional(seq(tensor(string))), optional(seq(tensor(uint16))), optional(seq(tensor(uint32))), optional(seq(tensor(uint64))), optional(seq(tensor(uint8))), optional(tensor(bfloat16)), optional(tensor(bool)), optional(tensor(double)), optional(tensor(float)), optional(tensor(float16)), optional(tensor(int16)), optional(tensor(int32)), optional(tensor(int64)), optional(tensor(int8)), optional(tensor(string)), optional(tensor(uint16)), optional(tensor(uint32)), optional(tensor(uint64)), optional(tensor(uint8)), seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(float8e4m3fn)), seq(tensor(float8e4m3fnuz)), seq(tensor(float8e5m2)), seq(tensor(float8e5m2fnuz)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(string)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8)), tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[19, 20]|**B** = tensor(bool)<br/> **V** = optional(seq(tensor(bfloat16))), optional(seq(tensor(bool))), optional(seq(tensor(double))), optional(seq(tensor(float))), optional(seq(tensor(float16))), optional(seq(tensor(int16))), optional(seq(tensor(int32))), optional(seq(tensor(int64))), optional(seq(tensor(int8))), optional(seq(tensor(string))), optional(seq(tensor(uint16))), optional(seq(tensor(uint32))), optional(seq(tensor(uint64))), optional(seq(tensor(uint8))), optional(tensor(bfloat16)), optional(tensor(bool)), optional(tensor(double)), optional(tensor(float)), optional(tensor(float16)), optional(tensor(int16)), optional(tensor(int32)), optional(tensor(int64)), optional(tensor(int8)), optional(tensor(string)), optional(tensor(uint16)), optional(tensor(uint32)), optional(tensor(uint64)), optional(tensor(uint8)), seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(float8e4m3fn)), seq(tensor(float8e4m3fnuz)), seq(tensor(float8e5m2)), seq(tensor(float8e5m2fnuz)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(string)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8)), tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[16, 18]|**B** = tensor(bool)<br/> **V** = optional(seq(tensor(bfloat16))), optional(seq(tensor(bool))), optional(seq(tensor(double))), optional(seq(tensor(float))), optional(seq(tensor(float16))), optional(seq(tensor(int16))), optional(seq(tensor(int32))), optional(seq(tensor(int64))), optional(seq(tensor(int8))), optional(seq(tensor(string))), optional(seq(tensor(uint16))), optional(seq(tensor(uint32))), optional(seq(tensor(uint64))), optional(seq(tensor(uint8))), optional(tensor(bfloat16)), optional(tensor(bool)), optional(tensor(double)), optional(tensor(float)), optional(tensor(float16)), optional(tensor(int16)), optional(tensor(int32)), optional(tensor(int64)), optional(tensor(int8)), optional(tensor(string)), optional(tensor(uint16)), optional(tensor(uint32)), optional(tensor(uint64)), optional(tensor(uint8)), seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(string)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8)), tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[13, 15]|**B** = tensor(bool)<br/> **V** = seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(string)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8)), tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[11, 12]|**B** = tensor(bool)<br/> **V** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[1, 10]|**B** = tensor(bool)<br/> **V** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|ImageScaler|*in* input:**T**<br> *out* output:**T**|1+|**T** = tensor(float)|
|InstanceNormalization|*in* input:**T**<br> *in* scale:**T**<br> *in* B:**T**<br> *out* output:**T**|6+|**T** = tensor(float)|
|IsInf|*in* X:**T1**<br> *out* Y:**T2**|20+|**T1** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16), tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz)<br/> **T2** = tensor(bool)|
|||[10, 19]|**T1** = tensor(double), tensor(float)<br/> **T2** = tensor(bool)|
|IsNaN|*in* X:**T1**<br> *out* Y:**T2**|20+|**T1** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16), tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz)<br/> **T2** = tensor(bool)|
|||[13, 19]|**T1** = tensor(double), tensor(float), tensor(float16)<br/> **T2** = tensor(bool)|
|||[9, 12]|**T1** = tensor(double), tensor(float), tensor(float16)<br/> **T2** = tensor(bool)|
|LRN|*in* X:**T**<br> *out* Y:**T**|13+|**T** = tensor(float)|
|||[1, 12]|**T** = tensor(float)|
|LSTM|*in* X:**T**<br> *in* W:**T**<br> *in* R:**T**<br> *in* B:**T**<br> *in* sequence_lens:**T1**<br> *in* initial_h:**T**<br> *in* initial_c:**T**<br> *in* P:**T**<br> *out* Y:**T**<br> *out* Y_h:**T**<br> *out* Y_c:**T**|14+|**T** = tensor(double), tensor(float)<br/> **T1** = tensor(int32)|
|||[7, 13]|**T** = tensor(double), tensor(float)<br/> **T1** = tensor(int32)|
|LayerNormalization|*in* X:**T**<br> *in* Scale:**T**<br> *in* B:**T**<br> *out* Y:**T**<br> *out* Mean:**U**<br> *out* InvStdDev:**U**<br><br> or<br><br> *in* X:**T**<br> *in* Scale:**V**<br> *in* B:**V**<br> *out* Y:**V**<br> *out* Mean:**U**<br> *out* InvStdDev:**U**|17+|**T** = tensor(double), tensor(float)<br/> **U** = tensor(float)|
|||[1, 16]|**T** = tensor(double), tensor(float)<br/> **U** = tensor(double), tensor(float)<br/> **V** = tensor(double), tensor(float)|
|LeakyRelu|*in* X:**T**<br> *out* Y:**T**|16+|**T** = tensor(float)|
|||[6, 15]|**T** = tensor(float)|
|Less|*in* A:**T**<br> *in* B:**T**<br> *out* C:**T1**|13+|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)<br/> **T1** = tensor(bool)|
|||[9, 12]|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)<br/> **T1** = tensor(bool)|
|||[7, 8]|**T** = tensor(double), tensor(float)<br/> **T1** = tensor(bool)|
|LessOrEqual|*in* A:**T**<br> *in* B:**T**<br> *out* C:**T1**|16+|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)<br/> **T1** = tensor(bool)|
|||[12, 15]|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)<br/> **T1** = tensor(bool)|
|Log|*in* input:**T**<br> *out* output:**T**|13+|**T** = tensor(double), tensor(float)|
|||[6, 12]|**T** = tensor(double), tensor(float)|
|LogSoftmax|*in* input:**T**<br> *out* output:**T**|13+|**T** = tensor(double), tensor(float)|
|||[11, 12]|**T** = tensor(double), tensor(float)|
|||[1, 10]|**T** = tensor(double), tensor(float)|
|Loop|*in* M:**I**<br> *in* cond:**B**<br> *in* v_initial:**V**<br> *out* v_final_and_scan_outputs:**V**|21+|**B** = tensor(bool)<br/> **I** = tensor(int64)<br/> **V** = optional(seq(tensor(bfloat16))), optional(seq(tensor(bool))), optional(seq(tensor(double))), optional(seq(tensor(float))), optional(seq(tensor(float16))), optional(seq(tensor(int16))), optional(seq(tensor(int32))), optional(seq(tensor(int64))), optional(seq(tensor(int8))), optional(seq(tensor(string))), optional(seq(tensor(uint16))), optional(seq(tensor(uint32))), optional(seq(tensor(uint64))), optional(seq(tensor(uint8))), optional(tensor(bfloat16)), optional(tensor(bool)), optional(tensor(double)), optional(tensor(float)), optional(tensor(float16)), optional(tensor(int16)), optional(tensor(int32)), optional(tensor(int64)), optional(tensor(int8)), optional(tensor(string)), optional(tensor(uint16)), optional(tensor(uint32)), optional(tensor(uint64)), optional(tensor(uint8)), seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(float8e4m3fn)), seq(tensor(float8e4m3fnuz)), seq(tensor(float8e5m2)), seq(tensor(float8e5m2fnuz)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(string)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8)), tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[19, 20]|**B** = tensor(bool)<br/> **I** = tensor(int64)<br/> **V** = optional(seq(tensor(bfloat16))), optional(seq(tensor(bool))), optional(seq(tensor(double))), optional(seq(tensor(float))), optional(seq(tensor(float16))), optional(seq(tensor(int16))), optional(seq(tensor(int32))), optional(seq(tensor(int64))), optional(seq(tensor(int8))), optional(seq(tensor(string))), optional(seq(tensor(uint16))), optional(seq(tensor(uint32))), optional(seq(tensor(uint64))), optional(seq(tensor(uint8))), optional(tensor(bfloat16)), optional(tensor(bool)), optional(tensor(double)), optional(tensor(float)), optional(tensor(float16)), optional(tensor(int16)), optional(tensor(int32)), optional(tensor(int64)), optional(tensor(int8)), optional(tensor(string)), optional(tensor(uint16)), optional(tensor(uint32)), optional(tensor(uint64)), optional(tensor(uint8)), seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(float8e4m3fn)), seq(tensor(float8e4m3fnuz)), seq(tensor(float8e5m2)), seq(tensor(float8e5m2fnuz)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(string)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8)), tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[16, 18]|**B** = tensor(bool)<br/> **I** = tensor(int64)<br/> **V** = optional(seq(tensor(bfloat16))), optional(seq(tensor(bool))), optional(seq(tensor(double))), optional(seq(tensor(float))), optional(seq(tensor(float16))), optional(seq(tensor(int16))), optional(seq(tensor(int32))), optional(seq(tensor(int64))), optional(seq(tensor(int8))), optional(seq(tensor(string))), optional(seq(tensor(uint16))), optional(seq(tensor(uint32))), optional(seq(tensor(uint64))), optional(seq(tensor(uint8))), optional(tensor(bfloat16)), optional(tensor(bool)), optional(tensor(double)), optional(tensor(float)), optional(tensor(float16)), optional(tensor(int16)), optional(tensor(int32)), optional(tensor(int64)), optional(tensor(int8)), optional(tensor(string)), optional(tensor(uint16)), optional(tensor(uint32)), optional(tensor(uint64)), optional(tensor(uint8)), seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(string)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8)), tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[13, 15]|**B** = tensor(bool)<br/> **I** = tensor(int64)<br/> **V** = seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(string)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8)), tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[11, 12]|**B** = tensor(bool)<br/> **I** = tensor(int64)<br/> **V** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[1, 10]|**B** = tensor(bool)<br/> **I** = tensor(int64)<br/> **V** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|LpNormalization|*in* input:**T**<br> *out* output:**T**|1+|**T** = tensor(double), tensor(float)|
|LpPool|*in* X:**T**<br> *out* Y:**T**|18+|**T** = tensor(float)|
|||[11, 17]|**T** = tensor(float)|
|||[2, 10]|**T** = tensor(float)|
|MatMul|*in* A:**T**<br> *in* B:**T**<br> *out* Y:**T**|13+|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|||[9, 12]|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|||[1, 8]|**T** = tensor(double), tensor(float)|
|MatMulInteger|*in* A:**T1**<br> *in* B:**T2**<br> *in* a_zero_point:**T1**<br> *in* b_zero_point:**T2**<br> *out* Y:**T3**|10+|**T1** = tensor(int8), tensor(uint8)<br/> **T2** = tensor(int8), tensor(uint8)<br/> **T3** = tensor(int32)|
|Max|*in* data_0:**T**<br> *out* max:**T**|13+|**T** = tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|||12|**T** = tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|||[8, 11]|**T** = tensor(double), tensor(float)|
|||[6, 7]|**T** = tensor(float)|
|MaxPool|*in* X:**T**< br > *out* Y:**T**< br >< br > or< br >< br > *in* X:**T**< br > *out* Y:**T**< br > *out* Indices:**I**|12+|**I** = tensor(int64)< br /> **T** = tensor(double), tensor(float), tensor(int8), tensor(uint8)|
|||[8, 11]|**I** = tensor(int64)< br /> **T** = tensor(double), tensor(float)|
|||[1, 7]|**T** = tensor(float)|
|MaxRoiPool|*in* X:**T**< br > *in* rois:**T**< br > *out* Y:**T**|1+|**T** = tensor(float)|
|MaxUnpool|*in* X:**T1**< br > *in* I:**T2**< br > *in* output_shape:**T2**< br > *out* output:**T1**|11+|**T1** = tensor(float)< br /> **T2** = tensor(int64)|
|||[9, 10]|**T1** = tensor(float)< br /> **T2** = tensor(int64)|
|Mean|*in* data_0:**T**< br > *out* mean:**T**|13+|**T** = tensor(float)|
|||[8, 12]|**T** = tensor(float)|
|||[6, 7]|**T** = tensor(float)|
|MeanVarianceNormalization|*in* X:**T**< br > *out* Y:**T**< br >< br > or< br >< br > *in* input:**T**< br > *out* output:**T**|13+|**T** = tensor(float)|
|||[9, 12]|**T** = tensor(float)|
|||[1, 8]|**T** = tensor(float)|
|MelWeightMatrix|*in* num_mel_bins:**T1**< br > *in* dft_length:**T1**< br > *in* sample_rate:**T1**< br > *in* lower_edge_hertz:**T2**< br > *in* upper_edge_hertz:**T2**< br > *out* output:**T3**|17+|**T1** = tensor(int32), tensor(int64)< br /> **T2** = tensor(float)< br /> **T3** = tensor(double), tensor(float), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Min|*in* data_0:**T**< br > *out* min:**T**|13+|**T** = tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|||12|**T** = tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|||[8, 11]|**T** = tensor(double), tensor(float)|
|||[6, 7]|**T** = tensor(float)|
|Mod|*in* A:**T**< br > *in* B:**T**< br > *out* C:**T**|13+|**T** = tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[10, 12]|**T** = tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Mul|*in* A:**T**< br > *in* B:**T**< br > *out* C:**T**|14+|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)|
|||13|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)|
|||[7, 12]|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)|
|Multinomial|*in* input:**T1**< br > *out* output:**T2**|7+|**T1** = tensor(float)< br /> **T2** = tensor(int32), tensor(int64)|
|Neg|*in* X:**T**< br > *out* Y:**T**|13+|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64), tensor(int8)|
|||[6, 12]|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64), tensor(int8)|
|NonZero|*in* X:**T**< br > *out* Y:**tensor(int64)**|13+|**T** = tensor(bool), tensor(float), tensor(int32), tensor(int64), tensor(uint8)|
|||[9, 12]|**T** = tensor(bool), tensor(float), tensor(int32), tensor(int64), tensor(uint8)|
|Not|*in* X:**T**< br > *out* Y:**T**|1+|**T** = tensor(bool)|
|OneHot|*in* indices:**T1**< br > *in* depth:**T2**< br > *in* values:**T3**< br > *out* output:**T3**|11+|**T1** = tensor(float), tensor(int32), tensor(int64)< br /> **T2** = tensor(float), tensor(int32), tensor(int64)< br /> **T3** = tensor(float), tensor(int32), tensor(int64), tensor(string)|
|||[9, 10]|**T1** = tensor(float), tensor(int32), tensor(int64)< br /> **T2** = tensor(float), tensor(int32), tensor(int64)< br /> **T3** = tensor(float), tensor(int32), tensor(int64), tensor(string)|
|Optional|*in* input:**V**< br > *out* output:**O**|15+|**O** = optional(seq(tensor(bfloat16))), optional(seq(tensor(bool))), optional(seq(tensor(double))), optional(seq(tensor(float))), optional(seq(tensor(float16))), optional(seq(tensor(int16))), optional(seq(tensor(int32))), optional(seq(tensor(int64))), optional(seq(tensor(int8))), optional(seq(tensor(string))), optional(seq(tensor(uint16))), optional(seq(tensor(uint32))), optional(seq(tensor(uint64))), optional(seq(tensor(uint8))), optional(tensor(bfloat16)), optional(tensor(bool)), optional(tensor(double)), optional(tensor(float)), optional(tensor(float16)), optional(tensor(int16)), optional(tensor(int32)), optional(tensor(int64)), optional(tensor(int8)), optional(tensor(string)), optional(tensor(uint16)), optional(tensor(uint32)), optional(tensor(uint64)), optional(tensor(uint8))< br /> **V** = seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(string)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8)), tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|OptionalGetElement|*in* input:**O**< br > *out* output:**V**|18+|**O** = optional(seq(tensor(bfloat16))), optional(seq(tensor(bool))), optional(seq(tensor(double))), optional(seq(tensor(float))), optional(seq(tensor(float16))), optional(seq(tensor(int16))), optional(seq(tensor(int32))), optional(seq(tensor(int64))), optional(seq(tensor(int8))), optional(seq(tensor(string))), optional(seq(tensor(uint16))), optional(seq(tensor(uint32))), optional(seq(tensor(uint64))), optional(seq(tensor(uint8))), optional(tensor(bfloat16)), optional(tensor(bool)), optional(tensor(double)), optional(tensor(float)), optional(tensor(float16)), optional(tensor(int16)), optional(tensor(int32)), optional(tensor(int64)), optional(tensor(int8)), optional(tensor(string)), optional(tensor(uint16)), optional(tensor(uint32)), optional(tensor(uint64)), optional(tensor(uint8)), seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(string)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8)), tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)< br /> **V** = seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(string)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8)), tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[15, 17]|**O** = optional(seq(tensor(bfloat16))), optional(seq(tensor(bool))), optional(seq(tensor(double))), optional(seq(tensor(float))), optional(seq(tensor(float16))), optional(seq(tensor(int16))), optional(seq(tensor(int32))), optional(seq(tensor(int64))), optional(seq(tensor(int8))), optional(seq(tensor(string))), optional(seq(tensor(uint16))), optional(seq(tensor(uint32))), optional(seq(tensor(uint64))), optional(seq(tensor(uint8))), optional(tensor(bfloat16)), optional(tensor(bool)), optional(tensor(double)), optional(tensor(float)), optional(tensor(float16)), optional(tensor(int16)), optional(tensor(int32)), optional(tensor(int64)), optional(tensor(int8)), optional(tensor(string)), optional(tensor(uint16)), optional(tensor(uint32)), optional(tensor(uint64)), optional(tensor(uint8))< br /> **V** = seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(string)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8)), tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|OptionalHasElement|*in* input:**O**< br > *out* output:**B**|18+|**B** = tensor(bool)< br /> **O** = optional(seq(tensor(bfloat16))), optional(seq(tensor(bool))), optional(seq(tensor(double))), optional(seq(tensor(float))), optional(seq(tensor(float16))), optional(seq(tensor(int16))), optional(seq(tensor(int32))), optional(seq(tensor(int64))), optional(seq(tensor(int8))), optional(seq(tensor(string))), optional(seq(tensor(uint16))), optional(seq(tensor(uint32))), optional(seq(tensor(uint64))), optional(seq(tensor(uint8))), optional(tensor(bfloat16)), optional(tensor(bool)), optional(tensor(double)), optional(tensor(float)), optional(tensor(float16)), optional(tensor(int16)), optional(tensor(int32)), optional(tensor(int64)), optional(tensor(int8)), optional(tensor(string)), optional(tensor(uint16)), optional(tensor(uint32)), optional(tensor(uint64)), optional(tensor(uint8)), seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(string)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8)), tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[15, 17]|**B** = tensor(bool)< br /> **O** = optional(seq(tensor(bfloat16))), optional(seq(tensor(bool))), optional(seq(tensor(double))), optional(seq(tensor(float))), optional(seq(tensor(float16))), optional(seq(tensor(int16))), optional(seq(tensor(int32))), optional(seq(tensor(int64))), optional(seq(tensor(int8))), optional(seq(tensor(string))), optional(seq(tensor(uint16))), optional(seq(tensor(uint32))), optional(seq(tensor(uint64))), optional(seq(tensor(uint8))), optional(tensor(bfloat16)), optional(tensor(bool)), optional(tensor(double)), optional(tensor(float)), optional(tensor(float16)), optional(tensor(int16)), optional(tensor(int32)), optional(tensor(int64)), optional(tensor(int8)), optional(tensor(string)), optional(tensor(uint16)), optional(tensor(uint32)), optional(tensor(uint64)), optional(tensor(uint8))|
|Or|*in* A:**T**< br > *in* B:**T**< br > *out* C:**T1**|7+|**T** = tensor(bool)< br /> **T1** = tensor(bool)|
|PRelu|*in* X:**T**< br > *in* slope:**T**< br > *out* Y:**T**|16+|**T** = tensor(float)|
|||[9, 15]|**T** = tensor(float)|
|||[7, 8]|**T** = tensor(float)|
|Pad|*in* data:**T**< br > *in* pads:**tensor(int64)**< br > *in* constant_value:**T**< br > *in* axes:**Tind**< br > *out* output:**T**< br >< br > or< br >< br > *in* data:**T**< br > *in* pads:**tensor(int64)**< br > *in* constant_value:**T**< br > *out* output:**T**< br >< br > or< br >< br > *in* data:**T**< br > *out* output:**T**|21+|**T** = tensor(bool), tensor(double), tensor(float), tensor(int32), tensor(int64), tensor(int8), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[19, 20]|**T** = tensor(bool), tensor(double), tensor(float), tensor(int32), tensor(int64), tensor(int8), tensor(uint32), tensor(uint64), tensor(uint8)|
|||18|**T** = tensor(bool), tensor(double), tensor(float), tensor(int32), tensor(int64), tensor(int8), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[13, 17]|**T** = tensor(bool), tensor(double), tensor(float), tensor(int32), tensor(int64), tensor(int8), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[11, 12]|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64), tensor(int8), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[2, 10]|**T** = tensor(double), tensor(float)|
|ParametricSoftplus|*in* X:**T**< br > *out* Y:**T**|1+|**T** = tensor(float)|
|Pow|*in* X:**T**< br > *in* Y:**T**< br > *out* Z:**T**< br >< br > or< br >< br > *in* X:**T**< br > *in* Y:**T1**< br > *out* Z:**T**|15+|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)< br /> **T1** = tensor(double), tensor(float), tensor(int32), tensor(int64)|
|||[13, 14]|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)< br /> **T1** = tensor(double), tensor(float), tensor(int32), tensor(int64)|
|||12|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)< br /> **T1** = tensor(double), tensor(float), tensor(int32), tensor(int64)|
|||[7, 11]|**T** = tensor(double), tensor(float)|
|QLinearConv|*in* x:**T1**< br > *in* x_scale:**tensor(float)**< br > *in* x_zero_point:**T1**< br > *in* w:**T2**< br > *in* w_scale:**tensor(float)**< br > *in* w_zero_point:**T2**< br > *in* y_scale:**tensor(float)**< br > *in* y_zero_point:**T3**< br > *in* B:**T4**< br > *out* y:**T3**|10+|**T1** = tensor(int8), tensor(uint8)< br /> **T2** = tensor(int8), tensor(uint8)< br /> **T3** = tensor(int8), tensor(uint8)< br /> **T4** = tensor(int32)|
|QLinearMatMul|*in* a:**T1**< br > *in* a_scale:**TS**< br > *in* a_zero_point:**T1**< br > *in* b:**T2**< br > *in* b_scale:**TS**< br > *in* b_zero_point:**T2**< br > *in* y_scale:**TS**< br > *in* y_zero_point:**T3**< br > *out* y:**T3**< br >< br > or< br >< br > *in* a:**T1**< br > *in* a_scale:**tensor(float)**< br > *in* a_zero_point:**T1**< br > *in* b:**T2**< br > *in* b_scale:**tensor(float)**< br > *in* b_zero_point:**T2**< br > *in* y_scale:**tensor(float)**< br > *in* y_zero_point:**T3**< br > *out* y:**T3**|10+|**T1** = tensor(int8), tensor(uint8)< br /> **T2** = tensor(int8), tensor(uint8)< br /> **T3** = tensor(int8), tensor(uint8)|
|QuantizeLinear|*in* x:**T1**< br > *in* y_scale:**T1**< br > *in* y_zero_point:**T2**< br > *out* y:**T2**< br >< br > or< br >< br > *in* x:**T1**< br > *in* y_scale:**tensor(float)**< br > *in* y_zero_point:**T2**< br > *out* y:**T2**|21+|**T1** = tensor(float), tensor(float16)< br /> **T2** = tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int16), tensor(int8), tensor(uint16), tensor(uint8)|
|||[19, 20]|**T1** = tensor(float), tensor(float16)< br /> **T2** = tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int8), tensor(uint8)|
|||[13, 18]|**T1** = tensor(float)< br /> **T2** = tensor(int8), tensor(uint8)|
|||[10, 12]|**T1** = tensor(float)< br /> **T2** = tensor(int8), tensor(uint8)|
|RNN|*in* X:**T**< br > *in* W:**T**< br > *in* R:**T**< br > *in* B:**T**< br > *in* sequence_lens:**T1**< br > *in* initial_h:**T**< br > *out* Y:**T**< br > *out* Y_h:**T**|14+|**T** = tensor(float)< br /> **T1** = tensor(int32)|
|||[7, 13]|**T** = tensor(float)< br /> **T1** = tensor(int32)|
|RandomNormal|*out* output:**T**|1+|**T** = tensor(double), tensor(float)|
|RandomNormalLike|*in* input:**T1**< br > *out* output:**T2**|1+|**T1** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)< br /> **T2** = tensor(double), tensor(float)|
|RandomUniform|*out* output:**T**|1+|**T** = tensor(double), tensor(float)|
|RandomUniformLike|*in* input:**T1**< br > *out* output:**T2**|1+|**T1** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)< br /> **T2** = tensor(double), tensor(float)|
|Range|*in* start:**T**< br > *in* limit:**T**< br > *in* delta:**T**< br > *out* output:**T**|11+|**T** = tensor(double), tensor(float), tensor(int16), tensor(int32), tensor(int64)|
|Reciprocal|*in* X:**T**< br > *out* Y:**T**|13+|**T** = tensor(double), tensor(float)|
|||[6, 12]|**T** = tensor(double), tensor(float)|
|ReduceL1|*in* data:**T**< br > *in* axes:**tensor(int64)**< br > *out* reduced:**T**< br >< br > or< br >< br > *in* data:**T**< br > *out* reduced:**T**|18+|**T** = tensor(float), tensor(int32), tensor(int64)|
|||[13, 17]|**T** = tensor(float), tensor(int32), tensor(int64)|
|||[11, 12]|**T** = tensor(float), tensor(int32), tensor(int64)|
|||[1, 10]|**T** = tensor(float), tensor(int32), tensor(int64)|
|ReduceL2|*in* data:**T**< br > *in* axes:**tensor(int64)**< br > *out* reduced:**T**< br >< br > or< br >< br > *in* data:**T**< br > *out* reduced:**T**|18+|**T** = tensor(float), tensor(int32), tensor(int64)|
|||[13, 17]|**T** = tensor(float), tensor(int32), tensor(int64)|
|||[11, 12]|**T** = tensor(float), tensor(int32), tensor(int64)|
|||[1, 10]|**T** = tensor(float), tensor(int32), tensor(int64)|
|ReduceLogSum|*in* data:**T**< br > *in* axes:**tensor(int64)**< br > *out* reduced:**T**< br >< br > or< br >< br > *in* data:**T**< br > *out* reduced:**T**|18+|**T** = tensor(float), tensor(int32), tensor(int64)|
|||[13, 17]|**T** = tensor(float), tensor(int32), tensor(int64)|
|||[11, 12]|**T** = tensor(float), tensor(int32), tensor(int64)|
|||[1, 10]|**T** = tensor(float), tensor(int32), tensor(int64)|
|ReduceLogSumExp|*in* data:**T**< br > *in* axes:**tensor(int64)**< br > *out* reduced:**T**< br >< br > or< br >< br > *in* data:**T**< br > *out* reduced:**T**|18+|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)|
|||[13, 17]|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)|
|||[11, 12]|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)|
|||[1, 10]|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)|
|ReduceMax|*in* data:**T**< br > *in* axes:**tensor(int64)**< br > *out* reduced:**T**< br >< br > or< br >< br > *in* data:**T**< br > *out* reduced:**T**|20+|**T** = tensor(bool), tensor(double), tensor(float), tensor(int32), tensor(int64), tensor(int8), tensor(uint8)|
|||[18, 19]|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64), tensor(int8), tensor(uint8)|
|||[13, 17]|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64), tensor(int8), tensor(uint8)|
|||12|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64), tensor(int8), tensor(uint8)|
|||11|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)|
|||[1, 10]|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)|
|ReduceMean|*in* data:**T**<br> *in* axes:**tensor(int64)**<br> *out* reduced:**T**<br><br> or<br><br> *in* data:**T**<br> *out* reduced:**T**|18+|**T** = tensor(double), tensor(float), tensor(int32)|
|||[13, 17]|**T** = tensor(double), tensor(float), tensor(int32)|
|||[11, 12]|**T** = tensor(double), tensor(float), tensor(int32)|
|||[1, 10]|**T** = tensor(double), tensor(float), tensor(int32)|
|ReduceMin|*in* data:**T**<br> *in* axes:**tensor(int64)**<br> *out* reduced:**T**<br><br> or<br><br> *in* data:**T**<br> *out* reduced:**T**|20+|**T** = tensor(bool), tensor(double), tensor(float), tensor(int32), tensor(int64), tensor(int8), tensor(uint8)|
|||[18, 19]|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64), tensor(int8), tensor(uint8)|
|||[13, 17]|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64), tensor(int8), tensor(uint8)|
|||12|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64), tensor(int8), tensor(uint8)|
|||11|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)|
|||[1, 10]|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)|
|ReduceProd|*in* data:**T**<br> *in* axes:**tensor(int64)**<br> *out* reduced:**T**<br><br> or<br><br> *in* data:**T**<br> *out* reduced:**T**|18+|**T** = tensor(float), tensor(int32), tensor(int64)|
|||[13, 17]|**T** = tensor(float), tensor(int32), tensor(int64)|
|||[11, 12]|**T** = tensor(float), tensor(int32), tensor(int64)|
|||[1, 10]|**T** = tensor(float), tensor(int32), tensor(int64)|
|ReduceSum|*in* data:**T**<br> *in* axes:**tensor(int64)**<br> *out* reduced:**T**<br><br> or<br><br> *in* data:**T**<br> *out* reduced:**T**|13+|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)|
|||[11, 12]|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)|
|||[1, 10]|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)|
|ReduceSumSquare|*in* data:**T**<br> *in* axes:**tensor(int64)**<br> *out* reduced:**T**<br><br> or<br><br> *in* data:**T**<br> *out* reduced:**T**|18+|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)|
|||[13, 17]|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)|
|||[11, 12]|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)|
|||[1, 10]|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)|
|RegexFullMatch|*in* X:**T1**<br> *out* Y:**T2**|20+|**T1** = tensor(string)<br/> **T2** = tensor(bool)|
|Relu|*in* X:**T**<br> *out* Y:**T**|14+|**T** = tensor(double), tensor(float), tensor(int32), tensor(int8)|
|||13|**T** = tensor(double), tensor(float)|
|||[6, 12]|**T** = tensor(double), tensor(float)|
|Reshape|*in* data:**T**<br> *in* shape:**tensor(int64)**<br> *out* reshaped:**T**<br><br> or<br><br> *in* data:**T**<br> *out* reshaped:**T**|21+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **shape** = tensor(int64)|
|||[19, 20]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **shape** = tensor(int64)|
|||[14, 18]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **shape** = tensor(int64)|
|||13|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **shape** = tensor(int64)|
|||[5, 12]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **shape** = tensor(int64)|
|||[1, 4]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Resize|*in* X:**T**<br> *in* scales:**tensor(float)**<br> *out* Y:**T**<br><br> or<br><br> *in* X:**T1**<br> *in* roi:**T2**<br> *in* scales:**tensor(float)**<br> *in* sizes:**tensor(int64)**<br> *out* Y:**T1**|19+|**T1** = tensor(float), tensor(int32), tensor(int8), tensor(uint8)|
|||18|**T1** = tensor(float), tensor(int32), tensor(int8), tensor(uint8)|
|||[13, 17]|**T1** = tensor(float), tensor(int32), tensor(int8), tensor(uint8)|
|||[11, 12]|**T1** = tensor(float), tensor(int32), tensor(int8), tensor(uint8)|
|||10|**T** = tensor(float), tensor(int32), tensor(int8), tensor(uint8)|
|ReverseSequence|*in* input:**T**<br> *in* sequence_lens:**tensor(int64)**<br> *out* Y:**T**|10+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|RoiAlign|*in* X:**T1**<br> *in* rois:**T1**<br> *in* batch_indices:**T2**<br> *out* Y:**T1**|16+|**T1** = tensor(double), tensor(float)<br/> **T2** = tensor(int64)|
|||[10, 15]|**T1** = tensor(double), tensor(float)<br/> **T2** = tensor(int64)|
|Round|*in* X:**T**<br> *out* Y:**T**|11+|**T** = tensor(double), tensor(float), tensor(float16)|
|STFT|*in* signal:**T1**<br> *in* frame_step:**T2**<br> *in* window:**T1**<br> *in* frame_length:**T2**<br> *out* output:**T1**|17+|**T1** = tensor(double), tensor(float)<br/> **T2** = tensor(int32), tensor(int64)|
|Scale|*in* input:**T**<br> *out* output:**T**|1+|**T** = tensor(float)|
|ScaledTanh|*in* input:**T**<br> *out* output:**T**|1+|**T** = tensor(float)|
|Scan|*in* initial_state_and_scan_inputs:**V**<br> *out* final_state_and_scan_outputs:**V**<br><br> or<br><br> *in* sequence_lens:**I**<br> *in* initial_state_and_scan_inputs:**V**<br> *out* final_state_and_scan_outputs:**V**|21+|**V** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[19, 20]|**V** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[16, 18]|**V** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[11, 15]|**V** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[9, 10]|**V** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||8|**I** = tensor(int64)<br/> **V** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Scatter|*in* data:**T**<br> *in* indices:**Tind**<br> *in* updates:**T**<br> *out* output:**T**|[9, 10]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **Tind** = tensor(int32), tensor(int64)|
|ScatterElements|*in* data:**T**<br> *in* indices:**Tind**<br> *in* updates:**T**<br> *out* output:**T**|18+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **Tind** = tensor(int32), tensor(int64)|
|||[16, 17]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **Tind** = tensor(int32), tensor(int64)|
|||[13, 15]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **Tind** = tensor(int32), tensor(int64)|
|||[11, 12]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **Tind** = tensor(int32), tensor(int64)|
|ScatterND|*in* data:**T**<br> *in* indices:**tensor(int64)**<br> *in* updates:**T**<br> *out* output:**T**|18+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[16, 17]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[13, 15]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[11, 12]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Selu|*in* X:**T**<br> *out* Y:**T**|6+|**T** = tensor(float)|
|SequenceAt|*in* input_sequence:**S**<br> *in* position:**I**<br> *out* tensor:**T**|11+|**I** = tensor(int32), tensor(int64)<br/> **S** = seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(string)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8))<br/> **T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|SequenceConstruct|*in* inputs:**T**<br> *out* output_sequence:**S**|11+|**S** = seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(string)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8))<br/> **T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|SequenceEmpty|*out* output:**S**|11+|**S** = seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(string)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8))|
|SequenceErase|*in* input_sequence:**S**<br> *in* position:**I**<br> *out* output_sequence:**S**|11+|**I** = tensor(int32), tensor(int64)<br/> **S** = seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(string)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8))|
|SequenceInsert|*in* input_sequence:**S**<br> *in* tensor:**T**<br> *in* position:**I**<br> *out* output_sequence:**S**|11+|**I** = tensor(int32), tensor(int64)<br/> **S** = seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(string)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8))|
|SequenceLength|*in* input_sequence:**S**<br> *out* length:**I**|11+|**I** = tensor(int64)<br/> **S** = seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(string)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8))|
Integration with ONNX 1.16.0 (#19745)
### Description
update with ONNX 1.16.0 branch according to
https://github.com/microsoft/onnxruntime/blob/main/docs/How_To_Update_ONNX_Dev_Notes.md
ONNX 1.16.0 release notes:
https://github.com/onnx/onnx/releases/tag/v1.16.0
#### Updated ops for CPU EP:
- DequantizeLinear(21)
- Added int16 and uint16 support + various optimizer tests
- Missing int4 and uint4 support
- Missing block dequantization support
- QuantizeLinear(21)
- Added int16 and uint16 support + various optimizer tests
- Missing int4 and uint4 support
- Missing block quantization support
- Cast(21)
- Missing int4 and uint4 support
- CastLike(21)
- Missing int4 and uint4 support
- ConstantOfShape(21)
- Missing int4 and uint4 support
- Identity(21)
- Missing int4 and uint4 support
- If(21)
- Missing int4 and uint4 support
- Loop(21)
- Missing int4 and uint4 support
- Reshape(21)
- Missing int4 and uint4 support
- Scan(21)
- Missing int4 and uint4 support
- Shape(21)
- Missing int4 and uint4 support
- Size(21)
- Missing int4 and uint4 support
- Flatten(21)
- Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4
support
- Pad(21)
- Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4
support
- Squeeze(21)
- Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4
support
- Transpose(21)
- Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4
support
- Unsqueeze(21)
- Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4
support
#### Unimplemented opset 21 features/ops
- int4 and uint4 data type
- QLinearMatMul(21)
- GroupNormalization(21)
- ai.onnx.ml.TreeEnsemble(5)
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Disabled tests
#### ORT Training
orttraining/orttraining/test/python/orttraining_test_ort_apis_py_bindings.py
- test_ort_custom_ops: Potential shape inference bug for custom ops
#### Python quantization unit tests
test/onnx/python/quantization (shape inference bug)
- test_op_conv_transpose.py: test_quantize_conv_transpose_u8u8_fp16
- test_op_conv_transpose.py: test_quantize_conv_transpose_s8s8_fp16
- test_op_gemm.py: test_quantize_qop_gemm_s8s8
- test_op_gemm.py: test_quantize_qop_gemm_e4m3fn_same
- test_op_gemm.py: test_quantize_qop_gemm_e4m3fn_p3
- test_op_matmul.py: test_quantize_matmul_u8u8_f16
- test_op_matmul.py: test_quantize_matmul_s8s8_f16
- test_op_matmul.py: test_quantize_matmul_s8s8_f16_entropy
- test_op_matmul.py: test_quantize_matmul_s8s8_f16_percentile
- test_op_matmul.py: test_quantize_matmul_s8s8_f16_distribution
- test_op_relu.py: test_quantize_qop_relu_s8s8
#### ONNX tests
- test_maxpool_2d_ceil_output_size_reduce_by_one: ONNX 1.16.0 fixed a
maxpool output size bug and added this test. Enable this test when [ORT
PR](https://github.com/microsoft/onnxruntime/pull/18377) is merged.
Refer to original [ONNX PR](https://github.com/onnx/onnx/pull/5741).
- test_ai_onnx_ml_tree_ensemble_set_membership_cpu: new unimplemented op
ai.onnx.ml.TreeEnsemble
- test_ai_onnx_ml_tree_ensemble_single_tree_cpu: same
- test_ai_onnx_ml_tree_ensemble_set_membership_cuda: same
- test_ai_onnx_ml_tree_ensemble_single_tree_cuda: same
- test_cast_INT4_to_FLOAT_cpu: ORT Cast(21) impl doesn't support int4
yet
- test_cast_INT4_to_INT8_cpu: same
- test_cast_UINT4_to_FLOAT_cpu: same
- test_cast_UINT4_to_UINT8_cpu: same
- test_cast_INT4_to_FLOAT_cuda: same
- test_cast_INT4_to_INT8_cuda: same
- test_cast_UINT4_to_FLOAT_cuda: same
- test_cast_UINT4_to_UINT8_cuda: same
- test_constantofshape_float_ones_cuda: ConstantOfShape(21) not
implemented for cuda
- test_constantofshape_int_shape_zero_cuda: same
- test_constantofshape_int_zeros_cuda: same
- test_flatten_axis0_cuda: Flatten(21) not implemented for cuda
- test_flatten_axis1_cuda: same
- test_flatten_axis2_cuda: same
- test_flatten_axis3_cuda: same
- test_flatten_default_axis_cuda: same
- test_flatten_negative_axis1_cuda: same
- test_flatten_negative_axis2_cuda: same
- test_flatten_negative_axis3_cuda: same
- test_flatten_negative_axis4_cuda: same
- test_qlinearmatmul_2D_int8_float16_cpu: ONNX QLinearMatMul(21) not
implemented in ORT yet
- test_qlinearmatmul_2D_int8_float32_cpu: same
- test_qlinearmatmul_2D_uint8_float16_cpu: same
- test_qlinearmatmul_2D_uint8_float32_cpu: same
- test_qlinearmatmul_3D_int8_float16_cpu: same
- test_qlinearmatmul_3D_int8_float32_cpu: same
- test_qlinearmatmul_3D_uint8_float16_cpu: same
- test_qlinearmatmul_3D_uint8_float32_cpu: same
- test_qlinearmatmul_2D_int8_float16_cuda: same
- test_qlinearmatmul_2D_int8_float32_cuda: same
- test_qlinearmatmul_2D_uint8_float16_cuda: same
- test_qlinearmatmul_2D_uint8_float32_cuda: same
- test_qlinearmatmul_3D_int8_float16_cuda: same
- test_qlinearmatmul_3D_int8_float32_cuda: same
- test_qlinearmatmul_3D_uint8_float16_cuda: same
- test_qlinearmatmul_3D_uint8_float32_cuda: same
- test_size_cuda: Size(21) not implemented for cuda
- test_size_example_cuda: same
- test_dequantizelinear_blocked: Missing implementation for block
dequant for DequantizeLinear(21)
- test_quantizelinear_blocked_asymmetric: Missing implementation for
block quant for QuantizeLinear(21)
- test_quantizelinear_blocked_symmetric: Missing implementation for
block quant for QuantizeLinear(21)
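The three `blocked` tests above exercise block (de)quantization, which ONNX opset 21 adds to QuantizeLinear/DequantizeLinear through a `block_size` attribute: consecutive groups of `block_size` elements along the quantization axis share one scale and one zero point. As a rough, non-authoritative sketch of the dequantization side for a 1-D input (the helper name is hypothetical and this is not ORT code):

```python
from typing import List, Sequence


def dequantize_blocked_1d(
    q: Sequence[int],
    scales: Sequence[float],
    zero_points: Sequence[int],
    block_size: int,
) -> List[float]:
    """Hypothetical helper: blocked DequantizeLinear for a 1-D tensor.

    Element i uses scales[i // block_size] and zero_points[i // block_size],
    i.e. y = (q - zero_point) * scale for the block the element belongs to.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    num_blocks = -(-len(q) // block_size)  # ceil(len(q) / block_size)
    if len(scales) != num_blocks or len(zero_points) != num_blocks:
        raise ValueError(f"expected {num_blocks} scales and zero points")
    return [
        (v - zero_points[i // block_size]) * scales[i // block_size]
        for i, v in enumerate(q)
    ]


# Five values with block_size=2 give three blocks; the last block is partial.
print(dequantize_blocked_1d([10, 12, 0, 4, 7], [0.5, 2.0, 1.0], [10, 0, 5], 2))
# [0.0, 1.0, 0.0, 8.0, 2.0]
```

Setting `block_size` to the full axis length reduces this to per-tensor dequantization with a single scale.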
---------
Signed-off-by: liqunfu <liqun.fu@microsoft.com>
Signed-off-by: Ganesan Ramalingam <grama@microsoft.com>
Co-authored-by: Ganesan Ramalingam <grama@microsoft.com>
Co-authored-by: George Wu <jywu@microsoft.com>
Co-authored-by: adrianlizarraga <adlizarraga@microsoft.com>
2024-04-12 16:46:49 +00:00
|Shape|*in* data:**T**<br> *out* shape:**T1**|21+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T1** = tensor(int64)|
|||[19, 20]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T1** = tensor(int64)|
Introduce float 8 types (#14731)
### Description
The PR implements FloatE4M3FN, FloatE5M2, FloatE4M3FNUZ, FloatE5M2FNUZ
as described in PR https://github.com/onnx/onnx/pull/4805. It uses the CUDA
API to cast float/half to float 8 when CUDA >= 11.8, and a custom implementation
when CUDA < 11.8.
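Without the CUDA API conversions available from CUDA 11.8, the cast to float 8 has to be done in software. As a rough illustration of what that involves (not ORT's implementation), the brute-force Python sketch below handles float8e4m3fn only. It assumes the standard layout: 1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, no infinities, a single NaN pattern (`S.1111.111`), and 448 as the largest finite value. It rounds to nearest, ties to even. Its hypothetical `saturate` flag mimics the Cast setting: out-of-range inputs (including infinities) clamp to ±448 when true and become NaN when false. As a simplification, it treats any magnitude above 448 as out of range before rounding, which may differ from ORT for inputs just above 448.

```python
import math

E4M3FN_MAX = 448.0  # largest finite value: 0x7E = S.1111.110 = 1.75 * 2**8
E4M3FN_NAN = 0x7F   # S.1111.111 is the only NaN pattern; there is no infinity


def decode_e4m3fn(code: int) -> float:
    """Decode one float8e4m3fn byte: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits."""
    sign = -1.0 if code & 0x80 else 1.0
    exponent = (code >> 3) & 0x0F
    mantissa = code & 0x07
    if exponent == 0x0F and mantissa == 0x07:
        return math.nan
    if exponent == 0:  # subnormal: no implicit leading 1
        return sign * (mantissa / 8.0) * 2.0 ** -6
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


# All non-negative finite codes (0x00..0x7E); their values increase with the code.
_POSITIVE_CODES = [(code, decode_e4m3fn(code)) for code in range(0x7F)]


def cast_to_e4m3fn(x: float, saturate: bool = True) -> int:
    """Hypothetical software cast of a Python float to a float8e4m3fn byte."""
    if math.isnan(x):
        return E4M3FN_NAN
    sign_bit = 0x80 if math.copysign(1.0, x) < 0 else 0x00
    magnitude = abs(x)
    if magnitude > E4M3FN_MAX:  # out of range, including +/-inf
        return sign_bit | (0x7E if saturate else E4M3FN_NAN)
    # Nearest representable magnitude; on an exact tie prefer the even code.
    # Neighbouring codes differ by one, so this is round-to-nearest-even.
    code, _ = min(_POSITIVE_CODES, key=lambda cv: (abs(cv[1] - magnitude), cv[0] & 1))
    return sign_bit | code


print(hex(cast_to_e4m3fn(1.0625)))                  # 0x38: tie between 1.0 and 1.125 -> even
print(decode_e4m3fn(cast_to_e4m3fn(-1000.0)))       # -448.0 (saturated)
print(hex(cast_to_e4m3fn(1000.0, saturate=False)))  # 0x7f (NaN)
```

A real kernel would compute the result with bit manipulation rather than a table search. The exhaustive search keeps the rounding rule easy to check.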
* It implements Cast, QuantizeLinear, and DequantizeLinear for all four types on
CPU, but only for FloatE4M3FN and FloatE5M2 on CUDA.
* It extends the types supported by the control flow operators (If, Loop,
Scan) and by Shape, Reshape, and Identity.
* It implements Equal(19).
* Cast, QuantizeLinear, and DequantizeLinear now accept a `saturate`
attribute, which only applies to float 8 types and defaults to true. When
true, any out-of-range value is clamped to the largest finite float 8 value
of the same sign. When false, out-of-range values become infinity for
FloatE5M2 and NaN for the three types that cannot represent infinity.
* QuantizeLinear and DequantizeLinear now support per-channel scales on CUDA
(and, by extension, ROCm): the scale can be a 1-D tensor with one scale per channel.
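With a 1-D scale, QuantizeLinear computes `y = saturate(round(x / y_scale) + y_zero_point)` separately for each slice along `axis`, using that slice's scale and zero point, with rounding to nearest, ties to even. The pure-Python sketch below assumes a 2-D input, `axis=1` (the ONNX default), and uint8 output. The helper names are hypothetical, and this does not reflect how the CUDA kernels are written.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def quantize_per_channel(
    x: Sequence[Sequence[float]],
    scales: Sequence[float],
    zero_points: Sequence[int],
    qmin: int = 0,
    qmax: int = 255,
) -> List[List[int]]:
    """Hypothetical helper: per-channel QuantizeLinear on a 2-D input with axis=1.

    Column c uses scales[c] and zero_points[c]:
    y = clamp(round(x / scale) + zero_point, qmin, qmax).
    Python's round() rounds halves to even, the same tie rule ONNX specifies.
    """
    return [
        [min(qmax, max(qmin, round(v / scales[c]) + zero_points[c])) for c, v in enumerate(row)]
        for row in x
    ]


def dequantize_per_channel(
    q: Sequence[Sequence[int]], scales: Sequence[float], zero_points: Sequence[int]
) -> Matrix:
    """Inverse mapping: x = (q - zero_point) * scale, again one pair per column."""
    return [[(v - zero_points[c]) * scales[c] for c, v in enumerate(row)] for row in q]


x = [[0.0, -1.0], [1.25, 0.5], [1.0, 200.0]]
q = quantize_per_channel(x, scales=[0.5, 1.0], zero_points=[0, 128])
print(q)                                          # [[0, 127], [2, 128], [2, 255]]
print(dequantize_per_channel(q, [0.5, 1.0], [0, 128]))
```

In the example, 1.25 / 0.5 = 2.5 rounds down to the even value 2, and 200.0 saturates to 255 in the second channel.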
### Motivation and Context
Supports the latest ONNX version.
Fixes
[AB#15395](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/15395)
---------
Co-authored-by: Xavier Dupre <xadupre@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>
2023-05-30 20:25:58 +00:00
|||[15, 18]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T1** = tensor(int64)|
2021-08-25 19:04:20 +00:00
|||[13, 14]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T1** = tensor(int64)|
2021-04-26 20:38:40 +00:00
|||[1, 12]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T1** = tensor(int64)|
2021-06-02 07:47:40 +00:00
|Shrink|*in* input:**T**<br> *out* output:**T**|9+|**T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Sigmoid|*in* X:**T**<br> *out* Y:**T**|13+|**T** = tensor(double), tensor(float)|
2021-04-26 20:38:40 +00:00
|||[6, 12]|**T** = tensor(double), tensor(float)|
2021-06-02 07:47:40 +00:00
|Sign|*in* input:**T**<br> *out* output:**T**|13+|**T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
2021-04-26 20:38:40 +00:00
|||[9, 12]|**T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
2022-03-29 13:22:04 +00:00
|SimplifiedLayerNormalization|*in* X:**T**<br> *in* scale:**V**<br> *out* Y:**V**<br> *out* inv_std_var:**U**|1+|**T** = tensor(double), tensor(float)<br/> **U** = tensor(double), tensor(float)<br/> **V** = tensor(double), tensor(float)|
2021-06-02 07:47:40 +00:00
|Sin|*in* input:**T**<br> *out* output:**T**|7+|**T** = tensor(double), tensor(float)|
|Sinh|*in* input:**T**<br> *out* output:**T**|9+|**T** = tensor(float)|
2024-04-12 16:46:49 +00:00
|Size|*in* data:**T**<br> *out* size:**T1**|21+|**T** = tensor(bool), tensor(double), tensor(float), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T1** = tensor(int64)|
|||[19, 20]|**T** = tensor(bool), tensor(double), tensor(float), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T1** = tensor(int64)|
2023-08-11 21:48:53 +00:00
|||[13, 18]|**T** = tensor(bool), tensor(double), tensor(float), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T1** = tensor(int64)|
2021-04-26 20:38:40 +00:00
|||[1, 12]|**T** = tensor(bool), tensor(double), tensor(float), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T1** = tensor(int64)|
2021-06-02 07:47:40 +00:00
|Slice|*in* data:**T**<br> *in* starts:**Tind**<br> *in* ends:**Tind**<br> *in* axes:**Tind**<br> *in* steps:**Tind**<br> *out* output:**T**<br><br> or<br><br> *in* data:**T**<br> *out* output:**T**|13+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **Tind** = tensor(int32), tensor(int64)|
2021-04-26 20:38:40 +00:00
|||[11, 12]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **Tind** = tensor(int32), tensor(int64)|
2020-09-02 22:07:50 +00:00
|||10|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **Tind** = tensor(int32), tensor(int64)|
|||[1, 9]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
2021-06-02 07:47:40 +00:00
|Softmax|*in* input:**T**<br> *out* output:**T**|13+|**T** = tensor(double), tensor(float)|
2021-04-26 20:38:40 +00:00
|||[11, 12]|**T** = tensor(double), tensor(float)|
2020-09-02 22:07:50 +00:00
|||[1, 10]|**T** = tensor(double), tensor(float)|
2021-06-02 07:47:40 +00:00
|Softplus|*in* X:**T**<br> *out* Y:**T**|1+|**T** = tensor(float)|
|Softsign|*in* input:**T**<br> *out* output:**T**|1+|**T** = tensor(float)|
2021-07-09 08:00:22 +00:00
|SpaceToDepth|*in* input:**T**<br> *out* output:**T**|13+|**T** = tensor(double), tensor(float)|
|||[1, 12]|**T** = tensor(double), tensor(float)|
2023-01-11 22:14:10 +00:00
|Split|*in* input:**T**<br> *in* split:**T**<br> *out* outputs...:**T**<br><br> or<br><br> *in* input:**T**<br> *in* split:**tensor(int64)**<br> *out* outputs:**T**<br><br> or<br><br> *in* input:**T**<br> *out* outputs:**T**|18+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[13, 17]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
2022-12-12 17:29:15 +00:00
|||[11, 12]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[2, 10]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
2023-11-29 18:44:59 +00:00
|SplitToSequence|*in* input:**T**<br> *in* split:**I**<br> *out* output_sequence:**S**|11+|**I** = tensor(int32), tensor(int64)<br/> **S** = seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(string)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8))<br/> **T** = tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(string)|
2021-06-02 07:47:40 +00:00
|Sqrt|*in* X:**T**<br> *out* Y:**T**|13+|**T** = tensor(double), tensor(float)|
2021-04-26 20:38:40 +00:00
|||[6, 12]|**T** = tensor(double), tensor(float)|
2024-04-12 16:46:49 +00:00
|Squeeze|*in* data:**T**<br> *in* axes:**tensor(int64)**<br> *out* squeezed:**T**<br><br> or<br><br> *in* data:**T**<br> *out* squeezed:**T**|21+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[13, 20]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
2021-04-26 20:38:40 +00:00
|||[11, 12]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
2020-09-02 22:07:50 +00:00
|||[1, 10]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
2024-01-11 18:01:43 +00:00
|StringConcat|*in* X:**T**<br> *in* Y:**T**<br> *out* Z:**T**|20+|**T** = tensor(string)|
2022-02-18 06:55:32 +00:00
|StringNormalizer|*in* X:**tensor(string)**<br> *out* Y:**tensor(string)**|10+|**X** = tensor(string)|
2024-01-12 17:46:23 +00:00
|StringSplit|*in* X:**T1**<br> *out* Y:**T2**<br> *out* Z:**T3**|20+|**T1** = tensor(string)<br/> **T2** = tensor(string)<br/> **T3** = tensor(int64)|
2021-06-02 07:47:40 +00:00
|Sub|*in* A:**T**<br> *in* B:**T**<br> *out* C:**T**|14+|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)|
2021-04-26 20:38:40 +00:00
|||13|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)|
|||[7, 12]|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)|
2021-06-02 07:47:40 +00:00
|Sum|*in* data_0:**T**<br> *out* sum:**T**|13+|**T** = tensor(double), tensor(float)|
2021-04-26 20:38:40 +00:00
|||[8, 12]|**T** = tensor(double), tensor(float)|
|||[6, 7]|**T** = tensor(double), tensor(float)|
2021-06-02 07:47:40 +00:00
|Tan|*in* input:**T**<br> *out* output:**T**|7+|**T** = tensor(float)|
|Tanh|*in* input:**T**<br> *out* output:**T**|13+|**T** = tensor(double), tensor(float)|
2021-04-26 20:38:40 +00:00
|||[6, 12]|**T** = tensor(double), tensor(float)|
2021-06-02 07:47:40 +00:00
|TfIdfVectorizer|*in* X:**T**<br> *out* Y:**T1**|9+|**T** = tensor(int32), tensor(int64), tensor(string)<br/> **T1** = tensor(float)|
|ThresholdedRelu|*in* X:**T**<br> *out* Y:**T**|10+|**T** = tensor(float)|
2020-09-02 22:07:50 +00:00
|||[1, 9]|**T** = tensor(float)|
2023-02-16 22:59:44 +00:00
|Tile|*in* input:**T**<br> *in* repeats:**T1**<br> *out* output:**T**<br><br> or<br><br> *in* input:**T**<br> *in* tiles:**T**<br> *in* axis:**T**<br> *out* output:**T**|13+|**T** = tensor(bool), tensor(double), tensor(float), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T1** = tensor(int64)|
|||[6, 12]|**T** = tensor(bool), tensor(double), tensor(float), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T1** = tensor(int64)|
2021-06-02 07:47:40 +00:00
|TopK|*in* X:**T**<br> *in* K:**tensor(int64)**<br> *out* Values:**T**<br> *out* Indices:**I**<br><br> or<br><br> *in* X:**T**<br> *out* Values:**T**<br> *out* Indices:**I**|11+|**I** = tensor(int64)<br/> **T** = tensor(double), tensor(float), tensor(int32), tensor(int64)|
2021-04-26 20:38:40 +00:00
|||10|**I** = tensor(int64)<br/> **T** = tensor(double), tensor(float)|
|||[1, 9]|**I** = tensor(int64)<br/> **T** = tensor(double), tensor(float)|
Integration with ONNX 1.16.0 (#19745)
### Description
update with ONNX 1.16.0 branch according to
https://github.com/microsoft/onnxruntime/blob/main/docs/How_To_Update_ONNX_Dev_Notes.md
ONNX 1.16.0 release notes:
https://github.com/onnx/onnx/releases/tag/v1.16.0
#### Updated ops for CPU EP:
- DequantizeLinear(21)
- Added int16 and uint16 support + various optimizer tests
- Missing int4 and uint4 support
- Missing block dequantization support
- QuantizeLinear(21)
- Added int16 and uint16 support + various optimizer tests
- Missing int4 and uint4 support
- Missing block quantization support
- Cast(21)
- Missing int4 and uint4 support
- CastLike(21)
- Missing int4 and uint4 support
- ConstantOfShape(21)
- Missing int4 and uint4 support
- Identity(21)
- Missing int4 and uint4 support
- If(21)
- Missing int4 and uint4 support
- Loop(21)
- Missing int4 and uint4 support
- Reshape(21)
- Missing int4 and uint4 support
- Scan(21)
- Missing int4 and uint4 support
- Shape(21)
- Missing int4 and uint4 support
- Size(21)
- Missing int4 and uint4 support
- Flatten(21)
- Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4
support
- Pad(21)
- Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4
support
- Squeeze(21)
- Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4
support
- Transpose(21)
- Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4
support
- Unsqueeze(21)
- Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4
support
#### Unimplemented opset 21 features/ops
- int4 and uint4 data type
- QLinearMatMul(21)
- GroupNormalization(21)
- ai.onnx.ml.TreeEnsemble(5)
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Disabled tests
#### ORT Training
orttraining/orttraining/test/python/orttraining_test_ort_apis_py_bindings.py
- test_ort_custom_ops: Potential shape inference bug for custom ops
#### Python quantization unit tests
test/onnx/python/quantization (shape inference bug)
- test_op_conv_transpose.py: test_quantize_conv_transpose_u8u8_fp16
- test_op_conv_transpose.py: test_quantize_conv_transpose_s8s8_fp16
- test_op_gemm.py: test_quantize_qop_gemm_s8s8
- test_op_gemm.py: test_quantize_qop_gemm_e4m3fn_same
- test_op_gemm.py: test_quantize_qop_gemm_e4m3fn_p3
- test_op_matmul.py: test_quantize_matmul_u8u8_f16
- test_op_matmul.py: test_quantize_matmul_s8s8_f16
- test_op_matmul.py: test_quantize_matmul_s8s8_f16_entropy
- test_op_matmul.py: test_quantize_matmul_s8s8_f16_percentile
- test_op_matmul.py: test_quantize_matmul_s8s8_f16_distribution
- test_op_relu.py: test_quantize_qop_relu_s8s8
#### ONNX tests
- test_maxpool_2d_ceil_output_size_reduce_by_one: ONNX 1.16.0 fixed a
maxpool output size bug and added this test. Enable this test when [ORT
PR](https://github.com/microsoft/onnxruntime/pull/18377) is merged.
Refer to original [ONNX PR](https://github.com/onnx/onnx/pull/5741).
- test_ai_onnx_ml_tree_ensemble_set_membership_cpu: new unimplemented op
ai.onnx.ml.TreeEnsemble
- test_ai_onnx_ml_tree_ensemble_single_tree_cpu: same
- test_ai_onnx_ml_tree_ensemble_set_membership_cuda: same
- test_ai_onnx_ml_tree_ensemble_single_tree_cuda: same
- test_cast_INT4_to_FLOAT_cpu: ORT Cast(21) impl doesn't support int4
yet
- test_cast_INT4_to_INT8_cpu: same
- test_cast_UINT4_to_FLOAT_cpu: same
- test_cast_UINT4_to_UINT8_cpu: same
- test_cast_INT4_to_FLOAT_cuda
- test_cast_INT4_to_INT8_cuda
- test_cast_UINT4_to_FLOAT_cuda
- test_cast_UINT4_to_UINT8_cuda
- test_constantofshape_float_ones_cuda: ConstantOfShape(21) not
implemented for cuda
- test_constantofshape_int_shape_zero_cuda: same
- test_constantofshape_int_zeros_cuda: same
- test_flatten_axis0_cuda: Flatten(21) not implemented for cuda
- test_flatten_axis1_cuda: same
- test_flatten_axis2_cuda: same
- test_flatten_axis3_cuda: same
- test_flatten_default_axis_cuda: same
- test_flatten_negative_axis1_cuda: same
- test_flatten_negative_axis2_cuda: same
- test_flatten_negative_axis3_cuda: same
- test_flatten_negative_axis4_cuda: same
- test_qlinearmatmul_2D_int8_float16_cpu: QLinearMatMul(21) for onnx not
implemented in ORT yet
- test_qlinearmatmul_2D_int8_float32_cpu: same
- test_qlinearmatmul_2D_uint8_float16_cpu: same
- test_qlinearmatmul_2D_uint8_float32_cpu: same
- test_qlinearmatmul_3D_int8_float16_cpu: same
- test_qlinearmatmul_3D_int8_float32_cpu: same
- test_qlinearmatmul_3D_uint8_float16_cpu: same
- test_qlinearmatmul_3D_uint8_float32_cpu: same
- test_qlinearmatmul_2D_int8_float16_cuda: same
- test_qlinearmatmul_2D_int8_float32_cuda: same
- test_qlinearmatmul_2D_uint8_float16_cuda: same
- test_qlinearmatmul_2D_uint8_float32_cuda: same
- test_qlinearmatmul_3D_int8_float16_cuda: same
- test_qlinearmatmul_3D_int8_float32_cuda: same
- test_qlinearmatmul_3D_uint8_float16_cuda: same
- test_qlinearmatmul_3D_uint8_float32_cuda: same
- test_size_cuda: Size(21) not implemented for cuda
- test_size_example_cuda: same
- test_dequantizelinear_blocked: Missing implementation for block
dequant for DequantizeLinear(21)
- test_quantizelinear_blocked_asymmetric: Missing implementation for
block quant for QuantizeLinear(21)
- test_quantizelinear_blocked_symmetric: Missing implementation for
block quant for QuantizeLinear(21)
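The blocked DequantizeLinear(21) tests above exercise per-block quantization parameters. The NumPy sketch below shows what such an implementation would compute. It reflects our reading of the opset-21 semantics, not ORT code, and `dequantize_blocked` is a hypothetical helper. The assumed layout: `x_scale` and `x_zero_point` have the same rank as `x`, and along `axis` they hold one entry per block of `block_size` consecutive elements. The last block may be partial.

```python
import numpy as np

def dequantize_blocked(x, x_scale, x_zero_point, axis, block_size):
    """Sketch of blocked DequantizeLinear: one scale/zero point per block of
    `block_size` consecutive elements along `axis` (hypothetical helper)."""
    n = x.shape[axis]
    idx = np.arange(n)
    # Repeat each per-block parameter block_size times along `axis`, then trim
    # to n elements, because the last block may be shorter than block_size.
    scale = np.take(np.repeat(x_scale, block_size, axis=axis), idx, axis=axis)
    zero_point = np.take(np.repeat(x_zero_point, block_size, axis=axis), idx, axis=axis)
    # Widen to int32 before subtracting so uint8 values cannot wrap around.
    diff = x.astype(np.int32) - zero_point.astype(np.int32)
    return diff.astype(np.float32) * scale

x = np.array([[0, 2, 4, 6, 8]], dtype=np.uint8)          # 5 elements on axis 1
x_scale = np.array([[0.5, 2.0, 1.0]], dtype=np.float32)  # ceil(5 / 2) = 3 blocks
x_zero_point = np.array([[0, 2, 8]], dtype=np.uint8)
y = dequantize_blocked(x, x_scale, x_zero_point, axis=1, block_size=2)
print(y)  # [[0. 1. 4. 8. 0.]]
```

The example uses a block size of 2 over 5 elements, so the third block covers only one element. That case is where a naive reshape-based implementation would fail.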
---------
Signed-off-by: liqunfu <liqun.fu@microsoft.com>
Signed-off-by: Ganesan Ramalingam <grama@microsoft.com>
Co-authored-by: Ganesan Ramalingam <grama@microsoft.com>
Co-authored-by: George Wu <jywu@microsoft.com>
Co-authored-by: adrianlizarraga <adlizarraga@microsoft.com>
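The disabled `test_maxpool_2d_ceil_output_size_reduce_by_one` test checks the MaxPool output size under `ceil_mode`. The sketch below assumes the PyTorch-style correction that, as we understand it, the linked ONNX PR adopted: if the last window would start in the trailing padding, it is dropped and the output shrinks by one. `maxpool_output_dim` is a hypothetical helper, not an ORT or ONNX function.

```python
import math

def maxpool_output_dim(in_dim, kernel, stride, pad_begin=0, pad_end=0,
                       dilation=1, ceil_mode=False):
    """Output length of one spatial axis of MaxPool (hypothetical helper)."""
    effective_kernel = (kernel - 1) * dilation + 1
    span = in_dim + pad_begin + pad_end - effective_kernel
    rounding = math.ceil if ceil_mode else math.floor
    out = rounding(span / stride) + 1
    # With ceil_mode, the last window must start inside the input (or its
    # leading padding); a window starting in the trailing padding is dropped.
    if ceil_mode and (out - 1) * stride >= in_dim + pad_begin:
        out -= 1
    return out

print(maxpool_output_dim(2, kernel=1, stride=2, ceil_mode=True))  # 1 (ceil alone gives 2)
print(maxpool_output_dim(5, kernel=2, stride=2, ceil_mode=True))  # 3
```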
2024-04-12 16:46:49 +00:00
|Transpose|*in* data:**T**<br> *out* transposed:**T**|21+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[13, 20]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
2021-04-26 20:38:40 +00:00
|||[1, 12]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
2021-06-02 07:47:40 +00:00
|Trilu|*in* input:**T**<br> *in* k:**tensor(int64)**<br> *out* output:**T**|14+|**T** = tensor(double), tensor(float), tensor(int64)|
2023-07-12 03:24:14 +00:00
|Unique|*in* X:**T**<br> *out* Y:**T**<br> *out* indices:**tensor(int64)**<br> *out* inverse_indices:**tensor(int64)**<br> *out* counts:**tensor(int64)**|11+|**T** = tensor(double), tensor(float), tensor(int64), tensor(int8), tensor(string)|
2024-04-12 16:46:49 +00:00
|Unsqueeze|*in* data:**T**<br> *in* axes:**tensor(int64)**<br> *out* expanded:**T**<br><br> or<br><br> *in* data:**T**<br> *out* expanded:**T**|21+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[13, 20]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
2021-04-26 20:38:40 +00:00
|||[11, 12]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
2020-09-02 22:07:50 +00:00
|||[1, 10]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
2021-12-17 23:36:09 +00:00
|Upsample|*in* X:**T**<br> *in* scales:**tensor(float)**<br> *out* Y:**T**<br><br> or<br><br> *in* X:**T**<br> *out* Y:**T**|9|**T** = tensor(float), tensor(int32), tensor(int8), tensor(uint8)|
|||[7, 8]|**T** = tensor(float), tensor(int32), tensor(int8), tensor(uint8)|
2022-03-08 05:10:55 +00:00
|Where|*in* condition:**B**<br> *in* X:**T**<br> *in* Y:**T**<br> *out* output:**T**|16+|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64), tensor(string), tensor(uint8)|
|||[9, 15]|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64), tensor(string), tensor(uint8)|
2021-06-02 07:47:40 +00:00
|Xor|*in* A:**T**<br> *in* B:**T**<br> *out* C:**T1**|7+|**T** = tensor(bool)<br/> **T1** = tensor(bool)|
| |
| |
|**Operator Domain:** *ai.onnx.ml* ||||
|ArrayFeatureExtractor|*in* X:**T**<br> *in* Y:**tensor(int64)**<br> *out* Z:**T**|1+|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64), tensor(string)|
|Binarizer|*in* X:**T**<br> *out* Y:**T**|1+|**T** = tensor(float)|
|CastMap|*in* X:**T1**<br> *out* Y:**T2**|1+|**T1** = map(int64,tensor(float)), map(int64,tensor(string))<br/> **T2** = tensor(float), tensor(int64), tensor(string)|
|CategoryMapper|*in* X:**T1**<br> *out* Y:**T2**|1+|**T1** = tensor(int64), tensor(string)<br/> **T2** = tensor(int64), tensor(string)|
|DictVectorizer|*in* X:**T1**<br> *out* Y:**T2**|1+|**T1** = map(int64,tensor(double)), map(int64,tensor(float)), map(int64,tensor(string)), map(string,tensor(double)), map(string,tensor(float)), map(string,tensor(int64))<br/> **T2** = tensor(double), tensor(float), tensor(int64), tensor(string)|
|FeatureVectorizer|*in* X:**T1**<br> *out* Y:**tensor(float)**|1+|**T1** = tensor(double), tensor(float), tensor(int32), tensor(int64)|
|Imputer|*in* X:**T**<br> *out* Y:**T**|1+|**T** = tensor(float), tensor(int64)|
2024-01-12 20:43:44 +00:00
|LabelEncoder|*in* X:**T1**<br> *out* Y:**T2**|4+|**T1** = tensor(double), tensor(float), tensor(int64), tensor(string)<br/> **T2** = tensor(double), tensor(float), tensor(int16), tensor(int64), tensor(string)|
|||[2, 3]|**T1** = tensor(float), tensor(int64), tensor(string)<br/> **T2** = tensor(float), tensor(int64), tensor(string)|
2021-06-02 07:47:40 +00:00
|||1|**T1** = tensor(int64), tensor(string)<br/> **T2** = tensor(int64), tensor(string)|
|LinearClassifier|*in* X:**T1**<br> *out* Y:**T2**<br> *out* Z:**tensor(float)**|1+|**T1** = tensor(double), tensor(float), tensor(int32), tensor(int64)<br/> **T2** = tensor(int64), tensor(string)|
|LinearRegressor|*in* X:**T**<br> *out* Y:**tensor(float)**|1+|**T** = tensor(float)|
|Normalizer|*in* X:**T**<br> *out* Y:**tensor(float)**|1+|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)|
|OneHotEncoder|*in* X:**T**<br> *out* Y:**tensor(float)**|1+|**T** = tensor(double), tensor(float), tensor(int64), tensor(string)|
|SVMClassifier|*in* X:**T1**<br> *out* Y:**T2**<br> *out* Z:**tensor(float)**|1+|**T1** = tensor(double), tensor(float), tensor(int32), tensor(int64)<br/> **T2** = tensor(int64), tensor(string)|
|SVMRegressor|*in* X:**T**<br> *out* Y:**tensor(float)**|1+|**T** = tensor(float)|
|Scaler|*in* X:**T**<br> *out* Y:**tensor(float)**|1+|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)|
2022-03-30 10:53:12 +00:00
|TreeEnsembleClassifier|*in* X:**T1**<br> *out* Y:**T2**<br> *out* Z:**tensor(float)**|3+|**T1** = tensor(double), tensor(float), tensor(int32), tensor(int64)<br/> **T2** = tensor(int64), tensor(string)|
|||[1, 2]|**T1** = tensor(double), tensor(float), tensor(int32), tensor(int64)<br/> **T2** = tensor(int64), tensor(string)|
|TreeEnsembleRegressor|*in* X:**T**<br> *out* Y:**tensor(float)**|3+|**T** = tensor(double), tensor(float)|
|||[1, 2]|**T** = tensor(double), tensor(float)|
2021-06-02 07:47:40 +00:00
|ZipMap|*in* X:**tensor(float)**<br> *out* Z:**T**|1+|**T** = seq(map(int64,tensor(float))), seq(map(string,tensor(float)))|
2019-08-15 01:12:24 +00:00
| |
| |
2020-09-02 22:07:50 +00:00
|**Operator Domain:** *com.microsoft* ||||
2023-02-07 19:51:06 +00:00
|Attention|*in* input:**T**<br> *in* weights:**T**<br> *in* bias:**T**<br> *in* mask_index:**M**<br> *in* past:**T**<br> *in* relative_position_bias:**T**<br> *in* past_sequence_length:**M**<br> *out* output:**T**<br> *out* present:**T**|1+|**T** = tensor(float)|
2021-06-02 07:47:40 +00:00
|AttnLSTM|*in* X:**T**<br> *in* W:**T**<br> *in* R:**T**<br> *in* B:**T**<br> *in* sequence_lens:**T1**<br> *in* initial_h:**T**<br> *in* initial_c:**T**<br> *in* P:**T**<br> *in* QW:**T**<br> *in* MW:**T**<br> *in* V:**T**<br> *in* M:**T**<br> *in* memory_seq_lens:**T1**<br> *in* AW:**T**<br> *out* Y:**T**<br> *out* Y_h:**T**<br> *out* Y_c:**T**|1+|**T** = tensor(double), tensor(float)<br/> **T1** = tensor(int32)|
2023-05-17 04:40:00 +00:00
|BeamSearch|*in* input_ids:**F**<br> *in* max_length:**I**<br> *in* min_length:**I**<br> *in* num_beams:**I**<br> *in* num_return_sequences:**I**<br> *in* length_penalty:**T**<br> *in* repetition_penalty:**T**<br> *in* vocab_mask:**M**<br> *in* prefix_vocab_mask:**M**<br> *in* attention_mask:**I**<br> *in* decoder_input_ids:**I**<br> *in* logits_processor:**I**<br> *out* sequences:**I**<br> *out* sequences_scores:**T**<br> *out* scores:**T**|1+|**T** = tensor(float)|
2021-06-02 07:47:40 +00:00
|BiasGelu|*in* A:**T**<br> *in* B:**T**<br> *out* C:**T**|1+|**T** = tensor(float)|
2021-10-20 02:53:56 +00:00
|BifurcationDetector|*in* src_tokens:**T**<br> *in* cur_tokens:**T**<br> *in* prev_suffix_match_idx:**T**<br> *in* pred_tokens:**T**<br> *out* tokens:**T**<br> *out* suffix_match_idx:**T**|1+|**T** = tensor(int64)|
2021-06-02 07:47:40 +00:00
|CDist|*in* A:**T**<br> *in* B:**T**<br> *out* C:**T**|1+|**T** = tensor(double), tensor(float)|
|ConvTransposeWithDynamicPads|*in* X:**T**<br> *in* W:**T**<br> *in* Pads:**tensor(int64)**<br> *in* B:**T**<br> *out* Y:**T**|1+|**T** = tensor(float)|
2022-02-18 06:55:32 +00:00
|CropAndResize|*in* X:**T1**<br> *in* rois:**T1**<br> *in* batch_indices:**T2**<br> *in* crop_size:**T2**<br> *out* Y:**T1**|1+|**T1** = tensor(float)<br/> **T2** = tensor(int32)|
[QNN/CPU EP] Add 16-bit Quantize/Dequantize contrib ops (#17015)
### Description
- Adds 16-bit integer support to:
  - Quantization kernel implementations: Intel, Neon, and Power intrinsics
  - DequantizeLinear and QuantizeLinear contrib ops
  - QNN EP Quantize and Dequantize operators
  - Python quantization scripts
- Disables QDQ fusions for most 16-bit QDQ node groups (need to add
16-bit support to QLinear* ops)
- Retains support for dropping QDQ nodes from Split, Gather, Reshape,
Transpose, Squeeze, and Unsqueeze node groups.
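Dropping the QDQ nodes around data-movement ops such as Transpose is safe because these ops only rearrange values. Suppose the DequantizeLinear and QuantizeLinear share one per-tensor scale and zero point. Then Q(Transpose(DQ(q))) reproduces Transpose(q) exactly. The minimal NumPy check below illustrates that identity for 16-bit data. The helpers `quantize`/`dequantize` are illustrative stand-ins, not ORT's optimizer code.

```python
import numpy as np

def quantize(x, scale, zero_point, dtype=np.uint16):
    """QuantizeLinear-style: round half to even, add zero point, saturate."""
    info = np.iinfo(dtype)
    q = np.rint(x / scale) + zero_point
    return np.clip(q, info.min, info.max).astype(dtype)

def dequantize(q, scale, zero_point):
    """DequantizeLinear-style: (q - zero_point) * scale, computed in float32."""
    return (q.astype(np.float32) - np.float32(zero_point)) * np.float32(scale)

scale, zero_point = np.float32(0.01), 32768  # one per-tensor pair shared by DQ and Q
q = np.random.default_rng(0).integers(0, 65536, size=(2, 3, 4)).astype(np.uint16)

# DQ -> Transpose -> Q, as in the original node group ...
with_qdq = quantize(dequantize(q, scale, zero_point).transpose(2, 0, 1), scale, zero_point)
# ... versus Transpose applied directly to the quantized tensor.
dropped = q.transpose(2, 0, 1)
print(np.array_equal(with_qdq, dropped))  # True
```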
Sample Python code to generate a QDQ model with 16-bit activations and
8-bit weights:
```python
from onnxruntime.quantization import QuantType, quantize_static

# input_model_path, output_model_path, data_reader (a CalibrationDataReader)
# and args are assumed to be defined by the calling calibration script.
quantize_static(
    input_model_path,
    output_model_path,
    data_reader,
    quant_format=args.quant_format,
    per_channel=args.per_channel,
    activation_type=QuantType.QUInt16,  # 16-bit activations
    weight_type=QuantType.QUInt8,  # 8-bit weights
    extra_options={"DedicatedQDQPair": True, "ForceQuantizeNoInputCheck": True, "UseQDQContribOps": True},
)
```
Note that enabling the `UseQDQContribOps` extra option is not strictly
necessary. If 16-bit types are used without enabling
`UseQDQContribOps`, the domains of the QDQ ops are overridden to
'com.microsoft', and a warning is printed to stdout.
### Automated Tests
MLAS/CPU EP:
- [x] 16-bit QuantizeLinear computation
- [x] 16-bit DequantizeLinear computation
Optimizer:
- [x] Transpose QDQ fusion
- [x] Gather QDQ fusion
- [x] Reshape QDQ fusion
- [x] Squeeze QDQ fusion
- [x] Unsqueeze QDQ fusion
- [x] Split drop QDQ
- [x] DoubleQDQPairRemover
- [x] Transpose optimization
- [x] EnsureUniqueDQForNodeUnit
- [x] Common subexpression elimination (DQ not removed)
- [x] Constant folding
QNN EP:
- [x] Conv 16-bit activations, 8-bit weights
- [x] MatMul 16-bit activations, 8-bit weights
- [x] Unary 16-bit QDQ ops
- [x] Binary 16-bit QDQ ops
Quantization tool:
- [x] Test creation of 16-bit QDQ model
### Motivation and Context
Support mixed-precision models (8-bit weights, 16-bit activations).
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
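The mixed-precision motivation can be put in numbers. With asymmetric quantization over the same float range, the worst-case round-trip error is about half the scale. Moving from 8-bit to 16-bit activations shrinks the scale by a factor of 65535 / 255 = 257. The small NumPy sketch below illustrates this; the value range [-3, 5] and the helper names are illustrative assumptions, not taken from the ORT quantization tool.

```python
import numpy as np

def asymmetric_params(x_min, x_max, qmin, qmax):
    """Scale and zero point mapping [x_min, x_max] onto [qmin, qmax]."""
    # Widen the range to include 0.0 so that zero is exactly representable.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    return scale, zero_point

def max_round_trip_error(x, qmin, qmax):
    """Largest |dequantize(quantize(x)) - x| over the samples in x."""
    scale, zero_point = asymmetric_params(float(x.min()), float(x.max()), qmin, qmax)
    q = np.clip(np.rint(x / scale) + zero_point, qmin, qmax)
    x_hat = (q - zero_point) * scale
    return float(np.max(np.abs(x_hat - x))), scale

x = np.linspace(-3.0, 5.0, 10_001)                  # stand-in activation values
err8, scale8 = max_round_trip_error(x, 0, 255)      # uint8 activations
err16, scale16 = max_round_trip_error(x, 0, 65535)  # uint16 activations
print(f"uint8:  scale={scale8:.3g}  max error={err8:.3g}")
print(f"uint16: scale={scale16:.3g}  max error={err16:.3g}")
```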
2023-09-18 16:43:34 +00:00
|DequantizeLinear|*in* x:**T1**<br> *in* x_scale:**T2**<br> *in* x_zero_point:**T1**<br> *out* y:**T2**|1+|**T1** = tensor(int16), tensor(int32), tensor(int8), tensor(uint16), tensor(uint8)<br/> **T2** = tensor(float)|
2021-06-02 07:47:40 +00:00
|DynamicQuantizeLSTM|*in* X:**T**<br> *in* W:**T2**<br> *in* R:**T2**<br> *in* B:**T**<br> *in* sequence_lens:**T1**<br> *in* initial_h:**T**<br> *in* initial_c:**T**<br> *in* P:**T**<br> *in* W_scale:**T**<br> *in* W_zero_point:**T2**<br> *in* R_scale:**T**<br> *in* R_zero_point:**T2**<br> *out* Y:**T**<br> *out* Y_h:**T**<br> *out* Y_c:**T**|1+|**T** = tensor(float)<br/> **T1** = tensor(int32)<br/> **T2** = tensor(int8), tensor(uint8)|
|DynamicQuantizeMatMul|*in* A:**T1**<br> *in* B:**T2**<br> *in* b_scale:**T1**<br> *in* b_zero_point:**T2**<br> *in* bias:**T1**<br> *out* Y:**T1**|1+|**T1** = tensor(float)<br/> **T2** = tensor(int8), tensor(uint8)|
2021-10-28 18:06:26 +00:00
|EmbedLayerNormalization|*in* input_ids:**T1**<br> *in* segment_ids:**T1**<br> *in* word_embedding:**T**<br> *in* position_embedding:**T**<br> *in* segment_embedding:**T**<br> *in* gamma:**T**<br> *in* beta:**T**<br> *in* mask:**T1**<br> *in* position_ids:**T1**<br> *out* output:**T**<br> *out* mask_index:**T1**<br> *out* embedding_sum:**T**|1+|**T** = tensor(float)|
2021-06-02 07:47:40 +00:00
|ExpandDims|*in* X:**T**<br> *in* axis:**tensor(int32)**<br> *out* Y:**T**|1+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **axis** = tensor(int32)|
|FastGelu|*in* X:**T**<br> *in* bias:**T**<br> *out* Y:**T**|1+|**T** = tensor(float)|
|FusedConv|*in* X:**T**<br> *in* W:**T**<br> *in* B:**T**<br> *in* Z:**T**<br> *out* Y:**T**|1+|**T** = tensor(float)|
|FusedGemm|*in* A:**T**<br> *in* B:**T**<br> *in* C:**T**<br> *out* Y:**T**|1+|**T** = tensor(float)|
|FusedMatMul|*in* A:**T**<br> *in* B:**T**<br> *out* Y:**T**|1+|**T** = tensor(float)|
|GatherND|*in* data:**T**<br> *in* indices:**Tind**<br> *out* output:**T**|1+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **Tind** = tensor(int32), tensor(int64)|
|Gelu|*in* X:**T**<br> *out* Y:**T**|1+|**T** = tensor(float)|
2022-10-21 22:00:18 +00:00
|GreedySearch|*in* input_ids:**I**<br> *in* max_length:**I**<br> *in* min_length:**I**<br> *in* repetition_penalty:**T**<br> *in* vocab_mask:**I**<br> *in* prefix_vocab_mask:**I**<br> *in* attention_mask:**I**<br> *out* sequences:**I**|1+|**T** = tensor(float)|
2022-02-18 06:55:32 +00:00
|GridSample|*in* X:**T1**<br> *in* Grid:**T1**<br> *out* Y:**T2**|1+|**T1** = tensor(float)<br/> **T2** = tensor(float)|
2024-04-23 02:57:05 +00:00
|GroupQueryAttention|*in* query:**T**<br> *in* key:**T**<br> *in* value:**T**<br> *in* past_key:**T**<br> *in* past_value:**T**<br> *in* seqlens_k:**M**<br> *in* total_sequence_length:**M**<br> *in* cos_cache:**T**<br> *in* sin_cache:**T**<br> *out* output:**T**<br> *out* present_key:**T**<br> *out* present_value:**T**|1+|**M** = tensor(int32)<br/> **T** = tensor(float)|
2021-06-02 07:47:40 +00:00
|Inverse|*in* X:**T**<br> *out* Y:**T**|1+|**T** = tensor(double), tensor(float), tensor(float16)|
2023-10-25 22:34:58 +00:00
|MatMulBnb4|*in* A:**T1**<br> *in* B:**T2**<br> *in* absmax:**T1**<br> *out* Y:**T1**|1+|**T1** = tensor(float)<br/> **T2** = tensor(uint8)|
2023-08-07 19:23:55 +00:00
|MatMulFpQ4|*in* A:**T1**<br> *in* B:**T2**<br> *in* B_shape:**T3**<br> *out* Y:**T1**|1+|**T1** = tensor(float)<br/> **T2** = tensor(uint8)<br/> **T3** = tensor(int64)|
2021-06-02 07:47:40 +00:00
|MatMulInteger16|*in* A:**T1**<br> *in* B:**T2**<br> *out* Y:**T3**|1+|**T1** = tensor(int16)<br/> **T2** = tensor(int16)<br/> **T3** = tensor(int32)|
2021-12-10 19:33:19 +00:00
|MatMulIntegerToFloat|*in* A:**T1**<br> *in* B:**T2**<br> *in* a_scale:**T3**<br> *in* b_scale:**T3**<br> *in* a_zero_point:**T1**<br> *in* b_zero_point:**T2**<br> *in* bias:**T3**<br> *out* Y:**T3**|1+|**T1** = tensor(int8), tensor(uint8)<br/> **T2** = tensor(int8), tensor(uint8)<br/> **T3** = tensor(float)|
2024-03-05 03:45:45 +00:00
|MatMulNBits|*in* A:**T1**<br> *in* B:**T2**<br> *in* scales:**T1**<br> *in* zero_points:**T3**<br> *in* g_idx:**T4**<br> *out* Y:**T1**|1+|**T1** = tensor(float)<br/> **T2** = tensor(uint8)<br/> **T3** = tensor(float), tensor(uint8)<br/> **T4** = tensor(int32)|
2022-09-20 21:24:59 +00:00
|MaxpoolWithMask|*in* X:**T**<br> *in* M:**tensor(int32)**<br> *out* Y:**T**|1+|**T** = tensor(float)|
Whisper Model Optimization (#15473)
### Description
This PR contains fusion-level and kernel-level optimizations for
[OpenAI's Whisper](https://github.com/openai/whisper).
Some of the added optimizations include:
- Pruning of duplicate/unnecessary inputs and outputs
- Fusion support for Whisper models with or without these inputs/outputs
(e.g. with these inputs/outputs if exporting with an older official
Optimum version, without these inputs/outputs if exporting with Optimum
from source)
- Attention fusions
  - For Whisper's encoder and decoder
  - Modified symbolic shape inference for present output when no past
input exists (for decoder)
- Multi-head attention fusions
  - For Whisper's decoder and decoder with past
  - Packed MatMul for the 3 MatMuls excluded in multi-head attention
fusion
- Attention kernel changes
  - CPU:
    - Different Q and KV sequence lengths
    - Parallel memset for large sequence lengths
    - Convert broadcast add after MatMul of Q and K (add_qk) to element-wise
add
    - Separate present key-value output into present key and present value
(for multi-head attention spec)
  - CUDA:
    - Use memory efficient attention compute kernel with present state (for
decoder)
- Multi-head attention kernel changes
  - CPU:
    - Introduction of multi-head attention CPU kernel (previously did not
exist)
    - Use AddBiasReshape instead of AddBiasTranspose when sequence length =
1 (for decoder with past)
    - Different Q, K, V input shapes
    - Pass past key and past value directly as key and value
  - CUDA:
    - Use memory efficient attention compute kernel with past and/or present
state (for decoder with past)
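The reference math behind the CPU attention changes above can be summarized in a short sketch. The changes in question are different Q and KV sequence lengths, past key/value concatenation, and separate present key and value outputs. This is a NumPy sketch under stated assumptions, not ORT's kernel. It omits bias, masks, and relative position bias. It assumes `query`/`key`/`value` in (batch, sequence, hidden) layout and `past_key`/`past_value` in (batch, num_heads, past_sequence, head_size) layout.

```python
import numpy as np

def split_heads(x, num_heads):
    """(batch, seq, num_heads * head_size) -> (batch, num_heads, seq, head_size)."""
    batch, seq, hidden = x.shape
    return x.reshape(batch, seq, num_heads, hidden // num_heads).transpose(0, 2, 1, 3)

def multi_head_attention(query, key, value, num_heads, past_key=None, past_value=None):
    q = split_heads(query, num_heads)
    k = split_heads(key, num_heads)
    v = split_heads(value, num_heads)
    if past_key is not None:
        # Decoder with past: append the new key/value steps to the cached ones.
        k = np.concatenate([past_key, k], axis=2)
        v = np.concatenate([past_value, v], axis=2)
    head_size = q.shape[-1]
    # (batch, heads, q_seq, kv_seq): the Q and KV lengths need not match.
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(head_size)
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    context = probs @ v  # (batch, heads, q_seq, head_size)
    batch, heads, q_seq, _ = context.shape
    output = context.transpose(0, 2, 1, 3).reshape(batch, q_seq, heads * head_size)
    # present_key / present_value are returned separately rather than stacked.
    return output, k, v

rng = np.random.default_rng(0)
batch, num_heads, head_size = 1, 2, 4
hidden = num_heads * head_size
# Decoder-with-past step: one new token attending over 3 cached steps plus itself.
query = rng.standard_normal((batch, 1, hidden))
key = rng.standard_normal((batch, 1, hidden))
value = rng.standard_normal((batch, 1, hidden))
past_key = rng.standard_normal((batch, num_heads, 3, head_size))
past_value = rng.standard_normal((batch, num_heads, 3, head_size))
output, present_key, present_value = multi_head_attention(
    query, key, value, num_heads, past_key, past_value)
print(output.shape, present_key.shape)  # (1, 1, 8) (1, 2, 4, 4)
```

For cross-attention, as in the Whisper decoder, the same function applies with `key`/`value` taken from the encoder output. For example, a query of length 2 can attend over 5 encoder steps.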
### Usage
To use the optimizations, run the ORT transformer optimizer script as
follows:
```console
$ cd onnxruntime/onnxruntime/python/tools/transformers/
$ python3 optimizer.py --input <filename>.onnx --output <filename>.onnx --model_type bart --num_heads <number of attention heads, depends on the size of the whisper model used> --hidden_size <attention hidden size, depends on the size of the whisper model used> --use_external_data_format --use_multi_head_attention
```
Once optimized, here's an example of how to run Whisper with [Hugging
Face's Optimum](https://github.com/huggingface/optimum):
```python
from transformers.onnx.utils import get_preprocessor
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from optimum.pipelines import pipeline as ort_pipeline
import whisper  # Installed from OpenAI's repo - setup instructions at https://github.com/openai/whisper/

directory = './whisper_opt'  # Where the optimized ONNX models are located
model_name = 'openai/whisper-tiny'
device = 'cpu'  # or 'cuda' to run on GPU

# Get pipeline
processor = get_preprocessor(model_name)
model = ORTModelForSpeechSeq2Seq.from_pretrained(
    directory,
    use_io_binding=(device == 'cuda'),
    provider='CPUExecutionProvider',
).to(device)
pipe = ort_pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=(-1 if device == 'cpu' else 0),
)

# Load audio file and run pipeline
audio = whisper.load_audio('tests/jfk.flac')
audio = whisper.pad_or_trim(audio)
outputs = pipe([audio])
print(outputs)
```
Note: To use these changes with Optimum, install Optimum from source so
that it includes the following changes:
- https://github.com/huggingface/optimum/pull/872
- https://github.com/huggingface/optimum/pull/920
### Motivation and Context
This PR helps address the following issues:
- https://github.com/microsoft/onnxruntime/issues/15100
- https://github.com/microsoft/onnxruntime/issues/15235
- https://github.com/huggingface/optimum/issues/869 (work in progress)
This PR can be used with the other currently merged Whisper PRs:
- https://github.com/microsoft/onnxruntime/pull/15247
- https://github.com/microsoft/onnxruntime/pull/15339
- https://github.com/microsoft/onnxruntime/pull/15362
- https://github.com/microsoft/onnxruntime/pull/15365
- https://github.com/microsoft/onnxruntime/pull/15427
This PR uses changes from the following merged PRs:
- https://github.com/microsoft/onnxruntime/pull/14198
- https://github.com/microsoft/onnxruntime/pull/14146
- https://github.com/microsoft/onnxruntime/pull/14201
- https://github.com/microsoft/onnxruntime/pull/14928 (this introduced
the new multi-head attention spec)
2023-04-19 00:13:54 +00:00
|MultiHeadAttention|*in* query:**T**<br> *in* key:**T**<br> *in* value:**T**<br> *in* bias:**T**<br> *in* key_padding_mask:**M**<br> *in* relative_position_bias:**T**<br> *in* past_key:**T**<br> *in* past_value:**T**<br> *out* output:**T**<br> *out* present_key:**T**<br> *out* present_value:**T**|1+|**T** = tensor(float)|
2021-06-02 07:47:40 +00:00
|MurmurHash3|*in* X:**T1**<br> *out* Y:**T2**|1+|**T1** = tensor(double), tensor(float), tensor(int32), tensor(int64), tensor(string), tensor(uint32), tensor(uint64)<br/> **T2** = tensor(int32), tensor(uint32)|
2021-06-21 17:21:48 +00:00
|NGramRepeatBlock|*in* input_ids:**Tid**<br> *in* scores:**T**<br> *out* scores_out:**T**|1+|**T** = tensor(float)<br/> **Tid** = tensor(int64)|
2021-11-30 02:43:43 +00:00
|NhwcMaxPool|*in* x:**T**<br> *out* y:**T**|1+|**T** = tensor(int8), tensor(uint8)|
2021-06-02 07:47:40 +00:00
|Pad|*in* data:**T**<br> *in* pads:**tensor(int64)**<br> *in* value:**T**<br> *out* output:**T**|1+|**T** = tensor(float)|
|QAttention|*in* input:**T1**<br> *in* weight:**T2**<br> *in* bias:**T3**<br> *in* input_scale:**T3**<br> *in* weight_scale:**T3**<br> *in* mask_index:**T4**<br> *in* input_zero_point:**T1**<br> *in* weight_zero_point:**T2**<br> *in* past:**T3**<br> *out* output:**T3**<br> *out* present:**T3**|1+|**T1** = tensor(uint8)<br/> **T2** = tensor(int8), tensor(uint8)<br/> **T3** = tensor(float)<br/> **T4** = tensor(int32)|
2021-06-25 22:51:43 +00:00
|QEmbedLayerNormalization|*in* input_ids:**T1**<br> *in* segment_ids:**T1**<br> *in* word_embedding_quant:**T2**<br> *in* position_embedding_quant:**T2**<br> *in* segment_embedding:**T2**<br> *in* gamma_quant:**T2**<br> *in* beta_quant:**T2**<br> *in* mask:**T1**<br> *in* word_embedding_scale:**T**<br> *in* position_embedding_scale:**T**<br> *in* segment_embedding_scale:**T**<br> *in* gamma_scale:**T**<br> *in* beta_scale:**T**<br> *in* word_embedding_zero_point:**T2**<br> *in* position_embedding_zero_point:**T2**<br> *in* segment_embedding_zero_point:**T2**<br> *in* gamma_zero_point:**T2**<br> *in* beta_zero_point:**T2**<br> *out* layernorm_out:**T**<br> *out* mask_index_out:**T1**|1+|**T** = tensor(float)|
2022-02-02 18:35:29 +00:00
|QGemm|*in* A:**TA**<br> *in* a_scale:**T**<br> *in* a_zero_point:**TA**<br> *in* B:**TB**<br> *in* b_scale:**T**<br> *in* b_zero_point:**TB**<br> *in* C:**TC**<br> *in* y_scale:**T**<br> *in* y_zero_point:**TYZ**<br> *out* Y:**TY**|1+|**T** = tensor(float)<br/> **TA** = tensor(int8), tensor(uint8)<br/> **TB** = tensor(int8), tensor(uint8)<br/> **TC** = tensor(int32)<br/> **TY** = tensor(float), tensor(int8), tensor(uint8)<br/> **TYZ** = tensor(int8), tensor(uint8)|
2021-06-02 07:47:40 +00:00
|QLinearAdd|*in* A:**T**<br> *in* A_scale:**tensor(float)**<br> *in* A_zero_point:**T**<br> *in* B:**T**<br> *in* B_scale:**tensor(float)**<br> *in* B_zero_point:**T**<br> *in* C_scale:**tensor(float)**<br> *in* C_zero_point:**T**<br> *out* C:**T**|1+|**T** = tensor(int8), tensor(uint8)|
2021-12-10 19:33:19 +00:00
|QLinearConv|*in* x:**T1**<br> *in* x_scale:**tensor(float)**<br> *in* x_zero_point:**T1**<br> *in* w:**T2**<br> *in* w_scale:**tensor(float)**<br> *in* w_zero_point:**T2**<br> *in* y_scale:**tensor(float)**<br> *in* y_zero_point:**T3**<br> *in* B:**T4**<br> *out* y:**T3**|1+|**T1** = tensor(int8), tensor(uint8)<br/> **T2** = tensor(int8), tensor(uint8)<br/> **T3** = tensor(int8), tensor(uint8)<br/> **T4** = tensor(int32)|
2021-06-02 07:47:40 +00:00
|QLinearLeakyRelu|*in* X:**T**<br> *in* X_scale:**tensor(float)**<br> *in* X_zero_point:**T**<br> *in* Y_scale:**tensor(float)**<br> *in* Y_zero_point:**T**<br> *out* Y:**T**|1+|**T** = tensor(int8), tensor(uint8)|
|QLinearMul|*in* A:**T**<br> *in* A_scale:**tensor(float)**<br> *in* A_zero_point:**T**<br> *in* B:**T**<br> *in* B_scale:**tensor(float)**<br> *in* B_zero_point:**T**<br> *in* C_scale:**tensor(float)**<br> *in* C_zero_point:**T**<br> *out* C:**T**|1+|**T** = tensor(int8), tensor(uint8)|
|QLinearSigmoid|*in* X:**T**<br> *in* X_scale:**tensor(float)**<br> *in* X_zero_point:**T**<br> *in* Y_scale:**tensor(float)**<br> *in* Y_zero_point:**T**<br> *out* Y:**T**|1+|**T** = tensor(int8), tensor(uint8)|
2022-08-10 02:52:02 +00:00
|QLinearSoftmax|*in* X:**T**<br> *in* X_scale:**tensor(float)**<br> *in* x_zero_point:**T**<br> *in* y_scale:**tensor(float)**<br> *in* y_zero_point:**T**<br> *out* Y:**T**|1+|**T** = tensor(int8), tensor(uint8)|
2022-12-12 21:27:47 +00:00
|QLinearWhere|*in* condition:**B**<br> *in* X:**T**<br> *in* x_scale:**TF**<br> *in* x_zero_point:**T**<br> *in* Y:**T**<br> *in* y_scale:**TF**<br> *in* y_zero_point:**T**<br> *in* z_scale:**TF**<br> *in* z_zero_point:**T**<br> *out* Z:**T**|1+|**T** = tensor(int8), tensor(uint8)|
2023-09-18 16:43:34 +00:00
|QuantizeLinear|*in* x:**T1**< br > *in* y_scale:**T1**< br > *in* y_zero_point:**T2**< br > *out* y:**T2**|1+|**T1** = tensor(float)< br /> **T2** = tensor(int16), tensor(int8), tensor(uint16), tensor(uint8)|
|QuickGelu|*in* X:**T**< br > *out* Y:**T**|1+|**T** = tensor(float)|
|Range|*in* start:**T**< br > *in* limit:**T**< br > *in* delta:**T**< br > *out* Y:**T**|1+|**T** = tensor(double), tensor(float), tensor(int16), tensor(int32), tensor(int64)|
|RotaryEmbedding|*in* input:**T**< br > *in* position_ids:**M**< br > *in* cos_cache:**T**< br > *in* sin_cache:**T**< br > *out* output:**T**|1+|**M** = tensor(int64)< br /> **T** = tensor(float)|
|SampleOp|*in* X:**T**< br > *out* Y:**T**|1+|**T** = tensor(float)|
|Sampling|*in* input_ids:**I**< br > *in* max_length:**I**< br > *in* min_length:**I**< br > *in* repetition_penalty:**T**< br > *in* vocab_mask:**I**< br > *in* prefix_vocab_mask:**I**< br > *in* attention_mask:**I**< br > *in* presence_mask:**I**< br > *in* seed:**I**< br > *out* sequences:**I**< br > *out* filtered_logits:**T**|1+|**T** = tensor(float)|
|SkipLayerNormalization|*in* input:**T**< br > *in* skip:**T**< br > *in* gamma:**T**< br > *in* beta:**T**< br > *in* bias:**T**< br > *out* output:**T**< br > *out* mean:**U**< br > *out* inv_std_var:**U**< br > *out* input_skip_bias_sum:**T**|1+|**T** = tensor(double), tensor(float)|
|SkipSimplifiedLayerNormalization|*in* input:**T**< br > *in* skip:**T**< br > *in* gamma:**T**< br > *in* bias:**T**< br > *out* output:**T**< br > *out* mean:**U**< br > *out* inv_std_var:**U**< br > *out* input_skip_bias_sum:**T**|1+|**T** = tensor(double), tensor(float)|
|SparseToDenseMatMul|*in* A:**T**< br > *in* B:**T1**< br > *out* Y:**T1**|1+|**T** = sparse_tensor(double), sparse_tensor(float), sparse_tensor(int32), sparse_tensor(int64), sparse_tensor(uint32), sparse_tensor(uint64)< br /> **T1** = tensor(double), tensor(float), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|Tokenizer|*in* X:**T**< br > *out* Y:**T**|1+|**T** = tensor(string)|
|TransposeMatMul|*in* A:**T**< br > *in* B:**T**< br > *out* Y:**T**|1+|**T** = tensor(float)|
|Trilu|*in* X:**T**< br > *in* k:**tensor(int64)**< br > *out* Y:**T**|1+|**T** = tensor(double), tensor(float), tensor(int64)|
|Unique|*in* x:**T**< br > *out* y:**T**< br > *out* idx:**tensor(int64)**< br > *out* counts:**tensor(int64)**|1+|**T** = tensor(float)|
|WhisperBeamSearch|*in* input_ids:**F**< br > *in* max_length:**I**< br > *in* min_length:**I**< br > *in* num_beams:**I**< br > *in* num_return_sequences:**I**< br > *in* length_penalty:**T**< br > *in* repetition_penalty:**T**< br > *in* vocab_mask:**M**< br > *in* prefix_vocab_mask:**M**< br > *in* attention_mask:**I**< br > *in* decoder_input_ids:**I**< br > *in* logits_processor:**I**< br > *in* cross_qk_layer_head:**I**< br > *in* extra_decoding_ids:**I**< br > *in* temperature:**T**< br > *out* sequences:**I**< br > *out* sequences_scores:**T**< br > *out* scores:**T**< br > *out* cross_qk:**V**< br > *out* non_speech_probs:**T**|1+|**T** = tensor(float)|
|WordConvEmbedding|*in* Sequence:**T**< br > *in* W:**T1**< br > *in* B:**T1**< br > *in* C:**T1**< br > *out* Y:**T1**|1+|**T** = tensor(int32)< br /> **T1** = tensor(float)|
| |
| |
|**Operator Domain:** *com.microsoft.nchwc* ||||
|AveragePool|*in* X:**T**< br > *out* Y:**T**|1+|**T** = tensor(float)|
|Conv|*in* X:**T**< br > *in* W:**T**< br > *in* B:**T**< br > *in* Sum:**T**< br > *out* Y:**T**|1+|**T** = tensor(float)|
|GlobalAveragePool|*in* X:**T**< br > *out* Y:**T**|1+|**T** = tensor(float)|
|GlobalMaxPool|*in* X:**T**< br > *out* Y:**T**|1+|**T** = tensor(float)|
|MaxPool|*in* X:**T**< br > *out* Y:**T**|1+|**T** = tensor(float)|
|ReorderInput|*in* X:**T**< br > *out* Y:**T**|1+|**T** = tensor(float)|
|ReorderOutput|*in* X:**T**< br > *out* Y:**T**|1+|**T** = tensor(float)|
|Upsample|*in* X:**T**< br > *out* Y:**T**|1+|**T** = tensor(float)|
| |
| |
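
An operator can have several rows in these tables, one per registered kernel. The **OpSet Version** column uses three forms:

- `N+` means opset N and later.
- A bare `N` means exactly that opset.
- `[a, b]` is an inclusive range.

The sketch below picks which row applies to a node. It assumes those conventions, and it assumes the lookup key is the since-version of the operator schema the node resolves to. The helper names `parse_opset_cell` and `select_kernel` are hypothetical illustrations, not ONNX Runtime APIs.

```python
import re
from typing import List, Optional, Tuple


def parse_opset_cell(cell: str) -> Tuple[int, Optional[int]]:
    """Parse an 'OpSet Version' cell: '13+', '13' or '[6, 12]'.

    Returns (start, end); end is None for open-ended ranges such as '13+'.
    """
    cell = cell.strip()
    if cell.endswith("+"):
        return int(cell[:-1]), None
    match = re.fullmatch(r"\[(\d+),\s*(\d+)\]", cell)
    if match:
        # Versioned kernel: both bounds are inclusive.
        return int(match.group(1)), int(match.group(2))
    # A single opset version registered on its own.
    version = int(cell)
    return version, version


def select_kernel(cells: List[str], since_version: int) -> Optional[str]:
    """Return the first cell whose range covers since_version, else None."""
    for cell in cells:
        start, end = parse_opset_cell(cell)
        if start <= since_version and (end is None or since_version <= end):
            return cell
    return None
```

Take the rows `14+`, `13`, `[7, 12]` listed for `Add`. A node resolving to Add-7 matches `[7, 12]`, and Add-14 matches `14+`. An ArgMax-13 node finds no row among the CUDA `ArgMax` entries, because those cover only `[1, 11]`.
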

<a name="cudaexecutionprovider"/>

## Operators implemented by CUDAExecutionProvider
| Op Name | Parameters | OpSet Version | Types Supported |
|---------|------------|---------------|-----------------|
|**Operator Domain:** *ai.onnx* ||||
|Abs|*in* X:**T**< br > *out* Y:**T**|13+|**T** = tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[6, 12]|**T** = tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Add|*in* A:**T**< br > *in* B:**T**< br > *out* C:**T**|14+|**T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|||13|**T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|||[7, 12]|**T** = tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|Affine|*in* X:**T**< br > *out* Y:**T**|1+|**T** = tensor(double), tensor(float), tensor(float16)|
|And|*in* A:**T**< br > *in* B:**T**< br > *out* C:**T1**|7+|**T** = tensor(bool)< br /> **T1** = tensor(bool)|
|ArgMax|*in* data:**T**< br > *out* reduced:**tensor(int64)**|[1, 11]|**T** = tensor(double), tensor(float), tensor(float16)|
|ArgMin|*in* data:**T**< br > *out* reduced:**tensor(int64)**|[1, 11]|**T** = tensor(double), tensor(float), tensor(float16)|
|AveragePool|*in* X:**T**< br > *out* Y:**T**|11+|**T** = tensor(double), tensor(float), tensor(float16)|
|||10|**T** = tensor(double), tensor(float), tensor(float16)|
|||[7, 9]|**T** = tensor(double), tensor(float), tensor(float16)|
|BatchNormalization|*in* X:**T**< br > *in* scale:**T**< br > *in* B:**T**< br > *in* input_mean:**U**< br > *in* input_var:**U**< br > *out* Y:**T**< br > *out* running_mean:**U**< br > *out* running_var:**U**< br >< br > or< br >< br > *in* X:**T**< br > *in* scale:**T**< br > *in* B:**T**< br > *in* mean:**T**< br > *in* var:**T**< br > *out* Y:**T**< br > *out* mean:**T**< br > *out* var:**T**< br > *out* saved_mean:**T**< br > *out* saved_var:**T**< br >< br > or< br >< br > *in* X:**T**< br > *in* scale:**T1**< br > *in* B:**T1**< br > *in* input_mean:**T2**< br > *in* input_var:**T2**< br > *out* Y:**T**< br > *out* running_mean:**T2**< br > *out* running_var:**T2**|15+|**T** = tensor(double), tensor(float), tensor(float16)< br /> **T1** = tensor(double), tensor(float), tensor(float16)< br /> **T2** = tensor(double), tensor(float), tensor(float16)|
|||14|**T** = tensor(double), tensor(float), tensor(float16)< br /> **U** = tensor(double), tensor(float), tensor(float16)|
|||[9, 13]|**T** = tensor(double), tensor(float), tensor(float16)|
|||[7, 8]|**T** = tensor(double), tensor(float), tensor(float16)|
|Cast|*in* input:**T1**< br > *out* output:**T2**|19+|**T1** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(float8e4m3fn), tensor(float8e5m2), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)< br /> **T2** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(float8e4m3fn), tensor(float8e5m2), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[13, 18]|**T1** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)< br /> **T2** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(float8e4m3fn), tensor(float8e5m2), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[9, 12]|**T1** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)< br /> **T2** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(float8e4m3fn), tensor(float8e5m2), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[6, 8]|**T1** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)< br /> **T2** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(float8e4m3fn), tensor(float8e5m2), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Ceil|*in* X:**T**< br > *out* Y:**T**|13+|**T** = tensor(double), tensor(float), tensor(float16)|
|||[6, 12]|**T** = tensor(double), tensor(float), tensor(float16)|
|Clip|*in* input:**T**< br > *in* min:**T**< br > *in* max:**T**< br > *out* output:**T**< br >< br > or< br >< br > *in* input:**T**< br > *out* output:**T**|13+|**T** = tensor(double), tensor(float), tensor(float16), tensor(int64), tensor(int8), tensor(uint64), tensor(uint8)|
|||12|**T** = tensor(double), tensor(float), tensor(float16), tensor(int64), tensor(int8), tensor(uint64), tensor(uint8)|
|||11|**T** = tensor(float)|
|||[6, 10]|**T** = tensor(float)|
|Compress|*in* input:**T**< br > *in* condition:**T1**< br > *out* output:**T**|11+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)< br /> **T1** = tensor(bool)|
|||[9, 10]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)< br /> **T1** = tensor(bool)|
|Concat|*in* inputs:**T**< br > *out* concat_result:**T**|13+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[11, 12]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[4, 10]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|ConcatFromSequence|*in* input_sequence:**S**< br > *out* concat_result:**T**|11+|**S** = seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8))|
|ConstantOfShape|*in* input:**T1**< br > *out* output:**T2**|9+|**T1** = tensor(int64)< br /> **T2** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Conv|*in* X:**T**< br > *in* W:**T**< br > *in* B:**T**< br > *out* Y:**T**|11+|**T** = tensor(double), tensor(float), tensor(float16)|
|||[1, 10]|**T** = tensor(double), tensor(float), tensor(float16)|
|ConvTranspose|*in* X:**T**< br > *in* W:**T**< br > *in* B:**T**< br > *out* Y:**T**|11+|**T** = tensor(double), tensor(float), tensor(float16)|
|||[1, 10]|**T** = tensor(double), tensor(float), tensor(float16)|
|Cos|*in* input:**T**< br > *out* output:**T**|7+|**T** = tensor(double), tensor(float), tensor(float16)|
|Crop|*in* input:**T**< br > *out* output:**T**|1+|**T** = tensor(double), tensor(float), tensor(float16)|
|CumSum|*in* x:**T**< br > *in* axis:**T2**< br > *out* y:**T**|14+|**T** = tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)< br /> **T2** = tensor(int32), tensor(int64)|
|||[11, 13]|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)< br /> **T2** = tensor(int32), tensor(int64)|
|DepthToSpace|*in* input:**T**< br > *out* output:**T**|13+|**T** = tensor(double), tensor(float), tensor(float16)|
|||[11, 12]|**T** = tensor(double), tensor(float), tensor(float16)|
|||[1, 10]|**T** = tensor(double), tensor(float), tensor(float16)|
|DequantizeLinear|*in* x:**T**< br > *in* x_scale:**tensor(float)**< br > *in* x_zero_point:**T**< br > *out* y:**tensor(float)**< br >< br > or< br >< br > *in* x:**T1**< br > *in* x_scale:**T2**< br > *in* x_zero_point:**T1**< br > *out* y:**T2**|19+|**T1** = tensor(float8e4m3fn), tensor(float8e5m2), tensor(int8), tensor(uint8)< br /> **T2** = tensor(float), tensor(float16)|
|||[13, 18]|**T** = tensor(int8), tensor(uint8)|
|||[10, 12]|**T** = tensor(int8), tensor(uint8)|
|Div|*in* A:**T**< br > *in* B:**T**< br > *out* C:**T**|14+|**T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|||13|**T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|||[7, 12]|**T** = tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|Dropout|*in* data:**T**< br > *in* ratio:**T1**< br > *in* training_mode:**T2**< br > *out* output:**T**< br > *out* mask:**T2**< br >< br > or< br >< br > *in* data:**T**< br > *out* output:**T**< br > *out* mask:**T**< br >< br > or< br >< br > *in* data:**T**< br > *out* output:**T**< br > *out* mask:**T1**|13+|**T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16)< br /> **T1** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16)< br /> **T2** = tensor(bool)|
|||12|**T** = tensor(double), tensor(float), tensor(float16)< br /> **T1** = tensor(double), tensor(float), tensor(float16)< br /> **T2** = tensor(bool)|
|||[10, 11]|**T** = tensor(double), tensor(float), tensor(float16)< br /> **T1** = tensor(bool)|
|||[7, 9]|**T** = tensor(double), tensor(float), tensor(float16)|
|DynamicSlice|*in* data:**T**< br > *in* starts:**Tind**< br > *in* ends:**Tind**< br > *in* axes:**Tind**< br > *out* output:**T**|1+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)< br /> **Tind** = tensor(int32), tensor(int64)|
|Einsum|*in* Inputs:**T**< br > *out* Output:**T**|12+|**T** = tensor(double), tensor(float), tensor(float16)|
|Elu|*in* X:**T**< br > *out* Y:**T**|6+|**T** = tensor(double), tensor(float), tensor(float16)|
|Equal|*in* A:**T**< br > *in* B:**T**< br > *out* C:**T1**|13+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)< br /> **T1** = tensor(bool)|
|||[11, 12]|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|||[7, 10]|**T** = tensor(bool), tensor(int32), tensor(int64)|
|Erf|*in* input:**T**< br > *out* output:**T**|13+|**T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16)|
|||[9, 12]|**T** = tensor(double), tensor(float), tensor(float16)|
|Exp|*in* input:**T**< br > *out* output:**T**|13+|**T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16)|
|||[6, 12]|**T** = tensor(double), tensor(float), tensor(float16)|
|Expand|*in* input:**T**< br > *in* shape:**tensor(int64)**< br > *out* output:**T**|13+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[8, 12]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|EyeLike|*in* input:**T1**< br > *out* output:**T2**|9+|**T1** = tensor(double), tensor(float), tensor(int32), tensor(int64), tensor(uint64)< br /> **T2** = tensor(double), tensor(float), tensor(int32), tensor(int64), tensor(uint64)|
|Flatten|*in* input:**T**< br > *out* output:**T**|13+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[11, 12]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[9, 10]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[1, 8]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Floor|*in* X:**T**< br > *out* Y:**T**|13+|**T** = tensor(double), tensor(float), tensor(float16)|
|||[6, 12]|**T** = tensor(double), tensor(float), tensor(float16)|
|GRU|*in* X:**T**< br > *in* W:**T**< br > *in* R:**T**< br > *in* B:**T**< br > *in* sequence_lens:**T1**< br > *in* initial_h:**T**< br > *out* Y:**T**< br > *out* Y_h:**T**|14+|**T** = tensor(double), tensor(float), tensor(float16)< br /> **T1** = tensor(int32)|
|||[7, 13]|**T** = tensor(double), tensor(float), tensor(float16)< br /> **T1** = tensor(int32)|
|Gather|*in* data:**T**< br > *in* indices:**Tind**< br > *out* output:**T**|13+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)< br /> **Tind** = tensor(int32), tensor(int64)|
|||[11, 12]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)< br /> **Tind** = tensor(int32), tensor(int64)|
|||[1, 10]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)< br /> **Tind** = tensor(int32), tensor(int64)|
|GatherElements|*in* data:**T**< br > *in* indices:**Tind**< br > *out* output:**T**|13+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)< br /> **Tind** = tensor(int32), tensor(int64)|
|||[11, 12]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)< br /> **Tind** = tensor(int32), tensor(int64)|
|GatherND|*in* data:**T**< br > *in* indices:**tensor(int64)**< br > *out* output:**T**|13+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int64)< br /> **indices** = tensor(int64)|
|||12|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int64)< br /> **indices** = tensor(int64)|
|||11|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int64)< br /> **indices** = tensor(int64)|
|Gelu|*in* X:**T**< br > *out* Y:**T**|20+|**T** = tensor(double), tensor(float), tensor(float16)|
|Gemm|*in* A:**T**< br > *in* B:**T**< br > *in* C:**T**< br > *out* Y:**T**|13+|**T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16)|
|||[11, 12]|**T** = tensor(double), tensor(float), tensor(float16)|
|||[9, 10]|**T** = tensor(double), tensor(float), tensor(float16)|
|||[7, 8]|**T** = tensor(double), tensor(float), tensor(float16)|
|GlobalAveragePool|*in* X:**T**< br > *out* Y:**T**|1+|**T** = tensor(double), tensor(float), tensor(float16)|
|GlobalMaxPool|*in* X:**T**< br > *out* Y:**T**|1+|**T** = tensor(double), tensor(float), tensor(float16)|
|Greater|*in* A:**T**< br > *in* B:**T**< br > *out* C:**T1**|13+|**T** = tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)< br /> **T1** = tensor(bool)|
|||[9, 12]|**T** = tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|||[7, 8]|**T** = tensor(double), tensor(float), tensor(float16)|
|GreaterOrEqual|*in* A:**T**< br > *in* B:**T**< br > *out* C:**T1**|16+|**T** = tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)< br /> **T1** = tensor(bool)|
|||[12, 15]|**T** = tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)< br /> **T1** = tensor(bool)|
|GridSample|*in* X:**T1**< br > *in* grid:**T2**< br > *out* Y:**T1**|16+|**T1** = tensor(float)< br /> **T2** = tensor(float)|
|HardSigmoid|*in* X:**T**< br > *out* Y:**T**|6+|**T** = tensor(double), tensor(float), tensor(float16)|
|Identity|*in* input:**T**< br > *out* output:**T**< br >< br > or< br >< br > *in* input:**V**< br > *out* output:**V**|19+|**V** = seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(float8e4m3fn)), seq(tensor(float8e4m3fnuz)), seq(tensor(float8e5m2)), seq(tensor(float8e5m2fnuz)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8)), tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[14, 18]|**V** = seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8)), tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||13|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[1, 12]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|If|*in* cond:**B**< br > *out* outputs:**V**|19+|**B** = tensor(bool)< br /> **V** = seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(float8e4m3fn)), seq(tensor(float8e4m3fnuz)), seq(tensor(float8e5m2)), seq(tensor(float8e5m2fnuz)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(string)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8)), tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[13, 18]|**B** = tensor(bool)< br /> **V** = seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(string)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8)), tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[11, 12]|**B** = tensor(bool)< br /> **V** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[1, 10]|**B** = tensor(bool)< br /> **V** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|ImageScaler|*in* input:**T**< br > *out* output:**T**|1+|**T** = tensor(double), tensor(float), tensor(float16)|
|InstanceNormalization|*in* input:**T**< br > *in* scale:**T**< br > *in* B:**T**< br > *out* output:**T**|6+|**T** = tensor(double), tensor(float), tensor(float16)|
|IsInf|*in* X:**T1**< br > *out* Y:**T2**|20+|**T1** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16), tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz)< br /> **T2** = tensor(bool)|
|||[10, 19]|**T1** = tensor(double), tensor(float)< br /> **T2** = tensor(bool)|
|IsNaN|*in* X:**T1**< br > *out* Y:**T2**|20+|**T1** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16), tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz)< br /> **T2** = tensor(bool)|
|||[13, 19]|**T1** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16)< br /> **T2** = tensor(bool)|
|||[9, 12]|**T1** = tensor(double), tensor(float), tensor(float16)< br /> **T2** = tensor(bool)|
|LRN|*in* X:**T**< br > *out* Y:**T**|13+|**T** = tensor(double), tensor(float), tensor(float16)|
|||[1, 12]|**T** = tensor(double), tensor(float), tensor(float16)|
|LSTM|*in* X:**T**< br > *in* W:**T**< br > *in* R:**T**< br > *in* B:**T**< br > *in* sequence_lens:**T1**< br > *in* initial_h:**T**< br > *in* initial_c:**T**< br > *in* P:**T**< br > *out* Y:**T**< br > *out* Y_h:**T**< br > *out* Y_c:**T**|14+|**T** = tensor(double), tensor(float), tensor(float16)< br /> **T1** = tensor(int32)|
|||[7, 13]|**T** = tensor(double), tensor(float), tensor(float16)< br /> **T1** = tensor(int32)|
|LayerNormalization|*in* X:**T**< br > *in* Scale:**T**< br > *in* B:**T**< br > *out* Y:**T**< br > *out* Mean:**U**< br > *out* InvStdDev:**U**< br >< br > or< br >< br > *in* X:**T**< br > *in* Scale:**V**< br > *in* B:**V**< br > *out* Y:**V**< br > *out* Mean:**U**< br > *out* InvStdDev:**U**|17+|**T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16)< br /> **U** = tensor(float)|
|||[1, 16]|**T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16)< br /> **U** = tensor(double), tensor(float)< br /> **V** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16)|
|LeakyRelu|*in* X:**T**< br > *out* Y:**T**|16+|**T** = tensor(double), tensor(float), tensor(float16)|
|||[6, 15]|**T** = tensor(double), tensor(float), tensor(float16)|
|Less|*in* A:**T**< br > *in* B:**T**< br > *out* C:**T1**|13+|**T** = tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)< br /> **T1** = tensor(bool)|
|||[9, 12]|**T** = tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|||[7, 8]|**T** = tensor(double), tensor(float), tensor(float16)|
|LessOrEqual|*in* A:**T**< br > *in* B:**T**< br > *out* C:**T1**|16+|**T** = tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)< br /> **T1** = tensor(bool)|
|||[12, 15]|**T** = tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)< br /> **T1** = tensor(bool)|
|Log|*in* input:**T**< br > *out* output:**T**|13+|**T** = tensor(double), tensor(float), tensor(float16)|
|||[6, 12]|**T** = tensor(double), tensor(float), tensor(float16)|
|LogSoftmax|*in* input:**T**< br > *out* output:**T**|13+|**T** = tensor(double), tensor(float), tensor(float16)|
|||[11, 12]|**T** = tensor(double), tensor(float), tensor(float16)|
|||[1, 10]|**T** = tensor(double), tensor(float), tensor(float16)|
|Loop|*in* M:**I**< br > *in* cond:**B**< br > *in* v_initial:**V**< br > *out* v_final_and_scan_outputs:**V**|19+|**B** = tensor(bool)< br /> **I** = tensor(int64)< br /> **V** = seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(float8e4m3fn)), seq(tensor(float8e4m3fnuz)), seq(tensor(float8e5m2)), seq(tensor(float8e5m2fnuz)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(string)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8)), tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[13, 18]|**B** = tensor(bool)< br /> **I** = tensor(int64)< br /> **V** = seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(string)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8)), tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[11, 12]|**B** = tensor(bool)< br /> **I** = tensor(int64)< br /> **V** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[1, 10]|**B** = tensor(bool)< br /> **I** = tensor(int64)< br /> **V** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|MatMul|*in* A:**T**< br > *in* B:**T**< br > *out* Y:**T**|13+|**T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16)|
|||[9, 12]|**T** = tensor(double), tensor(float), tensor(float16)|
|||[1, 8]|**T** = tensor(double), tensor(float), tensor(float16)|
|MatMulInteger|*in* A:**T1**< br > *in* B:**T2**< br > *in* a_zero_point:**T1**< br > *in* b_zero_point:**T2**< br > *out* Y:**T3**|10+|**T1** = tensor(int8)< br /> **T2** = tensor(int8)< br /> **T3** = tensor(int32)|
|Max|*in* data_0:**T**< br > *out* max:**T**|13+|**T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|||12|**T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|||[6, 11]|**T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16)|
|MaxPool|*in* X:**T**< br > *out* Y:**T**< br >< br > or< br >< br > *in* X:**T**< br > *out* Y:**T**< br > *out* Indices:**I**|12+|**I** = tensor(int64)< br /> **T** = tensor(double), tensor(float), tensor(float16), tensor(int8), tensor(uint8)|
|||11|**I** = tensor(int64)< br /> **T** = tensor(double), tensor(float), tensor(float16)|
|||10|**I** = tensor(int64)< br /> **T** = tensor(double), tensor(float), tensor(float16)|
|||[8, 9]|**I** = tensor(int64)< br /> **T** = tensor(double), tensor(float), tensor(float16)|
|||[1, 7]|**T** = tensor(double), tensor(float), tensor(float16)|
|MemcpyFromHost|*in* X:**T**< br > *out* Y:**T**|1+|**T** = seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(float8e4m3fn)), seq(tensor(float8e4m3fnuz)), seq(tensor(float8e5m2)), seq(tensor(float8e5m2fnuz)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8)), tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|MemcpyToHost|*in* X:**T**< br > *out* Y:**T**|1+|**T** = seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(float8e4m3fn)), seq(tensor(float8e4m3fnuz)), seq(tensor(float8e5m2)), seq(tensor(float8e5m2fnuz)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8)), tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Min|*in* data_0:**T**< br > *out* min:**T**|13+|**T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|||12|**T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|||[6, 11]|**T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16)|
|Mod|*in* A:**T**< br > *in* B:**T**< br > *out* C:**T**|13+|**T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|||[10, 12]|**T** = tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|Mul|*in* A:**T**< br > *in* B:**T**< br > *out* C:**T**|14+|**T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|||13|**T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|||[7, 12]|**T** = tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|Neg|*in* X:**T**< br > *out* Y:**T**|13+|**T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8)|
|||[6, 12]|**T** = tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8)|
|NonZero|*in* X:**T**< br > *out* Y:**tensor(int64)**|13+|**T** = tensor(bool), tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint8)|
|||[9, 12]|**T** = tensor(bool), tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint8)|
|Not|*in* X:**T**< br > *out* Y:**T**|1+|**T** = tensor(bool)|
|OneHot|*in* indices:**T1**< br > *in* depth:**T2**< br > *in* values:**T3**< br > *out* output:**T3**|11+|**T1** = tensor(int32), tensor(int64)< br /> **T2** = tensor(int32), tensor(int64)< br /> **T3** = tensor(float), tensor(float16), tensor(int64)|
|Or|*in* A:**T**< br > *in* B:**T**< br > *out* C:**T1**|7+|**T** = tensor(bool)< br /> **T1** = tensor(bool)|
|PRelu|*in* X:**T**< br > *in* slope:**T**< br > *out* Y:**T**|16+|**T** = tensor(double), tensor(float), tensor(float16)|
|||[9, 15]|**T** = tensor(double), tensor(float), tensor(float16)|
|||[7, 8]|**T** = tensor(double), tensor(float), tensor(float16)|
|Pad|*in* data:**T**< br > *in* pads:**tensor(int64)**< br > *in* constant_value:**T**< br > *in* axes:**Tind**< br > *out* output:**T**< br >< br > or< br >< br > *in* data:**T**< br > *in* pads:**tensor(int64)**< br > *in* constant_value:**T**< br > *out* output:**T**< br >< br > or< br >< br > *in* data:**T**< br > *out* output:**T**|18+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16)|
|||[13, 17]|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16)|
|||[11, 12]|**T** = tensor(double), tensor(float), tensor(float16)|
|||[2, 10]|**T** = tensor(double), tensor(float), tensor(float16)|
|ParametricSoftplus|*in* X:**T**< br > *out* Y:**T**|1+|**T** = tensor(double), tensor(float), tensor(float16)|
|Pow|*in* X:**T**< br > *in* Y:**T**< br > *out* Z:**T**< br >< br > or< br >< br > *in* X:**T**< br > *in* Y:**T1**< br > *out* Z:**T**|15+|**T** = tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64)< br /> **T1** = tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64)|
|||[13, 14]|**T** = tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64)< br /> **T1** = tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64)|
|||12|**T** = tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64)< br /> **T1** = tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64)|
|||[7, 11]|**T** = tensor(double), tensor(float), tensor(float16)|
|QuantizeLinear|*in* x:**T1**< br > *in* y_scale:**T1**< br > *in* y_zero_point:**T2**< br > *out* y:**T2**< br >< br > or< br >< br > *in* x:**T1**< br > *in* y_scale:**tensor(float)**< br > *in* y_zero_point:**T2**< br > *out* y:**T2**|19+|**T1** = tensor(float), tensor(float16)< br /> **T2** = tensor(float8e4m3fn), tensor(float8e5m2), tensor(int8), tensor(uint8)|
|||[13, 18]|**T1** = tensor(float)< br /> **T2** = tensor(int8), tensor(uint8)|
|||[10, 12]|**T1** = tensor(float)< br /> **T2** = tensor(int8), tensor(uint8)|
|RNN|*in* X:**T**< br > *in* W:**T**< br > *in* R:**T**< br > *in* B:**T**< br > *in* sequence_lens:**T1**< br > *in* initial_h:**T**< br > *out* Y:**T**< br > *out* Y_h:**T**|14+|**T** = tensor(double), tensor(float), tensor(float16)< br /> **T1** = tensor(int32)|
|||[7, 13]|**T** = tensor(double), tensor(float), tensor(float16)< br /> **T1** = tensor(int32)|
|RandomNormal|*out* output:**T**|1+|**T** = tensor(double), tensor(float), tensor(float16)|
|RandomNormalLike|*in* input:**T1**< br > *out* output:**T2**|1+|**T1** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)< br /> **T2** = tensor(double), tensor(float), tensor(float16)|
|RandomUniform|*out* output:**T**|1+|**T** = tensor(double), tensor(float), tensor(float16)|
|RandomUniformLike|*in* input:**T1**< br > *out* output:**T2**|1+|**T1** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)< br /> **T2** = tensor(double), tensor(float), tensor(float16)|
|Range|*in* start:**T**< br > *in* limit:**T**< br > *in* delta:**T**< br > *out* output:**T**|11+|**T** = tensor(double), tensor(float), tensor(int16), tensor(int32), tensor(int64)|
|Reciprocal|*in* X:**T**< br > *out* Y:**T**|13+|**T** = tensor(double), tensor(float), tensor(float16)|
|||[6, 12]|**T** = tensor(double), tensor(float), tensor(float16)|
|ReduceL1|*in* data:**T**< br > *in* axes:**tensor(int64)**< br > *out* reduced:**T**< br >< br > or< br >< br > *in* data:**T**< br > *out* reduced:**T**|18+|**T** = tensor(double), tensor(float), tensor(float16), tensor(int32)|
|||[1, 17]|**T** = tensor(double), tensor(float), tensor(float16), tensor(int32)|
|ReduceL2|*in* data:**T**< br > *in* axes:**tensor(int64)**< br > *out* reduced:**T**< br >< br > or< br >< br > *in* data:**T**< br > *out* reduced:**T**|18+|**T** = tensor(double), tensor(float), tensor(float16), tensor(int32)|
|||[1, 17]|**T** = tensor(double), tensor(float), tensor(float16), tensor(int32)|
|ReduceLogSum|*in* data:**T**< br > *in* axes:**tensor(int64)**< br > *out* reduced:**T**< br >< br > or< br >< br > *in* data:**T**< br > *out* reduced:**T**|18+|**T** = tensor(double), tensor(float), tensor(float16)|
|||[1, 17]|**T** = tensor(double), tensor(float), tensor(float16)|
|ReduceLogSumExp|*in* data:**T**< br > *in* axes:**tensor(int64)**< br > *out* reduced:**T**< br >< br > or< br >< br > *in* data:**T**< br > *out* reduced:**T**|18+|**T** = tensor(double), tensor(float), tensor(float16)|
|||[1, 17]|**T** = tensor(double), tensor(float), tensor(float16)|
|ReduceMax|*in* data:**T**< br > *in* axes:**tensor(int64)**< br > *out* reduced:**T**< br >< br > or< br >< br > *in* data:**T**< br > *out* reduced:**T**|18+|**T** = tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64)|
|||[1, 17]|**T** = tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64)|
|ReduceMean|*in* data:**T**< br > *in* axes:**tensor(int64)**< br > *out* reduced:**T**< br >< br > or< br >< br > *in* data:**T**< br > *out* reduced:**T**|18+|**T** = tensor(double), tensor(float), tensor(float16), tensor(int32)|
|||[1, 17]|**T** = tensor(double), tensor(float), tensor(float16), tensor(int32)|
|ReduceMin|*in* data:**T**< br > *in* axes:**tensor(int64)**< br > *out* reduced:**T**< br >< br > or< br >< br > *in* data:**T**< br > *out* reduced:**T**|18+|**T** = tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(int8), tensor(uint8)|
|||[1, 17]|**T** = tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(int8), tensor(uint8)|
|ReduceProd|*in* data:**T**< br > *in* axes:**tensor(int64)**< br > *out* reduced:**T**< br >< br > or< br >< br > *in* data:**T**< br > *out* reduced:**T**|18+|**T** = tensor(double), tensor(float), tensor(float16), tensor(int32)|
|||[1, 17]|**T** = tensor(double), tensor(float), tensor(float16), tensor(int32)|
|ReduceSum|*in* data:**T**< br > *in* axes:**tensor(int64)**< br > *out* reduced:**T**< br >< br > or< br >< br > *in* data:**T**< br > *out* reduced:**T**|13+|**T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64)|
|||[1, 12]|**T** = tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64)|
|ReduceSumSquare|*in* data:**T**< br > *in* axes:**tensor(int64)**< br > *out* reduced:**T**< br >< br > or< br >< br > *in* data:**T**< br > *out* reduced:**T**|18+|**T** = tensor(double), tensor(float), tensor(float16)|
|||[1, 17]|**T** = tensor(double), tensor(float), tensor(float16)|
|Relu|*in* X:**T**< br > *out* Y:**T**|14+|**T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16)|
|||13|**T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16)|
|||[6, 12]|**T** = tensor(double), tensor(float), tensor(float16)|
|Reshape|*in* data:**T**< br > *in* shape:**tensor(int64)**< br > *out* reshaped:**T**< br >< br > or< br >< br > *in* data:**T**< br > *out* reshaped:**T**|19+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)< br /> **shape** = tensor(int64)|
|||[14, 18]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)< br /> **shape** = tensor(int64)|
|||13|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)< br /> **shape** = tensor(int64)|
|||[5, 12]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)< br /> **shape** = tensor(int64)|
|||[1, 4]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Resize|*in* X:**T**< br > *in* scales:**tensor(float)**< br > *out* Y:**T**< br >< br > or< br >< br > *in* X:**T1**< br > *in* roi:**T2**< br > *in* scales:**tensor(float)**< br > *in* sizes:**tensor(int64)**< br > *out* Y:**T1**|18+|**T1** = tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(uint8)|
|||[13, 17]|**T1** = tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(uint8)|
|||[11, 12]|**T1** = tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(uint8)|
|||10|**T** = tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(uint8)|
|ReverseSequence|*in* input:**T**< br > *in* sequence_lens:**tensor(int64)**< br > *out* Y:**T**|10+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|RoiAlign|*in* X:**T1**< br > *in* rois:**T1**< br > *in* batch_indices:**T2**< br > *out* Y:**T1**|10+|**T1** = tensor(double), tensor(float)< br /> **T2** = tensor(int64)|
|Round|*in* X:**T**< br > *out* Y:**T**|11+|**T** = tensor(double), tensor(float), tensor(float16)|
|ScaledTanh|*in* input:**T**< br > *out* output:**T**|1+|**T** = tensor(double), tensor(float), tensor(float16)|
|Scan|*in* initial_state_and_scan_inputs:**V**< br > *out* final_state_and_scan_outputs:**V**< br >< br > or< br >< br > *in* sequence_lens:**I**< br > *in* initial_state_and_scan_inputs:**V**< br > *out* final_state_and_scan_outputs:**V**|19+|**V** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[16, 18]|**V** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[11, 15]|**V** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[9, 10]|**V** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||8|**I** = tensor(int64)< br /> **V** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Scatter|*in* data:**T**< br > *in* indices:**Tind**< br > *in* updates:**T**< br > *out* output:**T**|[9, 10]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)< br /> **Tind** = tensor(int32), tensor(int64)|
|ScatterElements|*in* data:**T**< br > *in* indices:**Tind**< br > *in* updates:**T**< br > *out* output:**T**|18+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)< br /> **Tind** = tensor(int32), tensor(int64)|
|||[16, 17]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)< br /> **Tind** = tensor(int32), tensor(int64)|
|||[13, 15]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)< br /> **Tind** = tensor(int32), tensor(int64)|
|||[11, 12]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)< br /> **Tind** = tensor(int32), tensor(int64)|
|ScatterND|*in* data:**T**< br > *in* indices:**tensor(int64)**< br > *in* updates:**T**< br > *out* output:**T**|18+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[16, 17]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[13, 15]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[11, 12]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Selu|*in* X:**T**< br > *out* Y:**T**|6+|**T** = tensor(double), tensor(float), tensor(float16)|
|SequenceAt|*in* input_sequence:**S**< br > *in* position:**I**< br > *out* tensor:**T**|11+|**I** = tensor(int32), tensor(int64)< br /> **S** = seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8))< br /> **T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|SequenceConstruct|*in* inputs:**T**< br > *out* output_sequence:**S**|11+|**S** = seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8))< br /> **T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|SequenceEmpty|*out* output:**S**|11+|**S** = seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8))|
|SequenceErase|*in* input_sequence:**S**< br > *in* position:**I**< br > *out* output_sequence:**S**|11+|**I** = tensor(int32), tensor(int64)< br /> **S** = seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8))|
|SequenceInsert|*in* input_sequence:**S**< br > *in* tensor:**T**< br > *in* position:**I**< br > *out* output_sequence:**S**|11+|**I** = tensor(int32), tensor(int64)< br /> **S** = seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8))|
|SequenceLength|*in* input_sequence:**S**< br > *out* length:**I**|11+|**I** = tensor(int64)< br /> **S** = seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8))|
|Shape|*in* data:**T**< br > *out* shape:**T1**|19+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)< br /> **T1** = tensor(int64)|
|||[15, 18]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)< br /> **T1** = tensor(int64)|
|||[13, 14]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)< br /> **T1** = tensor(int64)|
|||[1, 12]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)< br /> **T1** = tensor(int64)|
|Shrink|*in* input:**T**< br > *out* output:**T**|9+|**T** = tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Sigmoid|*in* X:**T**< br > *out* Y:**T**|13+|**T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16)|
|||[6, 12]|**T** = tensor(double), tensor(float), tensor(float16)|
|Sign|*in* input:**T**< br > *out* output:**T**|13+|**T** = tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|SimplifiedLayerNormalization|*in* X:**T**< br > *in* scale:**V**< br > *out* Y:**V**< br > *out* inv_std_var:**U**|1+|**T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16)< br /> **U** = tensor(double), tensor(float)< br /> **V** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16)|
|Sin|*in* input:**T**< br > *out* output:**T**|7+|**T** = tensor(double), tensor(float), tensor(float16)|
|Size|*in* data:**T**< br > *out* size:**T1**|13+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)< br /> **T1** = tensor(int64)|
|||[1, 12]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)< br /> **T1** = tensor(int64)|
|Slice|*in* data:**T**< br > *in* starts:**Tind**< br > *in* ends:**Tind**< br > *in* axes:**Tind**< br > *in* steps:**Tind**< br > *out* output:**T**< br >< br > or< br >< br > *in* data:**T**< br > *out* output:**T**|13+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)< br /> **Tind** = tensor(int32), tensor(int64)|
|||[11, 12]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)< br /> **Tind** = tensor(int32), tensor(int64)|
|||10|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)< br /> **Tind** = tensor(int32), tensor(int64)|
|||[1, 9]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Softmax|*in* input:**T**< br > *out* output:**T**|13+|**T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16)|
|||[11, 12]|**T** = tensor(double), tensor(float), tensor(float16)|
|||[1, 10]|**T** = tensor(double), tensor(float), tensor(float16)|
|Softplus|*in* X:**T**< br > *out* Y:**T**|1+|**T** = tensor(double), tensor(float), tensor(float16)|
|Softsign|*in* input:**T**< br > *out* output:**T**|1+|**T** = tensor(double), tensor(float), tensor(float16)|
|SpaceToDepth|*in* input:**T**< br > *out* output:**T**|13+|**T** = tensor(double), tensor(float), tensor(float16)|
|||[1, 12]|**T** = tensor(double), tensor(float), tensor(float16)|
|Split|*in* input:**T**< br > *in* split:**T**< br > *out* outputs...:**T**< br >< br > or< br >< br > *in* input:**T**< br > *in* split:**tensor(int64)**< br > *out* outputs:**T**< br >< br > or< br >< br > *in* input:**T**< br > *out* outputs:**T**|18+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[13, 17]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[11, 12]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[2, 10]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Sqrt|*in* X:**T**< br > *out* Y:**T**|13+|**T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16)|
|||[6, 12]|**T** = tensor(double), tensor(float), tensor(float16)|
|Squeeze|*in* data:**T**< br > *in* axes:**tensor(int64)**< br > *out* squeezed:**T**< br >< br > or< br >< br > *in* data:**T**< br > *out* squeezed:**T**|13+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[11, 12]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[1, 10]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Sub|*in* A:**T**< br > *in* B:**T**< br > *out* C:**T**|14+|**T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|||13|**T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|||[7, 12]|**T** = tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|Sum|*in* data_0:**T**< br > *out* sum:**T**|13+|**T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16)|
|||[8, 12]|**T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16)|
|||[6, 7]|**T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16)|
|Tanh|*in* input:**T**< br > *out* output:**T**|13+|**T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16)|
|||[6, 12]|**T** = tensor(double), tensor(float), tensor(float16)|
|ThresholdedRelu|*in* X:**T**< br > *out* Y:**T**|10+|**T** = tensor(double), tensor(float), tensor(float16)|
|||1+|**T** = tensor(double), tensor(float), tensor(float16)|
|Tile|*in* input:**T**< br > *in* repeats:**T1**< br > *out* output:**T**< br >< br > or< br >< br > *in* input:**T**< br > *in* tiles:**T**< br > *in* axis:**T**< br > *out* output:**T**|13+|**T** = tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64)< br /> **T1** = tensor(int64)|
|||[6, 12]|**T** = tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64)< br /> **T1** = tensor(int64)|
|TopK|*in* X:**T**< br > *in* K:**tensor(int64)**< br > *out* Values:**T**< br > *out* Indices:**I**< br >< br > or< br >< br > *in* X:**T**< br > *out* Values:**T**< br > *out* Indices:**I**|11+|**I** = tensor(int64)< br /> **T** = tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64)|
|||10|**I** = tensor(int64)< br /> **T** = tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64)|
|||[1, 9]|**T** = tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64)|
|Transpose|*in* data:**T**< br > *out* transposed:**T**|13+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[1, 12]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Trilu|*in* input:**T**< br > *in* k:**tensor(int64)**< br > *out* output:**T**|14+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Unsqueeze|*in* data:**T**< br > *in* axes:**tensor(int64)**< br > *out* expanded:**T**< br >< br > or< br >< br > *in* data:**T**< br > *out* expanded:**T**|13+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[11, 12]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||[1, 10]|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Upsample|*in* X:**T**< br > *in* scales:**tensor(float)**< br > *out* Y:**T**< br >< br > or< br >< br > *in* X:**T**< br > *out* Y:**T**|9|**T** = tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(uint8)|
|||[7, 8]|**T** = tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(uint8)|
|Where|*in* condition:**B**< br > *in* X:**T**< br > *in* Y:**T**< br > *out* output:**T**|16+|**B** = tensor(bool)< br /> **T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint8)|
|||[9, 15]|**B** = tensor(bool)< br /> **T** = tensor(double), tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint8)|
|Xor|*in* A:**T**< br > *in* B:**T**< br > *out* C:**T1**|7+|**T** = tensor(bool)< br /> **T1** = tensor(bool)|
| |
| |
|**Operator Domain:** *com.microsoft* ||||
|Attention|*in* input:**T**< br > *in* weights:**T**< br > *in* bias:**T**< br > *in* mask_index:**M**< br > *in* past:**T**< br > *in* relative_position_bias:**T**< br > *in* past_sequence_length:**M**< br > *out* output:**T**< br > *out* present:**T**|1+|**T** = tensor(float), tensor(float16)|
|BeamSearch|*in* input_ids:**F**< br > *in* max_length:**I**< br > *in* min_length:**I**< br > *in* num_beams:**I**< br > *in* num_return_sequences:**I**< br > *in* length_penalty:**T**< br > *in* repetition_penalty:**T**< br > *in* vocab_mask:**M**< br > *in* prefix_vocab_mask:**M**< br > *in* attention_mask:**I**< br > *in* decoder_input_ids:**I**< br > *in* logits_processor:**I**< br > *out* sequences:**I**< br > *out* sequences_scores:**T**< br > *out* scores:**T**|1+|**T** = tensor(float), tensor(float16)|
|BiasAdd|*in* X:**T**< br > *in* bias:**T**< br > *in* skip:**T**< br > *out* Y:**T**|1+|**T** = tensor(float), tensor(float16)|
|BiasDropout|*in* data:**T**< br > *in* bias:**T**< br > *in* residual:**T**< br > *in* ratio:**T1**< br > *in* training_mode:**T2**< br > *out* output:**T**< br > *out* mask:**T2**|1+|**T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16)< br /> **T1** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16)< br /> **T2** = tensor(bool)|
|BiasGelu|*in* A:**T**< br > *in* B:**T**< br > *out* C:**T**|1+|**T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16)|
|BiasSoftmax|*in* data:**T**< br > *in* bias:**T**< br > *out* output:**T**|1+|**T** = tensor(double), tensor(float), tensor(float16)|
|BiasSplitGelu|*in* X:**T**< br > *in* bias:**T**< br > *out* Y:**T**|1+|**T** = tensor(float), tensor(float16)|
|BitmaskBiasDropout|*in* data:**T**< br > *in* bias:**T**< br > *in* residual:**T**< br > *in* ratio:**T1**< br > *in* training_mode:**T2**< br > *out* output:**T**< br > *out* mask:**T3**|1+|**T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16)< br /> **T1** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16)< br /> **T2** = tensor(bool)< br /> **T3** = tensor(uint32)|
|BitmaskDropout|*in* data:**T**< br > *in* ratio:**T1**< br > *in* training_mode:**T2**< br > *out* output:**T**< br > *out* mask:**T3**|1+|**T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16)< br /> **T1** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16)< br /> **T2** = tensor(bool)< br /> **T3** = tensor(uint32)|
|ComplexMul|*in* A:**T**< br > *in* B:**T**< br > *out* C:**T**|1+|**T** = tensor(float), tensor(float16)|
|ComplexMulConj|*in* A:**T**< br > *in* B:**T**< br > *out* C:**T**|1+|**T** = tensor(float), tensor(float16)|
|ConvTransposeWithDynamicPads|*in* X:**T**< br > *in* W:**T**< br > *in* Pads:**tensor(int64)**< br > *in* B:**T**< br > *out* Y:**T**|1+|**T** = tensor(float)|
|DecoderAttention|*in* query:**T**< br > *in* key:**T**< br > *in* q_weight:**T**< br > *in* kv_weight:**T**< br > *in* bias:**T**< br > *in* key_padding_mask:**B**< br > *in* key_cache:**T**< br > *in* value_cache:**T**< br > *in* static_kv:**B**< br > *in* use_past:**B**< br > *in* has_layer_state:**B**< br > *in* has_key_padding_mask:**B**< br > *out* output:**T**< br > *out* new_key_cache:**T**< br > *out* new_value_cache:**T**|1+|**T** = tensor(float), tensor(float16)|
|DecoderMaskedMultiHeadAttention|*in* query:**T**< br > *in* key:**T**< br > *in* value:**T**< br > *in* mask_index:**M**< br > *in* relative_position_bias:**T**< br > *in* past_key:**T**< br > *in* past_value:**T**< br > *in* past_sequence_length:**M**< br > *in* beam_width:**M**< br > *in* cache_indirection:**M**< br > *in* bias:**T**< br > *out* output:**T**< br > *out* present_key:**T**< br > *out* present_value:**T**< br > *out* qk:**V**|1+|**T** = tensor(float), tensor(float16)|
|DecoderMaskedSelfAttention|*in* input:**T**< br > *in* weights:**T**< br > *in* bias:**T**< br > *in* mask_index:**M**< br > *in* past:**T**< br > *in* relative_position_bias:**T**< br > *in* past_sequence_length:**M**< br > *in* beam_width:**M**< br > *in* cache_indirection:**M**< br > *out* output:**T**< br > *out* present:**T**|1+|**T** = tensor(float), tensor(float16)|
|DequantizeLinear|*in* x:**T1**< br > *in* x_scale:**T2**< br > *in* x_zero_point:**T1**< br > *out* y:**T2**|1+|**T1** = tensor(int8), tensor(uint8)< br /> **T2** = tensor(float16)|
|DequantizeWithOrder|*in* input:**Q**< br > *in* scale_input:**S**< br > *out* output:**F**|1+|**F** = tensor(float), tensor(float16)< br /> **Q** = tensor(int8)< br /> **S** = tensor(float)|
|DynamicTimeWarping|*in* input:**F**< br > *out* output:**I**|1+|**F** = tensor(float)< br /> **I** = tensor(int32)|
|EmbedLayerNormalization|*in* input_ids:**T1**< br > *in* segment_ids:**T1**< br > *in* word_embedding:**T**< br > *in* position_embedding:**T**< br > *in* segment_embedding:**T**< br > *in* gamma:**T**< br > *in* beta:**T**< br > *in* mask:**T1**< br > *in* position_ids:**T1**< br > *out* output:**T**< br > *out* mask_index:**T1**< br > *out* embedding_sum:**T**|1+|**T** = tensor(float), tensor(float16)|
|FastGelu|*in* X:**T**< br > *in* bias:**T**< br > *out* Y:**T**|1+|**T** = tensor(bfloat16), tensor(float), tensor(float16)|
|FusedConv|*in* X:**T**< br > *in* W:**T**< br > *in* B:**T**< br > *in* Z:**T**< br > *out* Y:**T**|1+|**T** = tensor(float)|
|FusedMatMul|*in* A:**T**< br > *in* B:**T**< br > *out* Y:**T**|1+|**T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16)|
|GatedRelativePositionBias|*in* query_layer:**T**< br > *in* query_bias:**T**< br > *in* rel_pos:**T**< br > *in* weight:**T**< br > *in* bias:**T**< br > *in* eco_a:**T**< br > *in* token_offset:**M**< br > *out* output:**T**|1+|**T** = tensor(float), tensor(float16)|
|Gelu|*in* X:**T**< br > *out* Y:**T**|1+|**T** = tensor(double), tensor(float), tensor(float16)|
|GemmFloat8|*in* A:**TA**< br > *in* B:**TB**< br > *in* C:**TC**< br > *in* scaleA:**TS**< br > *in* scaleB:**TS**< br > *in* scaleY:**TS**< br > *out* Y:**TR**|1+|**TA** = tensor(bfloat16), tensor(float), tensor(float16), tensor(float8e4m3fn), tensor(float8e5m2)< br /> **TB** = tensor(bfloat16), tensor(float), tensor(float16), tensor(float8e4m3fn), tensor(float8e5m2)< br /> **TR** = tensor(bfloat16), tensor(float), tensor(float16), tensor(float8e4m3fn), tensor(float8e5m2)< br /> **TS** = tensor(float)|
|GemmaRotaryEmbedding|*in* emb:**U**< br > *in* q:**T**< br > *in* q_rot:**T**< br > *in* k:**T**< br > *in* k_rot:**T**< br > *out* output1:**T**< br > *out* output2:**T**|1+|**T** = tensor(float16)< br /> **U** = tensor(float)|
|GreedySearch|*in* input_ids:**I**< br > *in* max_length:**I**< br > *in* min_length:**I**< br > *in* repetition_penalty:**T**< br > *in* vocab_mask:**I**< br > *in* prefix_vocab_mask:**I**< br > *in* attention_mask:**I**< br > *out* sequences:**I**|1+|**T** = tensor(float), tensor(float16)|
|GridSample|*in* X:**T1**< br > *in* Grid:**T1**< br > *out* Y:**T2**|1+|**T1** = tensor(float)< br /> **T2** = tensor(float)|
|GroupNorm|*in* X:**T**< br > *in* gamma:**M**< br > *in* beta:**M**< br > *out* Y:**T**|1+|**T** = tensor(float), tensor(float16)|
|GroupQueryAttention|*in* query:**T**< br > *in* key:**T**< br > *in* value:**T**< br > *in* past_key:**T**< br > *in* past_value:**T**< br > *in* seqlens_k:**M**< br > *in* total_sequence_length:**M**< br > *in* cos_cache:**T**< br > *in* sin_cache:**T**< br > *out* output:**T**< br > *out* present_key:**T**< br > *out* present_value:**T**|1+|**M** = tensor(int32)< br /> **T** = tensor(bfloat16), tensor(float16)|
|Inverse|*in* X:**T**<br> *out* Y:**T**|1+|**T** = tensor(double), tensor(float), tensor(float16)|
|Irfft|*in* X:**T**<br> *out* Y:**T**|1+|**T** = tensor(double), tensor(float), tensor(float16)|
|LongformerAttention|*in* input:**T**<br> *in* weight:**T**<br> *in* bias:**T**<br> *in* mask:**T**<br> *in* global_weight:**T**<br> *in* global_bias:**T**<br> *in* global:**G**<br> *out* output:**T**|1+|**T** = tensor(float), tensor(float16)|
|MatMulBnb4|*in* A:**T1**<br> *in* B:**T2**<br> *in* absmax:**T1**<br> *out* Y:**T1**|1+|**T1** = tensor(bfloat16), tensor(float), tensor(float16)<br/> **T2** = tensor(uint8)|
|MatMulNBits|*in* A:**T1**<br> *in* B:**T2**<br> *in* scales:**T1**<br> *in* zero_points:**T3**<br> *in* g_idx:**T4**<br> *out* Y:**T1**|1+|**T1** = tensor(float), tensor(float16)<br/> **T2** = tensor(uint8)|
|MoE|*in* input:**T**<br> *in* router_probs:**T**<br> *in* fc1_experts_weights:**T**<br> *in* fc1_experts_bias:**T**<br> *in* fc2_experts_weights:**T**<br> *in* fc2_experts_bias:**T**<br> *in* fc3_experts_weights:**T**<br> *in* fc3_experts_bias:**T**<br> *out* output:**T**|1+|**T** = tensor(float), tensor(float16)|
|MultiHeadAttention|*in* query:**T**<br> *in* key:**T**<br> *in* value:**T**<br> *in* bias:**T**<br> *in* key_padding_mask:**M**<br> *in* relative_position_bias:**T**<br> *in* past_key:**T**<br> *in* past_value:**T**<br> *out* output:**T**<br> *out* present_key:**T**<br> *out* present_value:**T**|1+|**T** = tensor(float), tensor(float16)|
|NGramRepeatBlock|*in* input_ids:**Tid**<br> *in* scores:**T**<br> *out* scores_out:**T**|1+|**T** = tensor(float)<br/> **Tid** = tensor(int64)|
|NhwcConv|*in* X:**T**<br> *in* W:**T**<br> *in* B:**T**<br> *out* Y:**T**|1+|**T** = tensor(float), tensor(float16)|
|PackedAttention|*in* input:**T**<br> *in* weights:**T**<br> *in* bias:**T**<br> *in* token_offset:**M**<br> *in* cumulative_sequence_length:**M**<br> *in* relative_position_bias:**T**<br> *out* output:**T**|1+|**T** = tensor(float), tensor(float16)|
|PackedMultiHeadAttention|*in* query:**T**<br> *in* key:**T**<br> *in* value:**T**<br> *in* bias:**T**<br> *in* token_offset:**M**<br> *in* cumulative_sequence_length:**M**<br> *in* relative_position_bias:**T**<br> *out* output:**T**|1+|**T** = tensor(float), tensor(float16)|
|QAttention|*in* input:**T1**<br> *in* weight:**T2**<br> *in* bias:**T3**<br> *in* input_scale:**T3**<br> *in* weight_scale:**T3**<br> *in* mask_index:**T4**<br> *in* input_zero_point:**T1**<br> *in* weight_zero_point:**T2**<br> *in* past:**T3**<br> *out* output:**T3**<br> *out* present:**T3**|1+|**T1** = tensor(int8)<br/> **T2** = tensor(int8)<br/> **T3** = tensor(float), tensor(float16)<br/> **T4** = tensor(int32)|
|QMoE|*in* input:**T**<br> *in* router_probs:**T**<br> *in* fc1_experts_weights:**T1**<br> *in* fc1_scales:**T**<br> *in* fc1_experts_bias:**T**<br> *in* fc2_experts_weights:**T1**<br> *in* fc2_scales:**T**<br> *in* fc2_experts_bias:**T**<br> *in* fc3_experts_weights:**T1**<br> *in* fc3_scales:**T**<br> *in* fc3_experts_bias:**T**<br> *out* output:**T**|1+|**T** = tensor(float16)<br/> **T1** = tensor(uint8)|
|QOrderedAttention|*in* input:**Q**<br> *in* scale_input:**S**<br> *in* scale_Q_gemm:**S**<br> *in* scale_K_gemm:**S**<br> *in* scale_V_gemm:**S**<br> *in* Q_weight:**Q**<br> *in* K_weight:**Q**<br> *in* V_weight:**Q**<br> *in* scale_Q_weight:**S**<br> *in* scale_K_weight:**S**<br> *in* scale_V_weight:**S**<br> *in* Q_bias:**S**<br> *in* K_bias:**S**<br> *in* V_bias:**S**<br> *in* scale_QKT_gemm:**S**<br> *in* scale_QKT_softmax:**S**<br> *in* scale_values_gemm:**S**<br> *in* mask_index:**G**<br> *in* past:**Q**<br> *in* relative_position_bias:**S**<br> *out* output:**Q**|1+|**G** = tensor(int32)<br/> **Q** = tensor(int8)<br/> **S** = tensor(float)|
|QOrderedGelu|*in* X:**Q**<br> *in* scale_X:**S**<br> *in* scale_Y:**S**<br> *out* Y:**Q**|1+|**Q** = tensor(int8)<br/> **S** = tensor(float)|
|QOrderedLayerNormalization|*in* X:**Q**<br> *in* scale_X:**S**<br> *in* scale:**F**<br> *in* B:**F**<br> *in* scale_Y:**S**<br> *out* Y:**Q**|1+|**F** = tensor(float), tensor(float16)<br/> **Q** = tensor(int8)<br/> **S** = tensor(float)|
|QOrderedLongformerAttention|*in* input:**Q**<br> *in* scale_input:**S**<br> *in* weight:**Q**<br> *in* scale_weight:**S**<br> *in* bias:**S**<br> *in* scale_bias:**S**<br> *in* scale_qkv_gemm:**S**<br> *in* mask:**F**<br> *in* global_weight:**Q**<br> *in* scale_global_weight:**S**<br> *in* global_bias:**S**<br> *in* scale_global_gemm:**S**<br> *in* global:**G**<br> *in* scale_output:**S**<br> *out* output:**Q**|1+|**F** = tensor(float16)<br/> **G** = tensor(int32)<br/> **Q** = tensor(int8)<br/> **S** = tensor(float)|
|QOrderedMatMul|*in* A:**Q**<br> *in* scale_A:**S**<br> *in* B:**Q**<br> *in* scale_B:**S**<br> *in* scale_Y:**S**<br> *in* bias:**S**<br> *in* C:**Q**<br> *in* scale_C:**S**<br> *out* Y:**Q**|1+|**Q** = tensor(int8)<br/> **S** = tensor(float)|
|QuantizeLinear|*in* x:**T1**<br> *in* y_scale:**T1**<br> *in* y_zero_point:**T2**<br> *out* y:**T2**|1+|**T1** = tensor(float16)<br/> **T2** = tensor(int8), tensor(uint8)|
|QuantizeWithOrder|*in* input:**F**<br> *in* scale_input:**S**<br> *out* output:**Q**|1+|**F** = tensor(float), tensor(float16)<br/> **Q** = tensor(int8)<br/> **S** = tensor(float)|
|QuickGelu|*in* X:**T**<br> *out* Y:**T**|1+|**T** = tensor(double), tensor(float), tensor(float16)|
|RelativePositionBias|*in* bias_table:**T**<br> *in* query_length:**U**<br> *in* key_length:**U**<br> *out* output:**T**|1+|**T** = tensor(float), tensor(float16)|
|RemovePadding|*in* input:**T**<br> *in* sequence_token_count:**M**<br> *out* output:**T**<br> *out* token_offset:**M**<br> *out* cumulated_seq_len:**M**<br> *out* max_seq_len:**M**|1+|**T** = tensor(float), tensor(float16)|
|RestorePadding|*in* input:**T**<br> *in* token_offset:**M**<br> *out* output:**T**|1+|**T** = tensor(float), tensor(float16)|
|Rfft|*in* X:**T**<br> *out* Y:**T**|1+|**T** = tensor(double), tensor(float), tensor(float16)|
|RotaryEmbedding|*in* input:**T**<br> *in* position_ids:**M**<br> *in* cos_cache:**T**<br> *in* sin_cache:**T**<br> *out* output:**T**|1+|**M** = tensor(int64)<br/> **T** = tensor(bfloat16), tensor(float), tensor(float16)|
|Sampling|*in* input_ids:**I**<br> *in* max_length:**I**<br> *in* min_length:**I**<br> *in* repetition_penalty:**T**<br> *in* vocab_mask:**I**<br> *in* prefix_vocab_mask:**I**<br> *in* attention_mask:**I**<br> *in* presence_mask:**I**<br> *in* seed:**I**<br> *out* sequences:**I**<br> *out* filtered_logits:**T**|1+|**T** = tensor(float), tensor(float16)|
|SkipGroupNorm|*in* X:**T**<br> *in* gamma:**M**<br> *in* beta:**M**<br> *in* skip:**T**<br> *in* bias:**T**<br> *out* Y:**T**<br> *out* S:**T**|1+|**T** = tensor(float), tensor(float16)|
|SkipLayerNormalization|*in* input:**T**<br> *in* skip:**T**<br> *in* gamma:**T**<br> *in* beta:**T**<br> *in* bias:**T**<br> *out* output:**T**<br> *out* mean:**U**<br> *out* inv_std_var:**U**<br> *out* input_skip_bias_sum:**T**|1+|**T** = tensor(float), tensor(float16)|
|SkipSimplifiedLayerNormalization|*in* input:**T**<br> *in* skip:**T**<br> *in* gamma:**T**<br> *in* bias:**T**<br> *out* output:**T**<br> *out* mean:**U**<br> *out* inv_std_var:**U**<br> *out* input_skip_bias_sum:**T**|1+|**T** = tensor(float), tensor(float16)|
|SparseAttention|*in* query:**T**<br> *in* key:**T**<br> *in* value:**T**<br> *in* past_key:**T**<br> *in* past_value:**T**<br> *in* block_mask:**M**<br> *in* total_sequence_length:**M**<br> *in* key_total_sequence_lengths:**M**<br> *in* cos_cache:**T**<br> *in* sin_cache:**T**<br> *out* output:**T**<br> *out* present_key:**T**<br> *out* present_value:**T**|1+|**M** = tensor(int32)<br/> **T** = tensor(bfloat16), tensor(float16)|
|TransposeMatMul|*in* A:**T**<br> *in* B:**T**<br> *out* Y:**T**|1+|**T** = tensor(bfloat16), tensor(double), tensor(float), tensor(float16)|
|Trilu|*in* X:**T**<br> *in* k:**tensor(int64)**<br> *out* Y:**T**|1+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|UnfoldTensor|*in* input:**T**<br> *out* output:**T**|1+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|WhisperBeamSearch|*in* input_ids:**F**<br> *in* max_length:**I**<br> *in* min_length:**I**<br> *in* num_beams:**I**<br> *in* num_return_sequences:**I**<br> *in* length_penalty:**T**<br> *in* repetition_penalty:**T**<br> *in* vocab_mask:**M**<br> *in* prefix_vocab_mask:**M**<br> *in* attention_mask:**I**<br> *in* decoder_input_ids:**I**<br> *in* logits_processor:**I**<br> *in* cross_qk_layer_head:**I**<br> *in* extra_decoding_ids:**I**<br> *in* temperature:**T**<br> *out* sequences:**I**<br> *out* sequences_scores:**T**<br> *out* scores:**T**<br> *out* cross_qk:**V**<br> *out* non_speech_probs:**T**|1+|**T** = tensor(float), tensor(float16)|
| |
| |
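Every table in this file uses the same layout. A row with an empty Op Name cell continues the operator above it. The OpSet Version cell holds an open-ended start (`13+`), a single version (`13`), or a closed range (`[6, 12]`). The Types Supported cell lists `**Name** = type, type` constraints, with groups separated by `<br/>`.

The Python sketch below shows one way to read those cells. The helper names `parse_version`, `select_row` and `parse_type_constraints` are hypothetical and are not part of ONNX Runtime. `select_row` assumes the usual ONNX resolution rule: a model importing opset N is served by the registration with the largest start version that does not exceed N.

```python
import re
from typing import Optional


def parse_version(spec: str) -> tuple[int, Optional[int]]:
    """Parse an OpSet Version cell: '13+' -> (13, None), '[6, 12]' -> (6, 12), '13' -> (13, 13)."""
    spec = spec.strip()
    if spec.endswith("+"):
        return int(spec[:-1]), None  # open-ended: no upper bound
    m = re.fullmatch(r"\[(\d+),\s*(\d+)\]", spec)
    if m:
        return int(m.group(1)), int(m.group(2))  # closed range
    return int(spec), int(spec)  # a single version


def select_row(specs: list[str], opset: int) -> Optional[str]:
    """Hypothetical helper: pick the row whose start version is the largest one <= opset."""
    best = None
    for spec in specs:
        start, end = parse_version(spec)
        if start <= opset and (best is None or start > best[1]):
            best = (spec, start, end)
    if best is None:
        return None  # every registration starts after the requested opset
    spec, _, end = best
    # A closed range that ends before the requested opset does not cover it.
    return spec if end is None or opset <= end else None


def parse_type_constraints(cell: str) -> dict[str, list[str]]:
    """Hypothetical helper: split a Types Supported cell into {constraint name: [types]}."""
    constraints: dict[str, list[str]] = {}
    # Accept both '<br/>' and space-padded variants such as '< br />'.
    for part in re.split(r"<\s*br\s*/?\s*>", cell):
        m = re.match(r"\s*\*\*(\w+)\*\*\s*=\s*(.+)", part)
        if m:
            constraints[m.group(1)] = [t.strip() for t in m.group(2).split(", ")]
    return constraints
```

The DirectML `Abs` rows list `13+` and `6+`. For those rows, `select_row(["13+", "6+"], 12)` picks `6+`, because 13 is greater than the requested opset.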
<a name="dmlexecutionprovider"/>
## Operators implemented by DmlExecutionProvider
| Op Name | Parameters | OpSet Version | Types Supported |
|---------|------------|---------------|-----------------|
|**Operator Domain:** *ai.onnx* ||||
|Abs|*in* X:**T**<br> *out* Y:**T**|13+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8)|
|||6+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8)|
|Acos|*in* input:**T**<br> *out* output:**T**|7+|**T** = tensor(float), tensor(float16)|
|Acosh|*in* input:**T**<br> *out* output:**T**|9+|**T** = tensor(float), tensor(float16)|
|Add|*in* A:**T**<br> *in* B:**T**<br> *out* C:**T**|14+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||13+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||7+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Affine|*in* X:**T**<br> *out* Y:**T**|1+|**T** = tensor(float), tensor(float16)|
|And|*in* A:**T**<br> *in* B:**T**<br> *out* C:**T1**|7+|**T** = tensor(bool)|
|ArgMax|*in* data:**T**<br> *out* reduced:**tensor(int64)**|13+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||12+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||11+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||1+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|ArgMin|*in* data:**T**<br> *out* reduced:**tensor(int64)**|13+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||12+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||11+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||1+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Asin|*in* input:**T**<br> *out* output:**T**|7+|**T** = tensor(float), tensor(float16)|
|Asinh|*in* input:**T**<br> *out* output:**T**|9+|**T** = tensor(float), tensor(float16)|
|Atan|*in* input:**T**<br> *out* output:**T**|7+|**T** = tensor(float), tensor(float16)|
|Atanh|*in* input:**T**<br> *out* output:**T**|9+|**T** = tensor(float), tensor(float16)|
|AveragePool|*in* X:**T**<br> *out* Y:**T**|19+|**T** = tensor(float), tensor(float16)|
|||11+|**T** = tensor(float), tensor(float16)|
|||10+|**T** = tensor(float), tensor(float16)|
|||7+|**T** = tensor(float), tensor(float16)|
|BatchNormalization|*in* X:**T**<br> *in* scale:**T**<br> *in* B:**T**<br> *in* input_mean:**U**<br> *in* input_var:**U**<br> *out* Y:**T**<br> *out* running_mean:**U**<br> *out* running_var:**U**<br><br> or<br><br> *in* X:**T**<br> *in* scale:**T**<br> *in* B:**T**<br> *in* mean:**T**<br> *in* var:**T**<br> *out* Y:**T**<br> *out* mean:**T**<br> *out* var:**T**<br> *out* saved_mean:**T**<br> *out* saved_var:**T**<br><br> or<br><br> *in* X:**T**<br> *in* scale:**T1**<br> *in* B:**T1**<br> *in* input_mean:**T2**<br> *in* input_var:**T2**<br> *out* Y:**T**<br> *out* running_mean:**T2**<br> *out* running_var:**T2**|15+|**T** = tensor(float), tensor(float16)|
|||14+|**T** = tensor(float), tensor(float16)|
|||9+|**T** = tensor(float), tensor(float16)|
|||7+|**T** = tensor(float), tensor(float16)|
|BitShift|*in* X:**T**<br> *in* Y:**T**<br> *out* Z:**T**|11+|**T** = tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|BitwiseAnd|*in* A:**T**<br> *in* B:**T**<br> *out* C:**T**|18+|**T** = tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|BitwiseNot|*in* X:**T**<br> *out* Y:**T**|18+|**T** = tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|BitwiseOr|*in* A:**T**<br> *in* B:**T**<br> *out* C:**T**|18+|**T** = tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|BitwiseXor|*in* A:**T**<br> *in* B:**T**<br> *out* C:**T**|18+|**T** = tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Cast|*in* input:**T1**<br> *out* output:**T2**|19+|**T1** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T2** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||13+|**T1** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T2** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||9+|**T1** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T2** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||6+|**T1** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T2** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|CastLike|*in* input:**T1**<br> *in* target_type:**T2**<br> *out* output:**T2**|19+|**T1** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T2** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||15+|**T1** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T2** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Ceil|*in* X:**T**<br> *out* Y:**T**|13+|**T** = tensor(float), tensor(float16)|
|||6+|**T** = tensor(float), tensor(float16)|
|Celu|*in* X:**T**<br> *out* Y:**T**|12+|**T** = tensor(float), tensor(float16)|
|Clip|*in* input:**T**<br> *in* min:**T**<br> *in* max:**T**<br> *out* output:**T**<br><br> or<br><br> *in* input:**T**<br> *out* output:**T**|13+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||12+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||11+|**T** = tensor(float), tensor(float16)|
|||6+|**T** = tensor(float), tensor(float16)|
|Concat|*in* inputs:**T**<br> *out* concat_result:**T**|13+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||11+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||4+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|ConcatFromSequence|*in* input_sequence:**S**<br> *out* concat_result:**T**|11+|**T** = seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8)), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|ConstantOfShape|*in* input:**T1**<br> *out* output:**T2**|9+|**T1** = tensor(int64)<br/> **T2** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Conv|*in* X:**T**<br> *in* W:**T**<br> *in* B:**T**<br> *out* Y:**T**|11+|**T** = tensor(float), tensor(float16)|
|||1+|**T** = tensor(float), tensor(float16)|
|ConvInteger|*in* x:**T1**<br> *in* w:**T2**<br> *in* x_zero_point:**T1**<br> *in* w_zero_point:**T2**<br> *out* y:**T3**|10+|**T1** = tensor(int8), tensor(uint8)<br/> **T2** = tensor(int8), tensor(uint8)<br/> **T3** = tensor(int32)|
|ConvTranspose|*in* X:**T**<br> *in* W:**T**<br> *in* B:**T**<br> *out* Y:**T**|11+|**T** = tensor(float), tensor(float16)|
|||1+|**T** = tensor(float), tensor(float16)|
|Cos|*in* input:**T**<br> *out* output:**T**|7+|**T** = tensor(float), tensor(float16)|
|Cosh|*in* input:**T**<br> *out* output:**T**|9+|**T** = tensor(float), tensor(float16)|
|Crop|*in* input:**T**<br> *out* output:**T**|1+|**T** = tensor(float), tensor(float16)|
|CumSum|*in* x:**T**<br> *in* axis:**T2**<br> *out* y:**T**|14+|**T** = tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|||11+|**T** = tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|DFT|*in* input:**T1**<br> *in* dft_length:**T2**<br> *in* axis:**tensor(int64)**<br> *out* output:**T1**<br><br> or<br><br> *in* input:**T1**<br> *in* dft_length:**T2**<br> *out* output:**T1**|20+|**T1** = tensor(double), tensor(float), tensor(float16)<br/> **T2** = tensor(int32), tensor(int64)|
|||17+|**T1** = tensor(double), tensor(float), tensor(float16)<br/> **T2** = tensor(int32), tensor(int64)|
|DepthToSpace|*in* input:**T**<br> *out* output:**T**|13+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||11+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||1+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|DequantizeLinear|*in* x:**T**<br> *in* x_scale:**tensor(float)**<br> *in* x_zero_point:**T**<br> *out* y:**tensor(float)**<br><br> or<br><br> *in* x:**T1**<br> *in* x_scale:**T2**<br> *in* x_zero_point:**T1**<br> *out* y:**T2**|19+|**T1** = tensor(int32), tensor(int8), tensor(uint8)<br/> **T2** = tensor(float), tensor(float16)|
|||13+|**T** = tensor(int32), tensor(int8), tensor(uint8)|
|||10+|**T** = tensor(int32), tensor(int8), tensor(uint8)|
|Div|*in* A:**T**<br> *in* B:**T**<br> *out* C:**T**|14+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||13+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||7+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Dropout|*in* data:**T**<br> *in* ratio:**T1**<br> *in* training_mode:**T2**<br> *out* output:**T**<br> *out* mask:**T2**<br><br> or<br><br> *in* data:**T**<br> *out* output:**T**<br> *out* mask:**T**<br><br> or<br><br> *in* data:**T**<br> *out* output:**T**<br> *out* mask:**T1**|7+|**T** = tensor(float), tensor(float16)|
|DynamicQuantizeLinear|*in* x:**T1**<br> *out* y:**T2**<br> *out* y_scale:**tensor(float)**<br> *out* y_zero_point:**T2**|11+|**T1** = tensor(float)<br/> **T2** = tensor(int8), tensor(uint8)|
|Einsum|*in* Inputs:**T**<br> *out* Output:**T**|12+|**T** = tensor(float), tensor(float16)|
|Elu|*in* X:**T**<br> *out* Y:**T**|6+|**T** = tensor(float), tensor(float16)|
|Equal|*in* A:**T**<br> *in* B:**T**<br> *out* C:**T1**|19+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T1** = tensor(bool)|
|||13+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T1** = tensor(bool)|
|||11+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T1** = tensor(bool)|
|||7+|**T** = tensor(float), tensor(float16)<br/> **T1** = tensor(bool)|
|Erf|*in* input:**T**<br> *out* output:**T**|13+|**T** = tensor(float), tensor(float16)|
|||9+|**T** = tensor(float), tensor(float16)|
|Exp|*in* input:**T**<br> *out* output:**T**|13+|**T** = tensor(float), tensor(float16)|
|||6+|**T** = tensor(float), tensor(float16)|
|Expand|*in* input:**T**<br> *in* shape:**tensor(int64)**<br> *out* output:**T**|13+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||8+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|EyeLike|*in* input:**T1**<br> *out* output:**T2**|9+|**T1** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T2** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Flatten|*in* input:**T**<br> *out* output:**T**|13+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||11+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||9+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||1+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Floor|*in* X:**T**<br> *out* Y:**T**|13+|**T** = tensor(float), tensor(float16)|
|||6+|**T** = tensor(float), tensor(float16)|
|GRU|*in* X:**T**<br> *in* W:**T**<br> *in* R:**T**<br> *in* B:**T**<br> *in* sequence_lens:**T1**<br> *in* initial_h:**T**<br> *out* Y:**T**<br> *out* Y_h:**T**|14+|**T** = tensor(float), tensor(float16)|
|||7+|**T** = tensor(float), tensor(float16)|
|Gather|*in* data:**T**<br> *in* indices:**Tind**<br> *out* output:**T**|13+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **Tind** = tensor(int32), tensor(int64)|
|||11+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)< br /> **Tind** = tensor(int32), tensor(int64)|
|||1+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)< br /> **Tind** = tensor(int32), tensor(int64)|
|GatherElements|*in* data:**T**<br> *in* indices:**Tind**<br> *out* output:**T**|13+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **Tind** = tensor(int32), tensor(int64)|
|||11+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)< br /> **Tind** = tensor(int32), tensor(int64)|
|GatherND|*in* data:**T**<br> *in* indices:**tensor(int64)**<br> *out* output:**T**|13+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||12+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||11+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Gemm|*in* A:**T**<br> *in* B:**T**<br> *in* C:**T**<br> *out* Y:**T**|13+|**T** = tensor(float), tensor(float16)|
|||11+|**T** = tensor(float), tensor(float16)|
|||9+|**T** = tensor(float), tensor(float16)|
|||7+|**T** = tensor(float), tensor(float16)|
|GlobalAveragePool|*in* X:**T**<br> *out* Y:**T**|1+|**T** = tensor(float), tensor(float16)|
|GlobalLpPool|*in* X:**T**<br> *out* Y:**T**|2+|**T** = tensor(float), tensor(float16)|
|GlobalMaxPool|*in* X:**T**<br> *out* Y:**T**|1+|**T** = tensor(float), tensor(float16)|
|Greater|*in* A:**T**<br> *in* B:**T**<br> *out* C:**T1**|13+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T1** = tensor(bool)|
|||9+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T1** = tensor(bool)|
|||7+|**T** = tensor(float), tensor(float16)<br/> **T1** = tensor(bool)|
|GreaterOrEqual|*in* A:**T**<br> *in* B:**T**<br> *out* C:**T1**|16+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T1** = tensor(bool)|
|||12+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T1** = tensor(bool)|
|GridSample|*in* X:**T1**<br> *in* grid:**T2**<br> *out* Y:**T1**|16+|**T1** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T2** = tensor(float), tensor(float16)|
|HardSigmoid|*in* X:**T**<br> *out* Y:**T**|6+|**T** = tensor(float), tensor(float16)|
|Hardmax|*in* input:**T**<br> *out* output:**T**|13+|**T** = tensor(float), tensor(float16)|
|||11+|**T** = tensor(float), tensor(float16)|
|||1+|**T** = tensor(float), tensor(float16)|
|Identity|*in* input:**T**<br> *out* output:**T**<br><br> or<br><br> *in* input:**V**<br> *out* output:**V**|19+|**V** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||16+|**V** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||14+|**V** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||13+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||1+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|If|*in* cond:**B**<br> *out* outputs:**V**|19+|**B** = tensor(bool)<br/> **V** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||16+|**B** = tensor(bool)<br/> **V** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||13+|**B** = tensor(bool)<br/> **V** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||11+|**B** = tensor(bool)<br/> **V** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||7+|**B** = tensor(bool)<br/> **V** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|ImageScaler|*in* input:**T**<br> *out* output:**T**|1+|**T** = tensor(float), tensor(float16)|
|InstanceNormalization|*in* input:**T**<br> *in* scale:**T**<br> *in* B:**T**<br> *out* output:**T**|6+|**T** = tensor(float), tensor(float16)|
|IsInf|*in* X:**T1**<br> *out* Y:**T2**|20+|**T1** = tensor(float)<br/> **T2** = tensor(bool)|
|||10+|**T1** = tensor(float)<br/> **T2** = tensor(bool)|
|IsNaN|*in* X:**T1**<br> *out* Y:**T2**|20+|**T1** = tensor(float), tensor(float16)<br/> **T2** = tensor(bool)|
|||13+|**T1** = tensor(float), tensor(float16)<br/> **T2** = tensor(bool)|
|||9+|**T1** = tensor(float), tensor(float16)<br/> **T2** = tensor(bool)|
|LRN|*in* X:**T**<br> *out* Y:**T**|13+|**T** = tensor(float), tensor(float16)|
|||1+|**T** = tensor(float), tensor(float16)|
|LSTM|*in* X:**T**<br> *in* W:**T**<br> *in* R:**T**<br> *in* B:**T**<br> *in* sequence_lens:**T1**<br> *in* initial_h:**T**<br> *in* initial_c:**T**<br> *in* P:**T**<br> *out* Y:**T**<br> *out* Y_h:**T**<br> *out* Y_c:**T**|14+|**T** = tensor(float), tensor(float16)|
|||7+|**T** = tensor(float), tensor(float16)|
|LayerNormalization|*in* X:**T**<br> *in* Scale:**T**<br> *in* B:**T**<br> *out* Y:**T**<br> *out* Mean:**U**<br> *out* InvStdDev:**U**<br><br> or<br><br> *in* X:**T**<br> *in* Scale:**V**<br> *in* B:**V**<br> *out* Y:**V**<br> *out* Mean:**U**<br> *out* InvStdDev:**U**|17+|**T** = tensor(float), tensor(float16)<br/> **U** = tensor(float)|
|||1+|**T** = tensor(float), tensor(float16)<br/> **U** = tensor(float), tensor(float16)<br/> **V** = tensor(float), tensor(float16)|
|LeakyRelu|*in* X:**T**<br> *out* Y:**T**|16+|**T** = tensor(float), tensor(float16)|
|||6+|**T** = tensor(float), tensor(float16)|
|Less|*in* A:**T**<br> *in* B:**T**<br> *out* C:**T1**|13+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T1** = tensor(bool)|
|||9+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T1** = tensor(bool)|
|||7+|**T** = tensor(float), tensor(float16)<br/> **T1** = tensor(bool)|
|LessOrEqual|*in* A:**T**<br> *in* B:**T**<br> *out* C:**T1**|16+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T1** = tensor(bool)|
|||12+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T1** = tensor(bool)|
|Log|*in* input:**T**<br> *out* output:**T**|13+|**T** = tensor(float), tensor(float16)|
|||6+|**T** = tensor(float), tensor(float16)|
|LogSoftmax|*in* input:**T**<br> *out* output:**T**|13+|**T** = tensor(float), tensor(float16)|
|||11+|**T** = tensor(float), tensor(float16)|
|||1+|**T** = tensor(float), tensor(float16)|
|LpNormalization|*in* input:**T**<br> *out* output:**T**|1+|**T** = tensor(float), tensor(float16)|
|LpPool|*in* X:**T**<br> *out* Y:**T**|18+|**T** = tensor(float), tensor(float16)|
|||11+|**T** = tensor(float), tensor(float16)|
|||2+|**T** = tensor(float), tensor(float16)|
|MatMul|*in* A:**T**<br> *in* B:**T**<br> *out* Y:**T**|13+|**T** = tensor(float), tensor(float16)|
|||9+|**T** = tensor(float), tensor(float16)|
|||1+|**T** = tensor(float), tensor(float16)|
|MatMulInteger|*in* A:**T1**<br> *in* B:**T2**<br> *in* a_zero_point:**T1**<br> *in* b_zero_point:**T2**<br> *out* Y:**T3**|10+|**T1** = tensor(int8), tensor(uint8)<br/> **T2** = tensor(int8), tensor(uint8)<br/> **T3** = tensor(int32)|
|Max|*in* data_0:**T**<br> *out* max:**T**|13+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||12+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||8+|**T** = tensor(float), tensor(float16)|
|||6+|**T** = tensor(float), tensor(float16)|
|MaxPool|*in* X:**T**<br> *out* Y:**T**<br><br> or<br><br> *in* X:**T**<br> *out* Y:**T**<br> *out* Indices:**I**|12+|**I** = tensor(int64)<br/> **T** = tensor(float), tensor(float16), tensor(int8), tensor(uint8)|
|||11+|**I** = tensor(int64)<br/> **T** = tensor(float), tensor(float16), tensor(int8), tensor(uint8)|
|||10+|**I** = tensor(int64)<br/> **T** = tensor(float), tensor(float16), tensor(int8), tensor(uint8)|
|||8+|**I** = tensor(int64)<br/> **T** = tensor(float), tensor(float16), tensor(int8), tensor(uint8)|
|||1+|**T** = tensor(float), tensor(float16)|
|MaxRoiPool|*in* X:**T**<br> *in* rois:**T**<br> *out* Y:**T**|1+|**T** = tensor(float), tensor(float16)|
|MaxUnpool|*in* X:**T1**<br> *in* I:**T2**<br> *in* output_shape:**T2**<br> *out* output:**T1**|11+|**T1** = tensor(float), tensor(float16)<br/> **T2** = tensor(int64)|
|||9+|**T1** = tensor(float), tensor(float16)<br/> **T2** = tensor(int64)|
|Mean|*in* data_0:**T**<br> *out* mean:**T**|13+|**T** = tensor(float), tensor(float16)|
|||8+|**T** = tensor(float), tensor(float16)|
|||6+|**T** = tensor(float), tensor(float16)|
|MeanVarianceNormalization|*in* X:**T**<br> *out* Y:**T**<br><br> or<br><br> *in* input:**T**<br> *out* output:**T**|13+|**T** = tensor(float), tensor(float16)|
|||9+|**T** = tensor(float), tensor(float16)|
|||1+|**T** = tensor(float), tensor(float16)|
|MemcpyFromHost|*in* X:**T**<br> *out* Y:**T**|1+|**T** = seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8)), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|MemcpyToHost|*in* X:**T**<br> *out* Y:**T**|1+|**T** = seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8)), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Min|*in* data_0:**T**<br> *out* min:**T**|13+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||12+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||8+|**T** = tensor(float), tensor(float16)|
|||6+|**T** = tensor(float), tensor(float16)|
|Mod|*in* A:**T**<br> *in* B:**T**<br> *out* C:**T**|13+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint8)|
|||10+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint8)|
|Mul|*in* A:**T**<br> *in* B:**T**<br> *out* C:**T**|14+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||13+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||7+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Neg|*in* X:**T**<br> *out* Y:**T**|13+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8)|
|||6+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8)|
|NonZero|*in* X:**T**<br> *out* Y:**tensor(int64)**|13+|**T** = tensor(bool), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint8)|
|||9+|**T** = tensor(bool), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint8)|
|Not|*in* X:**T**<br> *out* Y:**T**|1+|**T** = tensor(bool)|
|OneHot|*in* indices:**T1**<br> *in* depth:**T2**<br> *in* values:**T3**<br> *out* output:**T3**|11+|**T1** = tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)<br/> **T2** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T3** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||9+|**T1** = tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)<br/> **T2** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T3** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|OptionalGetElement|*in* input:**O**<br> *out* output:**V**|18+|**O** = optional(seq(tensor(bfloat16))), optional(seq(tensor(bool))), optional(seq(tensor(double))), optional(seq(tensor(float))), optional(seq(tensor(float16))), optional(seq(tensor(int16))), optional(seq(tensor(int32))), optional(seq(tensor(int64))), optional(seq(tensor(int8))), optional(seq(tensor(string))), optional(seq(tensor(uint16))), optional(seq(tensor(uint32))), optional(seq(tensor(uint64))), optional(seq(tensor(uint8))), optional(tensor(bfloat16)), optional(tensor(bool)), optional(tensor(double)), optional(tensor(float)), optional(tensor(float16)), optional(tensor(int16)), optional(tensor(int32)), optional(tensor(int64)), optional(tensor(int8)), optional(tensor(string)), optional(tensor(uint16)), optional(tensor(uint32)), optional(tensor(uint64)), optional(tensor(uint8)), seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(string)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8)), tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **V** = seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(string)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8)), tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||15+|**O** = optional(seq(tensor(bfloat16))), optional(seq(tensor(bool))), optional(seq(tensor(double))), optional(seq(tensor(float))), optional(seq(tensor(float16))), optional(seq(tensor(int16))), optional(seq(tensor(int32))), optional(seq(tensor(int64))), optional(seq(tensor(int8))), optional(seq(tensor(string))), optional(seq(tensor(uint16))), optional(seq(tensor(uint32))), optional(seq(tensor(uint64))), optional(seq(tensor(uint8))), optional(tensor(bfloat16)), optional(tensor(bool)), optional(tensor(double)), optional(tensor(float)), optional(tensor(float16)), optional(tensor(int16)), optional(tensor(int32)), optional(tensor(int64)), optional(tensor(int8)), optional(tensor(string)), optional(tensor(uint16)), optional(tensor(uint32)), optional(tensor(uint64)), optional(tensor(uint8))<br/> **V** = seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(string)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8)), tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|OptionalHasElement|*in* input:**O**<br> *out* output:**B**|18+|**B** = tensor(bool)<br/> **O** = optional(seq(tensor(bfloat16))), optional(seq(tensor(bool))), optional(seq(tensor(double))), optional(seq(tensor(float))), optional(seq(tensor(float16))), optional(seq(tensor(int16))), optional(seq(tensor(int32))), optional(seq(tensor(int64))), optional(seq(tensor(int8))), optional(seq(tensor(string))), optional(seq(tensor(uint16))), optional(seq(tensor(uint32))), optional(seq(tensor(uint64))), optional(seq(tensor(uint8))), optional(tensor(bfloat16)), optional(tensor(bool)), optional(tensor(double)), optional(tensor(float)), optional(tensor(float16)), optional(tensor(int16)), optional(tensor(int32)), optional(tensor(int64)), optional(tensor(int8)), optional(tensor(string)), optional(tensor(uint16)), optional(tensor(uint32)), optional(tensor(uint64)), optional(tensor(uint8)), seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(string)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8)), tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||15+|**B** = tensor(bool)<br/> **O** = optional(seq(tensor(bfloat16))), optional(seq(tensor(bool))), optional(seq(tensor(double))), optional(seq(tensor(float))), optional(seq(tensor(float16))), optional(seq(tensor(int16))), optional(seq(tensor(int32))), optional(seq(tensor(int64))), optional(seq(tensor(int8))), optional(seq(tensor(string))), optional(seq(tensor(uint16))), optional(seq(tensor(uint32))), optional(seq(tensor(uint64))), optional(seq(tensor(uint8))), optional(tensor(bfloat16)), optional(tensor(bool)), optional(tensor(double)), optional(tensor(float)), optional(tensor(float16)), optional(tensor(int16)), optional(tensor(int32)), optional(tensor(int64)), optional(tensor(int8)), optional(tensor(string)), optional(tensor(uint16)), optional(tensor(uint32)), optional(tensor(uint64)), optional(tensor(uint8))|
|Or|*in* A:**T**<br> *in* B:**T**<br> *out* C:**T1**|7+|**T** = tensor(bool)|
|PRelu|*in* X:**T**<br> *in* slope:**T**<br> *out* Y:**T**|16+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int8)|
|||9+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int8)|
|||7+|**T** = tensor(float), tensor(float16)|
|Pad|*in* data:**T**<br> *in* pads:**tensor(int64)**<br> *in* constant_value:**T**<br> *in* axes:**Tind**<br> *out* output:**T**<br><br> or<br><br> *in* data:**T**<br> *in* pads:**tensor(int64)**<br> *in* constant_value:**T**<br> *out* output:**T**<br><br> or<br><br> *in* data:**T**<br> *out* output:**T**|18+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||13+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||11+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||2+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|ParametricSoftplus|*in* X:**T**<br> *out* Y:**T**|1+|**T** = tensor(float), tensor(float16)|
|Pow|*in* X:**T**<br> *in* Y:**T**<br> *out* Z:**T**<br><br> or<br><br> *in* X:**T**<br> *in* Y:**T1**<br> *out* Z:**T**|15+|**T** = tensor(float), tensor(float16), tensor(int32)<br/> **T1** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint8)|
|||13+|**T** = tensor(float), tensor(float16), tensor(int32)<br/> **T1** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint8)|
|||12+|**T** = tensor(float), tensor(float16), tensor(int32)<br/> **T1** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint8)|
|||7+|**T** = tensor(float), tensor(float16)|
|QLinearConv|*in* x:**T1**<br> *in* x_scale:**tensor(float)**<br> *in* x_zero_point:**T1**<br> *in* w:**T2**<br> *in* w_scale:**tensor(float)**<br> *in* w_zero_point:**T2**<br> *in* y_scale:**tensor(float)**<br> *in* y_zero_point:**T3**<br> *in* B:**T4**<br> *out* y:**T3**|10+|**T1** = tensor(int8), tensor(uint8)<br/> **T2** = tensor(int8), tensor(uint8)<br/> **T3** = tensor(int8), tensor(uint8)<br/> **T4** = tensor(int32)|
|QLinearMatMul|*in* a:**T1**<br> *in* a_scale:**TS**<br> *in* a_zero_point:**T1**<br> *in* b:**T2**<br> *in* b_scale:**TS**<br> *in* b_zero_point:**T2**<br> *in* y_scale:**TS**<br> *in* y_zero_point:**T3**<br> *out* y:**T3**<br><br> or<br><br> *in* a:**T1**<br> *in* a_scale:**tensor(float)**<br> *in* a_zero_point:**T1**<br> *in* b:**T2**<br> *in* b_scale:**tensor(float)**<br> *in* b_zero_point:**T2**<br> *in* y_scale:**tensor(float)**<br> *in* y_zero_point:**T3**<br> *out* y:**T3**|10+|**T1** = tensor(int8), tensor(uint8)<br/> **T2** = tensor(int8), tensor(uint8)<br/> **T3** = tensor(int8), tensor(uint8)|
|QuantizeLinear|*in* x:**T1**<br> *in* y_scale:**T1**<br> *in* y_zero_point:**T2**<br> *out* y:**T2**<br><br> or<br><br> *in* x:**T1**<br> *in* y_scale:**tensor(float)**<br> *in* y_zero_point:**T2**<br> *out* y:**T2**|19+|**T1** = tensor(float), tensor(float16), tensor(int32)<br/> **T2** = tensor(int8), tensor(uint8)|
|||13+|**T1** = tensor(float), tensor(int32)<br/> **T2** = tensor(int8), tensor(uint8)|
|||10+|**T1** = tensor(float), tensor(int32)<br/> **T2** = tensor(int8), tensor(uint8)|
|RNN|*in* X:**T**<br> *in* W:**T**<br> *in* R:**T**<br> *in* B:**T**<br> *in* sequence_lens:**T1**<br> *in* initial_h:**T**<br> *out* Y:**T**<br> *out* Y_h:**T**|14+|**T** = tensor(float), tensor(float16)|
|||7+|**T** = tensor(float), tensor(float16)|
|Range|*in* start:**T**<br> *in* limit:**T**<br> *in* delta:**T**<br> *out* output:**T**|11+|**T** = tensor(float), tensor(int16), tensor(int32), tensor(int64)|
|Reciprocal|*in* X:**T**<br> *out* Y:**T**|13+|**T** = tensor(float), tensor(float16)|
|||6+|**T** = tensor(float), tensor(float16)|
|ReduceL1|*in* data:**T**<br> *in* axes:**tensor(int64)**<br> *out* reduced:**T**<br><br> or<br><br> *in* data:**T**<br> *out* reduced:**T**|18+|**T** = tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|||13+|**T** = tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|||11+|**T** = tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|||1+|**T** = tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|ReduceL2|*in* data:**T**<br> *in* axes:**tensor(int64)**<br> *out* reduced:**T**<br><br> or<br><br> *in* data:**T**<br> *out* reduced:**T**|18+|**T** = tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|||13+|**T** = tensor(float), tensor(float16)|
|||11+|**T** = tensor(float), tensor(float16)|
|||1+|**T** = tensor(float), tensor(float16)|
|ReduceLogSum|*in* data:**T**<br> *in* axes:**tensor(int64)**<br> *out* reduced:**T**<br><br> or<br><br> *in* data:**T**<br> *out* reduced:**T**|18+|**T** = tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|||13+|**T** = tensor(float), tensor(float16)|
|||11+|**T** = tensor(float), tensor(float16)|
|||1+|**T** = tensor(float), tensor(float16)|
|ReduceLogSumExp|*in* data:**T**<br> *in* axes:**tensor(int64)**<br> *out* reduced:**T**<br><br> or<br><br> *in* data:**T**<br> *out* reduced:**T**|18+|**T** = tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|||13+|**T** = tensor(float), tensor(float16)|
|||11+|**T** = tensor(float), tensor(float16)|
|||1+|**T** = tensor(float), tensor(float16)|
|ReduceMax|*in* data:**T**<br> *in* axes:**tensor(int64)**<br> *out* reduced:**T**<br><br> or<br><br> *in* data:**T**<br> *out* reduced:**T**|20+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||18+|**T** = tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|||13+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||12+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||11+|**T** = tensor(float), tensor(float16)|
|||1+|**T** = tensor(float), tensor(float16)|
|ReduceMean|*in* data:**T**<br> *in* axes:**tensor(int64)**<br> *out* reduced:**T**<br><br> or<br><br> *in* data:**T**<br> *out* reduced:**T**|18+|**T** = tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|||13+|**T** = tensor(float), tensor(float16)|
|||11+|**T** = tensor(float), tensor(float16)|
|||1+|**T** = tensor(float), tensor(float16)|
|ReduceMin|*in* data:**T**<br> *in* axes:**tensor(int64)**<br> *out* reduced:**T**<br><br> or<br><br> *in* data:**T**<br> *out* reduced:**T**|18+|**T** = tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|||13+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||12+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||11+|**T** = tensor(float), tensor(float16)|
|||1+|**T** = tensor(float), tensor(float16)|
|ReduceProd|*in* data:**T**<br> *in* axes:**tensor(int64)**<br> *out* reduced:**T**<br><br> or<br><br> *in* data:**T**<br> *out* reduced:**T**|18+|**T** = tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|||13+|**T** = tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|||11+|**T** = tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|||1+|**T** = tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|ReduceSum|*in* data:**T**<br> *in* axes:**tensor(int64)**<br> *out* reduced:**T**<br><br> or<br><br> *in* data:**T**<br> *out* reduced:**T**|13+|**T** = tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|||11+|**T** = tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|||1+|**T** = tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|ReduceSumSquare|*in* data:**T**<br> *in* axes:**tensor(int64)**<br> *out* reduced:**T**<br><br> or<br><br> *in* data:**T**<br> *out* reduced:**T**|18+|**T** = tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|||13+|**T** = tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|||11+|**T** = tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|||1+|**T** = tensor(float), tensor(float16), tensor(int32), tensor(int64), tensor(uint32), tensor(uint64)|
|Relu|*in* X:**T**<br> *out* Y:**T**|14+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int8)|
|||13+|**T** = tensor(float), tensor(float16)|
|||6+|**T** = tensor(float), tensor(float16)|
|Reshape|*in* data:**T**<br> *in* shape:**tensor(int64)**<br> *out* reshaped:**T**<br><br> or<br><br> *in* data:**T**<br> *out* reshaped:**T**|19+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||14+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||13+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||5+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Resize|*in* X:**T**<br> *in* scales:**tensor(float)**<br> *out* Y:**T**<br><br> or<br><br> *in* X:**T1**<br> *in* roi:**T2**<br> *in* scales:**tensor(float)**<br> *in* sizes:**tensor(int64)**<br> *out* Y:**T1**|13+|**T1** = tensor(float), tensor(float16), tensor(int8), tensor(uint8)<br/> **T2** = tensor(float), tensor(float16)|
|||11+|**T1** = tensor(float), tensor(float16), tensor(int8), tensor(uint8)<br/> **T2** = tensor(float), tensor(float16)|
|||10+|**T** = tensor(float), tensor(float16)|
|ReverseSequence|*in* input:**T**<br> *in* sequence_lens:**tensor(int64)**<br> *out* Y:**T**|10+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|RoiAlign|*in* X:**T1**<br> *in* rois:**T1**<br> *in* batch_indices:**T2**<br> *out* Y:**T1**|16+|**T1** = tensor(float), tensor(float16)<br/> **T2** = tensor(int32), tensor(int64)|
|||10+|**T1** = tensor(float), tensor(float16)<br/> **T2** = tensor(int32), tensor(int64)|
|Round|*in* X:**T**<br> *out* Y:**T**|11+|**T** = tensor(float), tensor(float16)|
|STFT|*in* signal:**T1**<br> *in* frame_step:**T2**<br> *in* window:**T1**<br> *in* frame_length:**T2**<br> *out* output:**T1**|17+|**T1** = tensor(float), tensor(float16)<br/> **T2** = tensor(int32), tensor(int64)|
|ScaledTanh|*in* input:**T**<br> *out* output:**T**|1+|**T** = tensor(float), tensor(float16)|
|Scatter|*in* data:**T**<br> *in* indices:**Tind**<br> *in* updates:**T**<br> *out* output:**T**|13+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **Tind** = tensor(int32), tensor(int64)|
|||11+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **Tind** = tensor(int32), tensor(int64)|
|||9+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **Tind** = tensor(int32), tensor(int64)|
|ScatterElements|*in* data:**T**<br> *in* indices:**Tind**<br> *in* updates:**T**<br> *out* output:**T**|16+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **Tind** = tensor(int32), tensor(int64)|
|||13+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **Tind** = tensor(int32), tensor(int64)|
|||11+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **Tind** = tensor(int32), tensor(int64)|
|ScatterND|*in* data:**T**<br> *in* indices:**tensor(int64)**<br> *in* updates:**T**<br> *out* output:**T**|16+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||13+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||11+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Selu|*in* X:**T**<br> *out* Y:**T**|6+|**T** = tensor(float), tensor(float16)|
|SequenceAt|*in* input_sequence:**S**<br> *in* position:**I**<br> *out* tensor:**T**|11+|**I** = tensor(int32), tensor(int64)<br/> **S** = seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8))<br/> **T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|SequenceConstruct|*in* inputs:**T**<br> *out* output_sequence:**S**|11+|**S** = seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8))<br/> **T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|SequenceEmpty|*out* output:**S**|11+|**S** = seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8))|
|SequenceErase|*in* input_sequence:**S**<br> *in* position:**I**<br> *out* output_sequence:**S**|11+|**I** = tensor(int32), tensor(int64)<br/> **S** = seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8))|
|SequenceInsert|*in* input_sequence:**S**<br> *in* tensor:**T**<br> *in* position:**I**<br> *out* output_sequence:**S**|11+|**I** = tensor(int32), tensor(int64)<br/> **S** = seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8))|
|SequenceLength|*in* input_sequence:**S**<br> *out* length:**I**|11+|**I** = tensor(int64)<br/> **S** = seq(tensor(bfloat16)), seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8))|
|Shape|*in* data:**T**<br> *out* shape:**T1**|19+|**T** = seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8)), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T1** = tensor(int64)|
|||15+|**T** = seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8)), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T1** = tensor(int64)|
|||13+|**T** = seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8)), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T1** = tensor(int64)|
|||1+|**T** = seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8)), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T1** = tensor(int64)|
|Shrink|*in* input:**T**<br> *out* output:**T**|9+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint8)|
|Sigmoid|*in* X:**T**<br> *out* Y:**T**|13+|**T** = tensor(float), tensor(float16)|
|||6+|**T** = tensor(float), tensor(float16)|
|Sign|*in* input:**T**<br> *out* output:**T**|13+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||9+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|SimplifiedLayerNormalization|*in* X:**T**<br> *in* scale:**V**<br> *out* Y:**V**<br> *out* inv_std_var:**U**|1+|**T** = tensor(float), tensor(float16)<br/> **U** = tensor(float), tensor(float16)<br/> **V** = tensor(float), tensor(float16)|
|Sin|*in* input:**T**<br> *out* output:**T**|7+|**T** = tensor(float), tensor(float16)|
|Sinh|*in* input:**T**<br> *out* output:**T**|9+|**T** = tensor(float), tensor(float16)|
|Size|*in* data:**T**<br> *out* size:**T1**|19+|**T** = seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8)), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T1** = tensor(int64)|
|||13+|**T** = seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8)), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T1** = tensor(int64)|
|||1+|**T** = seq(tensor(bool)), seq(tensor(double)), seq(tensor(float)), seq(tensor(float16)), seq(tensor(int16)), seq(tensor(int32)), seq(tensor(int64)), seq(tensor(int8)), seq(tensor(uint16)), seq(tensor(uint32)), seq(tensor(uint64)), seq(tensor(uint8)), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **T1** = tensor(int64)|
|Slice|*in* data:**T**<br> *in* starts:**Tind**<br> *in* ends:**Tind**<br> *in* axes:**Tind**<br> *in* steps:**Tind**<br> *out* output:**T**<br><br> or<br><br> *in* data:**T**<br> *out* output:**T**|13+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **Tind** = tensor(int32), tensor(int64)|
|||11+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **Tind** = tensor(int32), tensor(int64)|
|||10+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **Tind** = tensor(int32), tensor(int64)|
|||1+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Softmax|*in* input:**T**<br> *out* output:**T**|13+|**T** = tensor(float), tensor(float16)|
|||11+|**T** = tensor(float), tensor(float16)|
|||1+|**T** = tensor(float), tensor(float16)|
|Softplus|*in* X:**T**<br> *out* Y:**T**|1+|**T** = tensor(float), tensor(float16)|
|Softsign|*in* input:**T**<br> *out* output:**T**|1+|**T** = tensor(float), tensor(float16)|
|SpaceToDepth|*in* input:**T**<br> *out* output:**T**|13+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||1+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Split|*in* input:**T**<br> *in* split:**T**<br> *out* outputs...:**T**<br><br> or<br><br> *in* input:**T**<br> *in* split:**tensor(int64)**<br> *out* outputs:**T**<br><br> or<br><br> *in* input:**T**<br> *out* outputs:**T**|18+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||13+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||11+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||2+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Sqrt|*in* X:**T**<br> *out* Y:**T**|13+|**T** = tensor(float), tensor(float16)|
|||6+|**T** = tensor(float), tensor(float16)|
|Squeeze|*in* data:**T**<br> *in* axes:**tensor(int64)**<br> *out* squeezed:**T**<br><br> or<br><br> *in* data:**T**<br> *out* squeezed:**T**|13+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||11+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||1+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Sub|*in* A:**T**<br> *in* B:**T**<br> *out* C:**T**|14+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||13+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||7+|**T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Sum|*in* data_0:**T**<br> *out* sum:**T**|13+|**T** = tensor(float), tensor(float16)|
|||8+|**T** = tensor(float), tensor(float16)|
|||6+|**T** = tensor(float), tensor(float16)|
|Tan|*in* input:**T**<br> *out* output:**T**|7+|**T** = tensor(float), tensor(float16)|
|Tanh|*in* input:**T**<br> *out* output:**T**|13+|**T** = tensor(float), tensor(float16)|
|||6+|**T** = tensor(float), tensor(float16)|
|ThresholdedRelu|*in* X:**T**<br> *out* Y:**T**|10+|**T** = tensor(float), tensor(float16)|
|||1+|**T** = tensor(float), tensor(float16)|
|Tile|*in* input:**T**<br> *in* repeats:**T1**<br> *out* output:**T**<br><br> or<br><br> *in* input:**T**<br> *in* tiles:**T**<br> *in* axis:**T**<br> *out* output:**T**|13+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||6+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|TopK|*in* X:**T**<br> *in* K:**tensor(int64)**<br> *out* Values:**T**<br> *out* Indices:**I**<br><br> or<br><br> *in* X:**T**<br> *out* Values:**T**<br> *out* Indices:**I**|11+|**I** = tensor(int64)<br/> **T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||10+|**I** = tensor(int64)<br/> **T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||1+|**I** = tensor(int64)<br/> **T** = tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Transpose|*in* data:**T**<br> *out* transposed:**T**|13+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||1+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Trilu|*in* input:**T**<br> *in* k:**tensor(int64)**<br> *out* output:**T**|14+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Unsqueeze|*in* data:**T**<br> *in* axes:**tensor(int64)**<br> *out* expanded:**T**<br><br> or<br><br> *in* data:**T**<br> *out* expanded:**T**|13+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||11+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||1+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Upsample|*in* X:**T**<br> *in* scales:**tensor(float)**<br> *out* Y:**T**<br><br> or<br><br> *in* X:**T**<br> *out* Y:**T**|10+|**T** = tensor(float), tensor(float16)|
|||9+|**T** = tensor(float), tensor(float16)|
|||7+|**T** = tensor(float), tensor(float16)|
|Where|*in* condition:**B**<br> *in* X:**T**<br> *in* Y:**T**<br> *out* output:**T**|16+|**B** = tensor(bool)<br/> **T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|||9+|**B** = tensor(bool)<br/> **T** = tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Xor|*in* A:**T**<br> *in* B:**T**<br> *out* C:**T1**|7+|**T** = tensor(bool)|
| |
| |
|**Operator Domain:** *com.microsoft* ||||
|Attention|*in* input:**T**<br> *in* weights:**T**<br> *in* bias:**T**<br> *in* mask_index:**M**<br> *in* past:**T**<br> *in* relative_position_bias:**T**<br> *in* past_sequence_length:**M**<br> *out* output:**T**<br> *out* present:**T**|1+|**M** = tensor(int32)<br/> **T** = tensor(float), tensor(float16)|
|BiasAdd|*in* X:**T**<br> *in* bias:**T**<br> *in* skip:**T**<br> *out* Y:**T**|1+|**T** = tensor(float), tensor(float16)|
|BiasGelu|*in* A:**T**<br> *in* B:**T**<br> *out* C:**T**|1+|**T** = tensor(float), tensor(float16)|
|BiasSplitGelu|*in* X:**T**<br> *in* bias:**T**<br> *out* Y:**T**|1+|**T** = tensor(float), tensor(float16)|
|ConvTransposeWithDynamicPads|*in* X:**T**<br> *in* W:**T**<br> *in* Pads:**tensor(int64)**<br> *in* B:**T**<br> *out* Y:**T**|1+|**T** = tensor(float), tensor(float16)|
|DequantizeLinear|*in* x:**T1**<br> *in* x_scale:**T2**<br> *in* x_zero_point:**T1**<br> *out* y:**T2**|1+|**T1** = tensor(int32), tensor(int8), tensor(uint8)<br/> **T2** = tensor(float), tensor(float16)|
|DynamicQuantizeMatMul|*in* A:**T1**<br> *in* B:**T2**<br> *in* b_scale:**T1**<br> *in* b_zero_point:**T2**<br> *in* bias:**T1**<br> *out* Y:**T1**|1+|**T1** = tensor(float)<br/> **T2** = tensor(int8), tensor(uint8)|
|EmbedLayerNormalization|*in* input_ids:**T1**<br> *in* segment_ids:**T1**<br> *in* word_embedding:**T**<br> *in* position_embedding:**T**<br> *in* segment_embedding:**T**<br> *in* gamma:**T**<br> *in* beta:**T**<br> *in* mask:**T1**<br> *in* position_ids:**T1**<br> *out* output:**T**<br> *out* mask_index:**T1**<br> *out* embedding_sum:**T**|1+|**T** = tensor(float), tensor(float16)|
|FastGelu|*in* X:**T**<br> *in* bias:**T**<br> *out* Y:**T**|1+|**T** = tensor(float), tensor(float16)|
|FusedMatMul|*in* A:**T**<br> *in* B:**T**<br> *out* Y:**T**|1+|**T** = tensor(float), tensor(float16)|
|FusedMatMulActivation|*in* A:**T**<br> *in* B:**T**<br> *out* Y:**T**|1+|**T** = tensor(float), tensor(float16)|
|Gelu|*in* X:**T**<br> *out* Y:**T**|1+|**T** = tensor(float), tensor(float16)|
|GroupNorm|*in* X:**T**<br> *in* gamma:**M**<br> *in* beta:**M**<br> *out* Y:**T**|1+|**M** = tensor(float), tensor(float16)<br/> **T** = tensor(float), tensor(float16)|
|GroupQueryAttention|*in* query:**T**<br> *in* key:**T**<br> *in* value:**T**<br> *in* past_key:**T**<br> *in* past_value:**T**<br> *in* seqlens_k:**M**<br> *in* total_sequence_length:**M**<br> *in* cos_cache:**T**<br> *in* sin_cache:**T**<br> *out* output:**T**<br> *out* present_key:**T**<br> *out* present_value:**T**|1+|**M** = tensor(int32)<br/> **T** = tensor(float), tensor(float16)|
|MatMulIntegerToFloat|*in* A:**T1**<br> *in* B:**T2**<br> *in* a_scale:**T3**<br> *in* b_scale:**T3**<br> *in* a_zero_point:**T1**<br> *in* b_zero_point:**T2**<br> *in* bias:**T3**<br> *out* Y:**T3**|1+|**T1** = tensor(int8), tensor(uint8)<br/> **T2** = tensor(int8), tensor(uint8)<br/> **T3** = tensor(float), tensor(float16)|
|MatMulNBits|*in* A:**T1**<br> *in* B:**T2**<br> *in* scales:**T1**<br> *in* zero_points:**T3**<br> *in* g_idx:**T4**<br> *out* Y:**T1**|1+|**T1** = tensor(float), tensor(float16)<br/> **T2** = tensor(uint8)|
|MultiHeadAttention|*in* query:**T**<br> *in* key:**T**<br> *in* value:**T**<br> *in* bias:**T**<br> *in* key_padding_mask:**M**<br> *in* relative_position_bias:**T**<br> *in* past_key:**T**<br> *in* past_value:**T**<br> *out* output:**T**<br> *out* present_key:**T**<br> *out* present_value:**T**|1+|**M** = tensor(int32)<br/> **T** = tensor(float), tensor(float16)|
|NhwcConv|*in* X:**T**<br> *in* W:**T**<br> *in* B:**T**<br> *out* Y:**T**|1+|**T** = tensor(float), tensor(float16)|
|QAttention|*in* input:**T1**<br> *in* weight:**T2**<br> *in* bias:**T3**<br> *in* input_scale:**T3**<br> *in* weight_scale:**T3**<br> *in* mask_index:**T4**<br> *in* input_zero_point:**T1**<br> *in* weight_zero_point:**T2**<br> *in* past:**T3**<br> *out* output:**T3**<br> *out* present:**T3**|1+|**T1** = tensor(int8), tensor(uint8)<br/> **T2** = tensor(int8), tensor(uint8)<br/> **T3** = tensor(float), tensor(float16)<br/> **T4** = tensor(int32)|
|QLinearAdd|*in* A:**T**<br> *in* A_scale:**tensor(float)**<br> *in* A_zero_point:**T**<br> *in* B:**T**<br> *in* B_scale:**tensor(float)**<br> *in* B_zero_point:**T**<br> *in* C_scale:**tensor(float)**<br> *in* C_zero_point:**T**<br> *out* C:**T**|1+|**T** = tensor(int8), tensor(uint8)|
|QLinearAveragePool|*in* X:**T**<br> *in* x_scale:**tensor(float)**<br> *in* x_zero_point:**T**<br> *in* y_scale:**tensor(float)**<br> *in* y_zero_point:**T**<br> *out* Y:**T**|1+|**T** = tensor(int8), tensor(uint8)|
|QLinearConcat|*in* Y_scale:**TF**<br> *in* Y_zero_point:**T8**<br> *in* inputs:**TV**<br> *out* Y:**T8**|1+|**T8** = tensor(int8), tensor(uint8)<br/> **TF** = tensor(float)<br/> **TV** = tensor(float), tensor(int8), tensor(uint8)|
|QLinearGlobalAveragePool|*in* X:**T**<br> *in* x_scale:**tensor(float)**<br> *in* x_zero_point:**T**<br> *in* y_scale:**tensor(float)**<br> *in* y_zero_point:**T**<br> *out* Y:**T**|1+|**T** = tensor(int8), tensor(uint8)|
|QLinearSigmoid|*in* X:**T**<br> *in* X_scale:**tensor(float)**<br> *in* X_zero_point:**T**<br> *in* Y_scale:**tensor(float)**<br> *in* Y_zero_point:**T**<br> *out* Y:**T**|1+|**T** = tensor(int8), tensor(uint8)|
|QuantizeLinear|*in* x:**T1**<br> *in* y_scale:**T1**<br> *in* y_zero_point:**T2**<br> *out* y:**T2**|1+|**T1** = tensor(float), tensor(float16), tensor(int32)<br/> **T2** = tensor(int8), tensor(uint8)|
|QuickGelu|*in* X:**T**<br> *out* Y:**T**|1+|**T** = tensor(float), tensor(float16)|
|RotaryEmbedding|*in* input:**T**<br> *in* position_ids:**M**<br> *in* cos_cache:**T**<br> *in* sin_cache:**T**<br> *out* output:**T**|1+|**M** = tensor(int64)<br/> **T** = tensor(float), tensor(float16)|
|SkipLayerNormalization|*in* input:**T**<br> *in* skip:**T**<br> *in* gamma:**T**<br> *in* beta:**T**<br> *in* bias:**T**<br> *out* output:**T**<br> *out* mean:**U**<br> *out* inv_std_var:**U**<br> *out* input_skip_bias_sum:**T**|1+|**T** = tensor(float), tensor(float16)|
|SkipSimplifiedLayerNormalization|*in* input:**T**<br> *in* skip:**T**<br> *in* gamma:**T**<br> *in* bias:**T**<br> *out* output:**T**<br> *out* mean:**U**<br> *out* inv_std_var:**U**<br> *out* input_skip_bias_sum:**T**|1+|**T** = tensor(float), tensor(float16)|
| |
| |
|**Operator Domain:** *com.microsoft.dml* ||||
|DmlFusedAdd|*in* A:**T**<br> *in* B:**T**<br> *out* C:**T**|1+|**T** = tensor(float), tensor(float16)|
|DmlFusedBatchNormalization|*in* X:**T**<br> *in* scale:**T**<br> *in* B:**T**<br> *in* mean:**T**<br> *in* var:**T**<br> *out* Y:**T**<br> *out* mean:**T**<br> *out* var:**T**<br> *out* saved_mean:**T**<br> *out* saved_var:**T**|1+|**T** = tensor(float), tensor(float16)|
|DmlFusedConv|*in* X:**T**<br> *in* W:**T**<br> *in* B:**T**<br> *out* Y:**T**|1+|**T** = tensor(float), tensor(float16)|
|DmlFusedConvTranspose|*in* X:**T**<br> *in* W:**T**<br> *in* B:**T**<br> *out* Y:**T**|1+|**T** = tensor(float), tensor(float16)|
|DmlFusedGemm|*in* A:**T**<br> *in* B:**T**<br> *in* C:**T**<br> *out* Y:**T**|1+|**T** = tensor(float), tensor(float16)|
|DmlFusedInstanceNormalization|*in* input:**T**<br> *in* scale:**T**<br> *in* B:**T**<br> *out* output:**T**|1+|**T** = tensor(float), tensor(float16)|
|DmlFusedMatMul|*in* A:**T**<br> *in* B:**T**<br> *out* Y:**T**|1+|**T** = tensor(float), tensor(float16)|
|DmlFusedMeanVarianceNormalization|*in* input:**T**<br> *out* output:**T**|1+|**T** = tensor(float), tensor(float16)|
|DmlFusedSum|*in* data_0:**T**<br> *out* sum:**T**|1+|**T** = tensor(float), tensor(float16)|
| |
| |
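
In the OpSet Version column, `N+` marks the opset version from which a registered kernel applies. The CPU table also uses `[a, b]` for a closed range and a bare `N` for a single version. In the DmlExecutionProvider rows, every entry is an open-ended `N+`, so a later row implicitly bounds an earlier one. For example, the Softmax rows `13+`, `11+`, `1+` mean that opset 11 and 12 models use the `11+` kernel.

The sketch below shows one way to map a model's opset to the matching row. It assumes the standard ONNX versioning rule: the applicable kernel is the covering row with the largest starting version that does not exceed the model's opset. `parse_opset_cell` and `select_kernel` are hypothetical helpers written for illustration. They are not ONNX Runtime APIs.

```python
import re
from typing import List, Optional, Tuple


def parse_opset_cell(cell: str) -> Tuple[int, Optional[int]]:
    """Parse an OpSet Version cell into (first, last); last is None for open-ended 'N+'."""
    cell = cell.strip()
    m = re.fullmatch(r"\[(\d+),\s*(\d+)\]", cell)
    if m:  # closed range such as "[7, 12]"
        return int(m.group(1)), int(m.group(2))
    m = re.fullmatch(r"(\d+)\+", cell)
    if m:  # open-ended entry such as "13+"
        return int(m.group(1)), None
    m = re.fullmatch(r"(\d+)", cell)
    if m:  # single version such as "13"
        v = int(m.group(1))
        return v, v
    raise ValueError(f"unrecognized OpSet Version cell: {cell!r}")


def select_kernel(cells: List[str], model_opset: int) -> Optional[str]:
    """Return the cell whose kernel would serve model_opset, or None if no row covers it.

    Hypothetical helper: among all rows that cover model_opset, the one with the
    highest starting version wins, so for opset 14 the "13+" row beats "11+" and "1+".
    """
    best: Optional[str] = None
    best_start = -1
    for cell in cells:
        first, last = parse_opset_cell(cell)
        covers = first <= model_opset and (last is None or model_opset <= last)
        if covers and first > best_start:
            best, best_start = cell, first
    return best


# Softmax rows in the DmlExecutionProvider table above: 13+, 11+, 1+
print(select_kernel(["13+", "11+", "1+"], 12))  # 11+
print(select_kernel(["13+", "11+", "1+"], 17))  # 13+
# Trilu is only registered from opset 14
print(select_kernel(["14+"], 13))  # None
```

A `None` result means the table lists no kernel for that operator at that opset. In that case the node would typically be left for another execution provider, such as the CPU fallback.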