diff --git a/docs/ContribOperators.md b/docs/ContribOperators.md index 00cc18e058..9fba550e0a 100644 --- a/docs/ContribOperators.md +++ b/docs/ContribOperators.md @@ -20,6 +20,7 @@ Do not modify directly.* * com.microsoft.DecoderAttention * com.microsoft.DequantizeBFP * com.microsoft.DequantizeLinear + * com.microsoft.DequantizeWithOrder * com.microsoft.DynamicQuantizeLSTM * com.microsoft.DynamicQuantizeMatMul * com.microsoft.EmbedLayerNormalization @@ -57,11 +58,14 @@ Do not modify directly.* * com.microsoft.QLinearReduceMean * com.microsoft.QLinearSigmoid * com.microsoft.QLinearSoftmax + * com.microsoft.QOrderedAttention * com.microsoft.QOrderedGelu * com.microsoft.QOrderedLayerNormalization + * com.microsoft.QOrderedLongformerAttention * com.microsoft.QOrderedMatMul * com.microsoft.QuantizeBFP * com.microsoft.QuantizeLinear + * com.microsoft.QuantizeWithOrder * com.microsoft.Range * com.microsoft.ReduceSumInteger * com.microsoft.Rfft @@ -989,7 +993,9 @@ This version of the operator has been available since version 1 of the 'com.micr ### **com.microsoft.DequantizeBFP** - The BFP dequantization operator. It consumes the raw BFP data and some metadata such as the shape and strides of the original tensor and computes the dequantized tensor. + The BFP dequantization operator. + It consumes the raw BFP data and some metadata such as the shape and strides of the original tensor and computes the dequantized tensor. + More documentation on the BFP format can be found in this paper: https://www.microsoft.com/en-us/research/publication/pushing-the-limits-of-narrow-precision-inferencing-at-cloud-scale-with-microsoft-floating-point/ #### Version @@ -1000,8 +1006,8 @@ This version of the operator has been available since version 1 of the 'com.micr
bfp_type : int (required)
The type of BFP - must match with the BFPType enum
-
block_dims : list of ints
-
Numbers within a bounding box will span across these dimensions.Any dimension not in this list is the same for all numbers within a bounding box.As an example, consider a 2D tensor with shape [d0, d1] and block_dims equal to [1].Within a bounding box, all elements will be within the same row but will be from different columnns.The default is the last dimension.
+
block_dim : int
+
Each bounding box spans this dimension. Typically, the block dimension corresponds to the reduction dimension of the matrix multiplication that consumes the output of this operator. For example, for a 2D matrix multiplication A@W, QuantizeBFP(A) would use block_dim 1 and QuantizeBFP(W) would use block_dim 0. The default is the last dimension.
dtype : int
The datatype to dequantize to.
@@ -1081,6 +1087,53 @@ This version of the operator has been available since version 1 of the 'com.micr +### **com.microsoft.DequantizeWithOrder** + + Dequantize input matrix to specific layout used in cublaslt. attr to specify output type, float16 or float32 + +#### Version + +This version of the operator has been available since version 1 of the 'com.microsoft' operator set. + +#### Attributes + +
+
order_input : int (required)
+
cublasLt order of input matrix. See the schema of QuantizeWithOrder for order definition.
+
order_output : int (required)
+
cublasLt order of output matrix
+
to : int (required)
+
The output data type. Only TensorProto_DataType_FLOAT (1) and TensorProto_DataType_FLOAT16 (10) are supported.
+
+ +#### Inputs + +
+
input : Q
+
TODO: input tensor of (ROWS, COLS). If less than 2D, it is broadcast to (1, X). If 3D, it is treated as (B, ROWS, COLS).
+
scale_input : S
+
scale of the input
+
+ +#### Outputs + +
+
output : F
+
output tensor
+
+ +#### Type Constraints + +
+
Q : tensor(int8)
+
Constrain input and output types to int8 tensors.
+
F : tensor(float16), tensor(float)
+
Constrain to float types
+
S : tensor(float)
+
Constrain Scale to float32 types
+
+ + ### **com.microsoft.DynamicQuantizeLSTM** #### Version @@ -2916,6 +2969,105 @@ This version of the operator has been available since version 1 of the 'com.micr +### **com.microsoft.QOrderedAttention** + + Quantized version of simplified Multi-Head Self Attention(using int8 with specific matrix Layout). + Multi-Head Self Attention that can be either unidirectional (like GPT-2) or bidirectional (like BERT). + The mask_index input is optional. Besides raw attention mask with shape (batch_size, past_sequence_length + sequence_length) + or (batch_size, sequence_length, past_sequence_length + sequence_length) with value 0 for masked and 1 otherwise, + we also support other two formats: When input has right-side padding, mask_index is one dimension with shape (batch_size), + where value of each element is the end position, or valid length of actual sequence excluding padding. When input has + left-side padding, mask_index has shape (2 * batch_size), where the values are the exclusive end positions followed by + the inclusive start positions. When unidirectional is 1, and each token only attend to previous tokens. For GPT-2, both past + and present state are optional. Present state could appear in output even when past state is not in input. + Current version does not support past/present, extra_add and qkv_hidden_sizes. + TODO: Support them if needed in the future. + +#### Version + +This version of the operator has been available since version 1 of the 'com.microsoft' operator set. + +#### Attributes + +
+
num_heads : int (required)
+
Number of attention heads
+
order_input : int (required)
+
cublasLt order of input matrix. See the schema of QuantizeWithOrder for order definition.
+
order_output : int (required)
+
cublasLt order of output matrix
+
order_weight : int (required)
+
cublasLt order of weight matrix
+
qkv_hidden_sizes : list of ints
+
Hidden layer sizes of Q, K, V paths in Attention
+
unidirectional : int
+
Whether every token can only attend to previous tokens. Default value is 0.
+
+ +#### Inputs (17 - 20) + +
+
input : Q
+
3D input tensor with shape (batch_size, sequence_length, input_hidden_size)
+
scale_input : S
+
scale of the input, scalar value (per tensor) currently.
+
scale_Q_gemm : S
+
scale of the gemm - scalar (per-tensor quantization)
+
scale_K_gemm : S
+
scale of the gemm - scalar (per-tensor quantization)
+
scale_V_gemm : S
+
scale of the gemm - scalar (per-tensor quantization)
+
Q_weight : Q
+
2D input tensor with shape (input_hidden_size, hidden_size), where hidden_size = num_heads * head_size
+
K_weight : Q
+
2D input tensor with shape (input_hidden_size, hidden_size), where hidden_size = num_heads * head_size
+
V_weight : Q
+
2D input tensor with shape (input_hidden_size, hidden_size), where hidden_size = num_heads * head_size
+
scale_Q_weight : S
+
scale of the weight (scalar for per-tensor quantization or 1-D of dims [hidden_size] for per-channel quantization)
+
scale_K_weight : S
+
scale of the weight (scalar for per-tensor quantization or 1-D of dims [hidden_size] for per-channel quantization)
+
scale_V_weight : S
+
scale of the weight (scalar for per-tensor quantization or 1-D of dims [hidden_size] for per-channel quantization)
+
Q_bias : S
+
1D input tensor with shape (hidden_size)
+
K_bias : S
+
1D input tensor with shape (hidden_size)
+
V_bias : S
+
1D input tensor with shape (hidden_size)
+
scale_QKT_gemm (optional) : S
+
scale of the gemm - scalar (per-tensor quantization)
+
scale_QKT_softmax (optional) : S
+
scale of the softmax result - scalar (per-tensor quantization)
+
scale_values_gemm : S
+
scale of the gemm - scalar (per-tensor quantization). Also this is the output scale for the operator.
+
mask_index (optional) : G
+
Attention mask with shape (batch_size, 1, max_sequence_length, max_sequence_length), (batch_size, past_sequence_length + sequence_length) or (batch_size, sequence_length, past_sequence_length + sequence_length), or index with shape (batch_size) or (2 * batch_size).
+
past (optional) : Q
+
past state for key and value with shape (2, batch_size, num_heads, past_sequence_length, head_size).
+
extra_add (optional) : S
+
additional add to QxK' with shape (batch_size, num_heads, sequence_length, sequence_length).
+
+ +#### Outputs + +
+
output : Q
+
3D output tensor with shape (batch_size, sequence_length, hidden_size)
+
+ +#### Type Constraints + +
+
Q : tensor(int8)
+
Constrain input and output types to int8 tensors.
+
S : tensor(float)
+
Constrain scales to float32 tensors.
+
G : tensor(int32)
+
Constrain to integer types
+
+ + ### **com.microsoft.QOrderedGelu** Ordered Quantize Gelu. @@ -2928,9 +3080,9 @@ This version of the operator has been available since version 1 of the 'com.micr
order_X : int
-
cublasLt order of input X. Default is ROW MAJOR.
+
cublasLt order of input X. Optional. See the schema of QuantizeWithOrder for order definition.
order_Y : int
-
cublasLt order of matrix Y, must be same as order_X. Default is ROW MAJOR.
+
cublasLt order of matrix Y, must be same as order_X if specified together. Optional.
#### Inputs @@ -2977,7 +3129,7 @@ This version of the operator has been available since version 1 of the 'com.micr
epsilon : float
The epsilon value to use to avoid division by zero.
order_X : int
-
cublasLt order of input X. Default is ROW MAJOR.
+
cublasLt order of input X. Default is ROW MAJOR. See the schema of QuantizeWithOrder for order definition.
order_Y : int
cublasLt order of matrix Y, must be same as order_X. Default is ROW MAJOR.
@@ -3016,9 +3168,9 @@ This version of the operator has been available since version 1 of the 'com.micr -### **com.microsoft.QOrderedMatMul** +### **com.microsoft.QOrderedLongformerAttention** - TODO + Quantized version of Longformer Self Attention (using int8 with specific matrix Layout). #### Version @@ -3027,12 +3179,100 @@ This version of the operator has been available since version 1 of the 'com.micr #### Attributes
-
order_A : int
-
cublasLt order of matrix A. Default is ROW MAJOR.
-
order_B : int
-
cublasLt order of matrix B. Default is ROW MAJOR.
-
order_Y : int
-
cublasLt order of matrix Y and optional matrix C. Default is ROW MAJOR.
+
num_heads : int (required)
+
Number of attention heads
+
order_global_weight : int (required)
+
cublasLt order of weight matrix
+
order_input : int (required)
+
cublasLt order of input matrix. See the schema of QuantizeWithOrder for order definition.
+
order_output : int (required)
+
cublasLt order of output matrix
+
order_weight : int (required)
+
cublasLt order of weight matrix
+
window : int (required)
+
One-sided attention window length W, i.e. half of the total window length
+
+ +#### Inputs + +
+
input : Q
+
3D input tensor with shape (batch_size, sequence_length, hidden_size), hidden_size = num_heads * head_size
+
scale_input : S
+
scale of the input
+
weight : Q
+
2D input tensor with shape (hidden_size, 3 * hidden_size)
+
scale_weight : S
+
scale of the weight
+
bias : S
+
1D input tensor with shape (3 * hidden_size), fp32 only currently.
+
scale_bias : S
+
Reserved (not used, because adding the bias needs float values in cublasLt for normal order).
+
scale_qkv_gemm : S
+
scale of the output for fused kqv gemm
+
mask : F
+
Attention mask with shape (batch_size, sequence_length)
+
global_weight : Q
+
2D input tensor with shape (hidden_size, 3 * hidden_size)
+
scale_global_weight : S
+
scale of the global_weight
+
global_bias : S
+
1D input tensor with shape (3 * hidden_size)
+
scale_global_gemm : S
+
scale of the global_qkv_gemm
+
global : G
+
Global attention flags with shape (batch_size, sequence_length)
+
scale_output : S
+
scale of the output
+
+ +#### Outputs + +
+
output : Q
+
3D output tensor with shape (batch_size, sequence_length, hidden_size)
+
+ +#### Type Constraints + +
+
Q : tensor(int8)
+
Constrain input and output types to int8 tensors.
+
S : tensor(float)
+
Constrain scales to float32 tensors.
+
G : tensor(int32)
+
Constrain to integer types
+
F : tensor(float16)
+
Be compatible with float version.
+
+ + +### **com.microsoft.QOrderedMatMul** + + Quantize (Int8) MatMul with order. Implement Y = alpha * A * B + bias + beta * C. Matrix A, B, C, Y are all int8 matrix. + Two type of order combination supported: + *) When order_B is ORDER_COL, order_A must be ORDER_ROW. + bias is vector of {#cols of Y} of float32, C should be batch 1/batch_A. B could be of batch 1 or batch_A. + Note B is reorder to ORDER_COL, or Transposed. Not Transposed first and then Reordered here. + *) When order_B is specify ORDER_COL4_4R2_8C or ORDER_COL32_2R_4R4, orderA must be ORDER_COL32. + MatMul will be implemented using alpha(A * B) + beta * C => Y. + bias is not supported here. B in fact is transposed first then reordered into ORDER_COL4_4R2_8C or ORDER_COL32_2R_4R4 here. + order_Y and order_C will be same as order_A. + Support per column quantized weight, ie, scale_B is 1-D vector of size [#cols of matrix B]. + +#### Version + +This version of the operator has been available since version 1 of the 'com.microsoft' operator set. + +#### Attributes + +
+
order_A : int (required)
+
cublasLt order of matrix A. See the schema of QuantizeWithOrder for order definition.
+
order_B : int (required)
+
cublasLt order of matrix B
+
order_Y : int (required)
+
cublasLt order of matrix Y and optional matrix C
#### Inputs (5 - 8) @@ -3041,19 +3281,19 @@ This version of the operator has been available since version 1 of the 'com.micr
A : Q
3-dimensional matrix A
scale_A : S
-
scale of the input A
+
scale of the input A.
B : Q
-
2-dimensional matrix B
+
2-dimensional matrix B. Transposed if order_B is ORDER_COL.
scale_B : S
-
scale of the input B
+
scale of the input B. Scalar or 1-D float32.
scale_Y : S
-
scale of the output Y
+
scale of the output Y.
bias (optional) : S
-
1d bias
+
1d bias, not scaled with scale_Y.
C (optional) : Q
3d or 2d matrix C. if 2d expand to 3d first. Shape[0] should be 1 or same as A.shape[0]
scale_C (optional) : S
-
scale of the input A
+
scale of the input C.
#### Outputs @@ -3076,6 +3316,7 @@ This version of the operator has been available since version 1 of the 'com.micr ### **com.microsoft.QuantizeBFP** The BFP quantization operator. It consumes a full precision tensor and computes an BFP tensor. + More documentation on the BFP format can be found in this paper: https://www.microsoft.com/en-us/research/publication/pushing-the-limits-of-narrow-precision-inferencing-at-cloud-scale-with-microsoft-floating-point/ #### Version @@ -3086,8 +3327,8 @@ This version of the operator has been available since version 1 of the 'com.micr
bfp_type : int (required)
The type of BFP - must match with the BFPType enum
-
block_dims : list of ints
-
Numbers within a bounding box will span across these dimensions.Any dimension not in this list is the same for all numbers within a bounding box.As an example, consider a 2D tensor with shape [d0, d1] and block_dims equal to [1].Within a bounding box, all elements will be within the same row but will be from different columnns.The default is the last dimension.
+
block_dim : int
+
Each bounding box spans this dimension. Typically, the block dimension corresponds to the reduction dimension of the matrix multiplication that consumes the output of this operator. For example, for a 2D matrix multiplication A@W, QuantizeBFP(A) would use block_dim 1 and QuantizeBFP(W) would use block_dim 0. The default is the last dimension.
#### Inputs @@ -3166,6 +3407,51 @@ This version of the operator has been available since version 1 of the 'com.micr +### **com.microsoft.QuantizeWithOrder** + + Quantize input matrix to specific layout used in cublaslt. + +#### Version + +This version of the operator has been available since version 1 of the 'com.microsoft' operator set. + +#### Attributes + +
+
order_input : int (required)
+
cublasLt order of input matrix. ORDER_COL = 0, ORDER_ROW = 1, ORDER_COL32 = 2, ORDER_COL4_4R2_8C = 3, ORDER_COL32_2R_4R4 = 4. Please refer to https://docs.nvidia.com/cuda/cublas/index.html#cublasLtOrder_t for their meaning.
+
order_output : int (required)
+
cublasLt order of output matrix.
+
+ +#### Inputs + +
+
input : F
+
TODO: input tensor of (ROWS, COLS). If less than 2D, it is broadcast to (1, X). If 3D, it is treated as (B, ROWS, COLS).
+
scale_input : S
+
scale of the input
+
+ +#### Outputs + +
+
output : Q
+
output tensor
+
+ +#### Type Constraints + +
+
Q : tensor(int8)
+
Constrain input and output types to int8 tensors.
+
F : tensor(float16), tensor(float)
+
Constrain to float types
+
S : tensor(float)
+
Constrain Scale to float32 types
+
+ + ### **com.microsoft.Range** Creates a sequence of numbers that begins at `start` and extends by increments of `delta` diff --git a/onnxruntime/core/graph/contrib_ops/quantization_defs.cc b/onnxruntime/core/graph/contrib_ops/quantization_defs.cc index 71e6a0ee65..62c0d6da3c 100644 --- a/onnxruntime/core/graph/contrib_ops/quantization_defs.cc +++ b/onnxruntime/core/graph/contrib_ops/quantization_defs.cc @@ -211,19 +211,21 @@ ONNX_MS_OPERATOR_SET_SCHEMA(DequantizeLinear, 1, })); static const char* QuantizeBFP_ver1_doc = R"DOC( -The BFP quantization operator. It consumes a full precision tensor and computes an BFP tensor.)DOC"; +The BFP quantization operator. It consumes a full precision tensor and computes an BFP tensor. +More documentation on the BFP format can be found in this paper: https://www.microsoft.com/en-us/research/publication/pushing-the-limits-of-narrow-precision-inferencing-at-cloud-scale-with-microsoft-floating-point/)DOC"; ONNX_MS_OPERATOR_SET_SCHEMA( QuantizeBFP, 1, OpSchema() .Attr("bfp_type", "The type of BFP - must match with the BFPType enum", AttributeProto::INT) - .Attr("block_dims", - "Numbers within a bounding box will span across these dimensions." - "Any dimension not in this list is the same for all numbers within a bounding box." - "As an example, consider a 2D tensor with shape [d0, d1] and block_dims equal to [1]." - "Within a bounding box, all elements will be within the same row but will be from different columnns." + .Attr("block_dim", + "Each bounding box spans this dimension." + "Typically, the block dimension corresponds to the reduction dimension of the matrix multipication that " + "consumes the output of this operator." + "For example, for a 2D matrix multiplication A@W, QuantizeBFP(A) would use block_dim 1 and " + "QuantizeBFP(W) would use block_dim 0." 
"The default is the last dimension.", - AttributeProto::INTS, std::vector<int64_t>{-1}) + AttributeProto::INT, static_cast<int64_t>(-1)) .Input(0, "x", "N-D full precision input tensor to be quantized.", "T1") .Output(0, "y", "1-D, contiguous BFP data", "T2") .Output(1, "shape", "Shape of x", "T3") @@ -254,19 +256,22 @@ ONNX_MS_OPERATOR_SET_SCHEMA( })); static const char* DequantizeBFP_ver1_doc = R"DOC( -The BFP dequantization operator. It consumes the raw BFP data and some metadata such as the shape and strides of the original tensor and computes the dequantized tensor.)DOC"; +The BFP dequantization operator. +It consumes the raw BFP data and some metadata such as the shape and strides of the original tensor and computes the dequantized tensor. +More documentation on the BFP format can be found in this paper: https://www.microsoft.com/en-us/research/publication/pushing-the-limits-of-narrow-precision-inferencing-at-cloud-scale-with-microsoft-floating-point/)DOC"; ONNX_MS_OPERATOR_SET_SCHEMA( DequantizeBFP, 1, OpSchema() .Attr("bfp_type", "The type of BFP - must match with the BFPType enum", AttributeProto::INT) - .Attr("block_dims", - "Numbers within a bounding box will span across these dimensions." - "Any dimension not in this list is the same for all numbers within a bounding box." - "As an example, consider a 2D tensor with shape [d0, d1] and block_dims equal to [1]." - "Within a bounding box, all elements will be within the same row but will be from different columnns." + .Attr("block_dim", + "Each bounding box spans this dimension." + "Typically, the block dimension corresponds to the reduction dimension of the matrix multipication that " + "consumes the output of this operator." + "For example, for a 2D matrix multiplication A@W, QuantizeBFP(A) would use block_dim 1 and " + "QuantizeBFP(W) would use block_dim 0." 
"The default is the last dimension.", - AttributeProto::INTS, std::vector<int64_t>{-1}) + AttributeProto::INT, static_cast<int64_t>(-1)) .Attr("dtype", "The datatype to dequantize to.", AttributeProto::INT, static_cast<int64_t>(ONNX_NAMESPACE::TensorProto_DataType::TensorProto_DataType_FLOAT)) // default .Input(0, "x", "1-D, contiguous, raw, BFP data to be de-quantized.", "T1") @@ -975,51 +980,61 @@ ONNX_MS_OPERATOR_SET_SCHEMA( .TypeConstraint("T", {"tensor(float)"}, "Constrain input and output types to float32 tensors.") .TypeAndShapeInferenceFunction(EmbedLayerNormalizationShapeInference)); - ONNX_MS_OPERATOR_SET_SCHEMA( - QuantizeWithOrder, - 1, - OpSchema() - .SetDoc(R"DOC(Quantize input matrix to specific layout used in cublaslt.)DOC") - .Attr("order_input", - "cublasLt order of input matrix. ORDER_COL = 0, ORDER_ROW = 1, ORDER_COL32 = 2, ORDER_COL4_4R2_8C = 3, ORDER_COL32_2R_4R4 = 4. " - "Please refer https://docs.nvidia.com/cuda/cublas/index.html#cublasLtOrder_t for their meaning.", - AttributeProto::INT) - .Attr("order_output", "cublasLt order of output matrix.", AttributeProto::INT) - .Input(0, "input", "TODO: input tensor of (ROWS, COLS). if less than 2d, will broadcast to (1, X). 
If 3d, it is treated as (B, ROWS, COS)", "F") - .Input(1, "scale_input", "scale of the input", "S") - .Output(0, "output", "output tensor", "Q") - .TypeConstraint("Q", {"tensor(int8)"}, "Constrain input and output types to int8 tensors.") - .TypeConstraint("F", {"tensor(float16)", "tensor(float)"}, "Constrain to float types") - .TypeConstraint("S", {"tensor(float)"}, "Constrain Scale to float32 types") - .TypeAndShapeInferenceFunction([](ONNX_NAMESPACE::InferenceContext& ctx) { - propagateElemTypeFromDtypeToOutput(ctx, ONNX_NAMESPACE::TensorProto::INT8, 0); - if (!hasInputShape(ctx, 0)) return; - auto& input_shape = getInputShape(ctx, 0); - updateOutputShape(ctx, 0, input_shape); - })); +ONNX_MS_OPERATOR_SET_SCHEMA( + QuantizeWithOrder, 1, + OpSchema() + .SetDoc(R"DOC(Quantize input matrix to specific layout used in cublaslt.)DOC") + .Attr("order_input", + "cublasLt order of input matrix. ORDER_COL = 0, ORDER_ROW = 1, ORDER_COL32 = 2, ORDER_COL4_4R2_8C = 3, " + "ORDER_COL32_2R_4R4 = 4. " + "Please refer https://docs.nvidia.com/cuda/cublas/index.html#cublasLtOrder_t for their meaning.", + AttributeProto::INT) + .Attr("order_output", "cublasLt order of output matrix.", AttributeProto::INT) + .Input(0, "input", + "TODO: input tensor of (ROWS, COLS). if less than 2d, will broadcast to (1, X). 
If 3d, it is treated as " + "(B, ROWS, COS)", + "F") + .Input(1, "scale_input", "scale of the input", "S") + .Output(0, "output", "output tensor", "Q") + .TypeConstraint("Q", {"tensor(int8)"}, "Constrain input and output types to int8 tensors.") + .TypeConstraint("F", {"tensor(float16)", "tensor(float)"}, "Constrain to float types") + .TypeConstraint("S", {"tensor(float)"}, "Constrain Scale to float32 types") + .TypeAndShapeInferenceFunction([](ONNX_NAMESPACE::InferenceContext& ctx) { + propagateElemTypeFromDtypeToOutput(ctx, ONNX_NAMESPACE::TensorProto::INT8, 0); + if (!hasInputShape(ctx, 0)) return; + auto& input_shape = getInputShape(ctx, 0); + updateOutputShape(ctx, 0, input_shape); + })); - ONNX_MS_OPERATOR_SET_SCHEMA( - DequantizeWithOrder, - 1, - OpSchema() - .SetDoc(R"DOC(Dequantize input matrix to specific layout used in cublaslt. attr to specify output type, float16 or float32)DOC") - .Attr("order_input", "cublasLt order of input matrix. See the schema of QuantizeWithOrder for order definition.", AttributeProto::INT) - .Attr("order_output", "cublasLt order of output matrix", AttributeProto::INT) - .Attr("to", "The output data type, only support TensorProto_DataType_FLOAT (1) and TensorProto_DataType_FLOAT16 (10)", AttributeProto::INT) - .Input(0, "input", "TODO: input tensor of (ROWS, COLS). if less than 2d, will broadcast to (1, X). 
If 3d, it is treated as (B, ROWS, COS)", "Q") - .Input(1, "scale_input", "scale of the input", "S") - .Output(0, "output", "output tensor", "F") - .TypeConstraint("Q", {"tensor(int8)"}, "Constrain input and output types to int8 tensors.") - .TypeConstraint("F", {"tensor(float16)", "tensor(float)"}, "Constrain to float types") - .TypeConstraint("S", {"tensor(float)"}, "Constrain Scale to float32 types") - .TypeAndShapeInferenceFunction([](ONNX_NAMESPACE::InferenceContext& ctx) { - propagateElemTypeFromAttributeToOutput(ctx, "to", 0); - if (!hasInputShape(ctx, 0)) return; - auto& input_shape = getInputShape(ctx, 0); - updateOutputShape(ctx, 0, input_shape); - })); +ONNX_MS_OPERATOR_SET_SCHEMA( + DequantizeWithOrder, 1, + OpSchema() + .SetDoc( + R"DOC(Dequantize input matrix to specific layout used in cublaslt. attr to specify output type, float16 or float32)DOC") + .Attr("order_input", + "cublasLt order of input matrix. See the schema of QuantizeWithOrder for order definition.", + AttributeProto::INT) + .Attr("order_output", "cublasLt order of output matrix", AttributeProto::INT) + .Attr("to", + "The output data type, only support TensorProto_DataType_FLOAT (1) and TensorProto_DataType_FLOAT16 (10)", + AttributeProto::INT) + .Input(0, "input", + "TODO: input tensor of (ROWS, COLS). if less than 2d, will broadcast to (1, X). 
If 3d, it is treated as " + "(B, ROWS, COS)", + "Q") + .Input(1, "scale_input", "scale of the input", "S") + .Output(0, "output", "output tensor", "F") + .TypeConstraint("Q", {"tensor(int8)"}, "Constrain input and output types to int8 tensors.") + .TypeConstraint("F", {"tensor(float16)", "tensor(float)"}, "Constrain to float types") + .TypeConstraint("S", {"tensor(float)"}, "Constrain Scale to float32 types") + .TypeAndShapeInferenceFunction([](ONNX_NAMESPACE::InferenceContext& ctx) { + propagateElemTypeFromAttributeToOutput(ctx, "to", 0); + if (!hasInputShape(ctx, 0)) return; + auto& input_shape = getInputShape(ctx, 0); + updateOutputShape(ctx, 0, input_shape); + })); - constexpr const char* QOrderedMatMul_ver1_doc = R"DOC( +constexpr const char* QOrderedMatMul_ver1_doc = R"DOC( Quantize (Int8) MatMul with order. Implement Y = alpha * A * B + bias + beta * C. Matrix A, B, C, Y are all int8 matrix. Two type of order combination supported: *) When order_B is ORDER_COL, order_A must be ORDER_ROW. @@ -1032,31 +1047,32 @@ order_Y and order_C will be same as order_A. Support per column quantized weight, ie, scale_B is 1-D vector of size [#cols of matrix B]. )DOC"; - ONNX_MS_OPERATOR_SET_SCHEMA( - QOrderedMatMul, - 1, - OpSchema() - .SetDoc(QOrderedMatMul_ver1_doc) - .Attr("order_A", "cublasLt order of matrix A. See the schema of QuantizeWithOrder for order definition.", AttributeProto::INT) - .Attr("order_B", "cublasLt order of matrix B", AttributeProto::INT) - .Attr("order_Y", "cublasLt order of matrix Y and optional matrix C", AttributeProto::INT) - .Input(0, "A", "3-dimensional matrix A", "Q") - .Input(1, "scale_A", "scale of the input A.", "S") - .Input(2, "B", "2-dimensional matrix B. Transposed if order_B is ORDER_COL.", "Q") - .Input(3, "scale_B", "scale of the input B. 
Scalar or 1-D float32.", "S") - .Input(4, "scale_Y", "scale of the output Y.", "S") - .Input(5, "bias", "1d bias, not scaled with scale_Y.", "S", OpSchema::Optional) - .Input(6, "C", "3d or 2d matrix C. if 2d expand to 3d first. Shape[0] should be 1 or same as A.shape[0] ", "Q", OpSchema::Optional) - .Input(7, "scale_C", "scale of the input A.", "S", OpSchema::Optional) - .Output(0, "Y", "Matrix multiply results from A * B", "Q") - .TypeConstraint("Q", {"tensor(int8)"}, "Constrain input and output types to int8 tensors.") - .TypeConstraint("S", {"tensor(float)"}, "Constrain bias and scales to float32") - .TypeAndShapeInferenceFunction([](ONNX_NAMESPACE::InferenceContext& ctx) { - propagateElemTypeFromInputToOutput(ctx, 0, 0); - ONNX_NAMESPACE::matmulShapeInference(ctx, 0, 2); - })); +ONNX_MS_OPERATOR_SET_SCHEMA( + QOrderedMatMul, 1, + OpSchema() + .SetDoc(QOrderedMatMul_ver1_doc) + .Attr("order_A", "cublasLt order of matrix A. See the schema of QuantizeWithOrder for order definition.", + AttributeProto::INT) + .Attr("order_B", "cublasLt order of matrix B", AttributeProto::INT) + .Attr("order_Y", "cublasLt order of matrix Y and optional matrix C", AttributeProto::INT) + .Input(0, "A", "3-dimensional matrix A", "Q") + .Input(1, "scale_A", "scale of the input A.", "S") + .Input(2, "B", "2-dimensional matrix B. Transposed if order_B is ORDER_COL.", "Q") + .Input(3, "scale_B", "scale of the input B. Scalar or 1-D float32.", "S") + .Input(4, "scale_Y", "scale of the output Y.", "S") + .Input(5, "bias", "1d bias, not scaled with scale_Y.", "S", OpSchema::Optional) + .Input(6, "C", "3d or 2d matrix C. if 2d expand to 3d first. 
Shape[0] should be 1 or same as A.shape[0] ", "Q",
+                                       OpSchema::Optional)
+                                .Input(7, "scale_C", "scale of the input A.", "S", OpSchema::Optional)
+                                .Output(0, "Y", "Matrix multiply results from A * B", "Q")
+                                .TypeConstraint("Q", {"tensor(int8)"}, "Constrain input and output types to int8 tensors.")
+                                .TypeConstraint("S", {"tensor(float)"}, "Constrain bias and scales to float32")
+                                .TypeAndShapeInferenceFunction([](ONNX_NAMESPACE::InferenceContext& ctx) {
+                                  propagateElemTypeFromInputToOutput(ctx, 0, 0);
+                                  ONNX_NAMESPACE::matmulShapeInference(ctx, 0, 2);
+                                }));
-  static const char* Attention_QOrdered_doc = R"DOC(
+static const char* Attention_QOrdered_doc = R"DOC(
 Quantized version of simplified Multi-Head Self Attention(using int8 with specific matrix Layout).
 Multi-Head Self Attention that can be either unidirectional (like GPT-2) or bidirectional (like BERT).
 The mask_index input is optional. Besides raw attention mask with shape (batch_size, past_sequence_length + sequence_length)
@@ -1070,128 +1086,159 @@ Current version does not support past/present, extra_add and qkv_hidden_sizes.
 TODO: Support them if needed in the future.
 )DOC";
-  ONNX_MS_OPERATOR_SET_SCHEMA(
-      QOrderedAttention,
-      1,
-      OpSchema()
-          .SetDoc(Attention_QOrdered_doc)
-          .Attr("num_heads", "Number of attention heads", AttributeProto::INT)
-          .Attr("unidirectional", "Whether every token can only attend to previous tokens. Default value is 0.", AttributeProto::INT, static_cast<int64_t>(0))
-          .Attr("qkv_hidden_sizes", "Hidden layer sizes of Q, K, V paths in Attention", AttributeProto::INTS, OPTIONAL_VALUE)
-          .Attr("order_input", "cublasLt order of input matrix. See the schema of QuantizeWithOrder for order definition.", AttributeProto::INT)
-          .Attr("order_weight", "cublasLt order of weight matrix", AttributeProto::INT)
-          .Attr("order_output", "cublasLt order of global bias", AttributeProto::INT)
-          .Input(0, "input", "3D input tensor with shape (batch_size, sequence_length, input_hidden_size)", "Q")
-          .Input(1, "scale_input", "scale of the input, scalar value (per tensor) currently.", "S")
-          .Input(2, "scale_Q_gemm", "scale of the gemm - scalar (per-tensor quantization)", "S")
-          .Input(3, "scale_K_gemm", "scale of the gemm - scalar (per-tensor quantization)", "S")
-          .Input(4, "scale_V_gemm", "scale of the gemm - scalar (per-tensor quantization)", "S")
-          .Input(5, "Q_weight", "2D input tensor with shape (input_hidden_size, hidden_size), where hidden_size = num_heads * head_size", "Q")
-          .Input(6, "K_weight", "2D input tensor with shape (input_hidden_size, hidden_size), where hidden_size = num_heads * head_size", "Q")
-          .Input(7, "V_weight", "2D input tensor with shape (input_hidden_size, hidden_size), where hidden_size = num_heads * head_size", "Q")
-          .Input(8, "scale_Q_weight", "scale of the weight (scalar for per-tensor quantization or 1-D of dims [hidden_size] for per-channel quantization)", "S")
-          .Input(9, "scale_K_weight", "scale of the weight (scalar for per-tensor quantization or 1-D of dims [hidden_size] for per-channel quantization)", "S")
-          .Input(10, "scale_V_weight", "scale of the weight (scalar for per-tensor quantization or 1-D of dims [hidden_size] for per-channel quantization)", "S")
-          .Input(11, "Q_bias", "1D input tensor with shape (hidden_size)", "S")
-          .Input(12, "K_bias", "1D input tensor with shape (hidden_size)", "S")
-          .Input(13, "V_bias", "1D input tensor with shape (hidden_size)", "S")
-          .Input(14, "scale_QKT_gemm", "scale of the gemm - scalar (per-tensor quantization)", "S", OpSchema::Optional)
-          .Input(15, "scale_QKT_softmax", "scale of the softmax result - scalar (per-tensor quantization)", "S", OpSchema::Optional)
-          .Input(16, "scale_values_gemm", "scale of the gemm - scalar (per-tensor quantization). Also this is the output scale for the operator.", "S")
-          .Input(17, "mask_index",
-                 "Attention mask with shape (batch_size, 1, max_sequence_length, max_sequence_length), (batch_size, past_sequence_length + sequence_length)"
-                 "or (batch_size, sequence_length, past_sequence_length + sequence_length), or index with shape (batch_size) or (2 * batch_size).",
-                 "G", OpSchema::Optional)
-          .Input(18, "past", "past state for key and value with shape (2, batch_size, num_heads, past_sequence_length, head_size).", "Q", OpSchema::Optional)
-          .Input(19, "extra_add", "additional add to QxK' with shape (batch_size, num_heads, sequence_length, sequence_length).", "S", OpSchema::Optional)
-          .Output(0, "output", "3D output tensor with shape (batch_size, sequence_length, hidden_size)", "Q")
-          .TypeConstraint("Q", {"tensor(int8)"}, "Constrain input and output types to int8 tensors.")
-          .TypeConstraint("S", {"tensor(float)"}, "Constrain scales to float32 tensors.")
-          .TypeConstraint("G", {"tensor(int32)"}, "Constrain to integer types")
-          .TypeAndShapeInferenceFunction(ONNX_NAMESPACE::propagateShapeAndTypeFromFirstInput));
+ONNX_MS_OPERATOR_SET_SCHEMA(
+    QOrderedAttention, 1,
+    OpSchema()
+        .SetDoc(Attention_QOrdered_doc)
+        .Attr("num_heads", "Number of attention heads", AttributeProto::INT)
+        .Attr("unidirectional", "Whether every token can only attend to previous tokens. Default value is 0.",
+              AttributeProto::INT, static_cast<int64_t>(0))
+        .Attr("qkv_hidden_sizes", "Hidden layer sizes of Q, K, V paths in Attention", AttributeProto::INTS,
+              OPTIONAL_VALUE)
+        .Attr("order_input",
+              "cublasLt order of input matrix. See the schema of QuantizeWithOrder for order definition.",
+              AttributeProto::INT)
+        .Attr("order_weight", "cublasLt order of weight matrix", AttributeProto::INT)
+        .Attr("order_output", "cublasLt order of global bias", AttributeProto::INT)
+        .Input(0, "input", "3D input tensor with shape (batch_size, sequence_length, input_hidden_size)", "Q")
+        .Input(1, "scale_input", "scale of the input, scalar value (per tensor) currently.", "S")
+        .Input(2, "scale_Q_gemm", "scale of the gemm - scalar (per-tensor quantization)", "S")
+        .Input(3, "scale_K_gemm", "scale of the gemm - scalar (per-tensor quantization)", "S")
+        .Input(4, "scale_V_gemm", "scale of the gemm - scalar (per-tensor quantization)", "S")
+        .Input(5, "Q_weight",
+               "2D input tensor with shape (input_hidden_size, hidden_size), where hidden_size = num_heads * head_size",
+               "Q")
+        .Input(6, "K_weight",
+               "2D input tensor with shape (input_hidden_size, hidden_size), where hidden_size = num_heads * head_size",
+               "Q")
+        .Input(7, "V_weight",
+               "2D input tensor with shape (input_hidden_size, hidden_size), where hidden_size = num_heads * head_size",
+               "Q")
+        .Input(8, "scale_Q_weight",
+               "scale of the weight (scalar for per-tensor quantization or 1-D of dims [hidden_size] for per-channel "
+               "quantization)",
+               "S")
+        .Input(9, "scale_K_weight",
+               "scale of the weight (scalar for per-tensor quantization or 1-D of dims [hidden_size] for per-channel "
+               "quantization)",
+               "S")
+        .Input(10, "scale_V_weight",
+               "scale of the weight (scalar for per-tensor quantization or 1-D of dims [hidden_size] for per-channel "
+               "quantization)",
+               "S")
+        .Input(11, "Q_bias", "1D input tensor with shape (hidden_size)", "S")
+        .Input(12, "K_bias", "1D input tensor with shape (hidden_size)", "S")
+        .Input(13, "V_bias", "1D input tensor with shape (hidden_size)", "S")
+        .Input(14, "scale_QKT_gemm", "scale of the gemm - scalar (per-tensor quantization)", "S", OpSchema::Optional)
+        .Input(15, "scale_QKT_softmax", "scale of the softmax result - scalar (per-tensor quantization)", "S",
+               OpSchema::Optional)
+        .Input(16, "scale_values_gemm",
+               "scale of the gemm - scalar (per-tensor quantization). Also this is the output scale for the operator.",
+               "S")
+        .Input(17, "mask_index",
+               "Attention mask with shape (batch_size, 1, max_sequence_length, max_sequence_length), (batch_size, "
+               "past_sequence_length + sequence_length)"
+               "or (batch_size, sequence_length, past_sequence_length + sequence_length), or index with shape "
+               "(batch_size) or (2 * batch_size).",
+               "G", OpSchema::Optional)
+        .Input(18, "past",
+               "past state for key and value with shape (2, batch_size, num_heads, past_sequence_length, head_size).",
+               "Q", OpSchema::Optional)
+        .Input(19, "extra_add",
+               "additional add to QxK' with shape (batch_size, num_heads, sequence_length, sequence_length).", "S",
+               OpSchema::Optional)
+        .Output(0, "output", "3D output tensor with shape (batch_size, sequence_length, hidden_size)", "Q")
+        .TypeConstraint("Q", {"tensor(int8)"}, "Constrain input and output types to int8 tensors.")
+        .TypeConstraint("S", {"tensor(float)"}, "Constrain scales to float32 tensors.")
+        .TypeConstraint("G", {"tensor(int32)"}, "Constrain to integer types")
+        .TypeAndShapeInferenceFunction(ONNX_NAMESPACE::propagateShapeAndTypeFromFirstInput));
-  ONNX_MS_OPERATOR_SET_SCHEMA(
-      QOrderedLayerNormalization,
-      1,
-      OpSchema()
-          .SetDoc("QOrderedLayerNormalization")
-          .Attr("axis",
-                "The first normalization dimension: normalization "
-                "will be performed along dimensions axis "
-                ": rank(inputs).",
-                AttributeProto::INT,
-                static_cast<int64_t>(-1))
-          .Attr("epsilon", "The epsilon value to use to avoid division by zero.",
-                AttributeProto::FLOAT, 1e-5f)
-          .Attr("order_X", "cublasLt order of input X. Default is ROW MAJOR. See the schema of QuantizeWithOrder for order definition.",
-                AttributeProto::INT, static_cast<int64_t>(1))
-          .Attr("order_Y", "cublasLt order of matrix Y, must be same as order_X. Default is ROW MAJOR.",
-                AttributeProto::INT, static_cast<int64_t>(1))
-          .AllowUncheckedAttributes()
-          .Input(0, "X", "Input data tensor from the previous layer.", "Q")
-          .Input(1, "scale_X", "scale of the quantized X", "S")
-          .Input(2, "scale", "Scale tensor, i.e., gamma vector.", "F")
-          .Input(3, "B", "Bias tensor.", "F", OpSchema::Optional)
-          .Input(4, "scale_Y", "scale of the quantized X", "S")
-          .Output(0, "Y", "Output data tensor.", "Q")
-          .TypeConstraint("F", {"tensor(float16)", "tensor(float)"},
-                          "Constrain input gamma and bias could be float16/float tensors. "
-                          "float may get better precision, float16 runs faster.")
-          .TypeConstraint("S", {"tensor(float)"}, "quantization scale must be float tensors.")
-          .TypeConstraint("Q", {"tensor(int8)"}, "quantization tensor must be int8 tensors.")
-          .TypeAndShapeInferenceFunction([](ONNX_NAMESPACE::InferenceContext& ctx) {
-            propagateShapeAndTypeFromFirstInput(ctx);
-            propagateElemTypeFromInputToOutput(ctx, 0, 0);
-          }));
+ONNX_MS_OPERATOR_SET_SCHEMA(QOrderedLayerNormalization, 1,
+                            OpSchema()
+                                .SetDoc("QOrderedLayerNormalization")
+                                .Attr("axis",
+                                      "The first normalization dimension: normalization "
+                                      "will be performed along dimensions axis "
+                                      ": rank(inputs).",
+                                      AttributeProto::INT, static_cast<int64_t>(-1))
+                                .Attr("epsilon", "The epsilon value to use to avoid division by zero.",
+                                      AttributeProto::FLOAT, 1e-5f)
+                                .Attr("order_X",
+                                      "cublasLt order of input X. Default is ROW MAJOR. See the schema of "
+                                      "QuantizeWithOrder for order definition.",
+                                      AttributeProto::INT, static_cast<int64_t>(1))
+                                .Attr("order_Y",
+                                      "cublasLt order of matrix Y, must be same as order_X. Default is ROW MAJOR.",
+                                      AttributeProto::INT, static_cast<int64_t>(1))
+                                .AllowUncheckedAttributes()
+                                .Input(0, "X", "Input data tensor from the previous layer.", "Q")
+                                .Input(1, "scale_X", "scale of the quantized X", "S")
+                                .Input(2, "scale", "Scale tensor, i.e., gamma vector.", "F")
+                                .Input(3, "B", "Bias tensor.", "F", OpSchema::Optional)
+                                .Input(4, "scale_Y", "scale of the quantized X", "S")
+                                .Output(0, "Y", "Output data tensor.", "Q")
+                                .TypeConstraint("F", {"tensor(float16)", "tensor(float)"},
+                                                "Constrain input gamma and bias could be float16/float tensors. "
+                                                "float may get better precision, float16 runs faster.")
+                                .TypeConstraint("S", {"tensor(float)"}, "quantization scale must be float tensors.")
+                                .TypeConstraint("Q", {"tensor(int8)"}, "quantization tensor must be int8 tensors.")
+                                .TypeAndShapeInferenceFunction([](ONNX_NAMESPACE::InferenceContext& ctx) {
+                                  propagateShapeAndTypeFromFirstInput(ctx);
+                                  propagateElemTypeFromInputToOutput(ctx, 0, 0);
+                                }));
-  ONNX_MS_OPERATOR_SET_SCHEMA(
-      QOrderedGelu,
-      1,
-      OpSchema()
-          .SetDoc(R"DOC(Ordered Quantize Gelu.)DOC")
-          .Attr("order_X", "cublasLt order of input X. Optional. See the schema of QuantizeWithOrder for order definition.",
-                AttributeProto::INT, OPTIONAL_VALUE)
-          .Attr("order_Y", "cublasLt order of matrix Y, must be same as order_X if specified together. Optional.",
-                AttributeProto::INT, OPTIONAL_VALUE)
-          .Input(0, "X", "N-dimensional input A", "Q")
-          .Input(1, "scale_X", "scale of the input A", "S")
-          .Input(2, "scale_Y", "scale of the output Y", "S")
-          .Output(0, "Y", "Output of the Gelu", "Q")
-          .TypeConstraint("Q", {"tensor(int8)"}, "Constrain input and output types to int8 tensors.")
-          .TypeConstraint("S", {"tensor(float)"}, "Constrain scales to float32")
-          .TypeAndShapeInferenceFunction(ONNX_NAMESPACE::propagateShapeAndTypeFromFirstInput));
+ONNX_MS_OPERATOR_SET_SCHEMA(
+    QOrderedGelu, 1,
+    OpSchema()
+        .SetDoc(R"DOC(Ordered Quantize Gelu.)DOC")
+        .Attr("order_X",
+              "cublasLt order of input X. Optional. See the schema of QuantizeWithOrder for order definition.",
+              AttributeProto::INT, OPTIONAL_VALUE)
+        .Attr("order_Y", "cublasLt order of matrix Y, must be same as order_X if specified together. Optional.",
+              AttributeProto::INT, OPTIONAL_VALUE)
+        .Input(0, "X", "N-dimensional input A", "Q")
+        .Input(1, "scale_X", "scale of the input A", "S")
+        .Input(2, "scale_Y", "scale of the output Y", "S")
+        .Output(0, "Y", "Output of the Gelu", "Q")
+        .TypeConstraint("Q", {"tensor(int8)"}, "Constrain input and output types to int8 tensors.")
+        .TypeConstraint("S", {"tensor(float)"}, "Constrain scales to float32")
+        .TypeAndShapeInferenceFunction(ONNX_NAMESPACE::propagateShapeAndTypeFromFirstInput));
-  ONNX_MS_OPERATOR_SET_SCHEMA(
-      QOrderedLongformerAttention,
-      1,
-      OpSchema()
-          .SetDoc(R"DOC(Quantized version of Longformer Self Attention (using int8 with specific matrix Layout).)DOC")
-          .Attr("num_heads", "Number of attention heads", AttributeProto::INT)
-          .Attr("window", "One sided attention windows length W, or half of total window length", AttributeProto::INT)
-          .Attr("order_input", "cublasLt order of input matrix. See the schema of QuantizeWithOrder for order definition.", AttributeProto::INT)
-          .Attr("order_weight", "cublasLt order of weight matrix", AttributeProto::INT)
-          .Attr("order_global_weight", "cublasLt order of weight matrix", AttributeProto::INT)
-          .Attr("order_output", "cublasLt order of global bias", AttributeProto::INT)
-          .Input(0, "input", "3D input tensor with shape (batch_size, sequence_length, hidden_size), hidden_size = num_heads * head_size", "Q")
-          .Input(1, "scale_input", "scale of the input", "S")
-          .Input(2, "weight", "2D input tensor with shape (hidden_size, 3 * hidden_size)", "Q")
-          .Input(3, "scale_weight", "scale of the weight", "S")
-          .Input(4, "bias", "1D input tensor with shape (3 * hidden_size), fp32 only currently.", "S")
-          .Input(5, "scale_bias", "reserved. (not used as add bias need float value in cublasLt for normal order.)", "S")
-          .Input(6, "scale_qkv_gemm", "scale of the output for fused kqv gemm", "S")
-          .Input(7, "mask", "Attention mask with shape (batch_size, sequence_length)", "F")
-          .Input(8, "global_weight", "2D input tensor with shape (hidden_size, 3 * hidden_size)", "Q")
-          .Input(9, "scale_global_weight", "scale of the global_weight", "S")
-          .Input(10, "global_bias", "1D input tensor with shape (3 * hidden_size)", "S")
-          .Input(11, "scale_global_gemm", "scale of the global_qkv_gemm", "S")
-          .Input(12, "global", "Global attention flags with shape (batch_size, sequence_length)", "G")
-          .Input(13, "scale_output", "scale of the output", "S")
-          .Output(0, "output", "3D output tensor with shape (batch_size, sequence_length, hidden_size)", "Q")
-          .TypeConstraint("Q", {"tensor(int8)"}, "Constrain input and output types to int8 tensors.")
-          .TypeConstraint("S", {"tensor(float)"}, "Constrain scales to float32 tensors.")
-          .TypeConstraint("G", {"tensor(int32)"}, "Constrain to integer types")
-          .TypeConstraint("F", {"tensor(float16)"}, "Be compatible with float version.")
-          .TypeAndShapeInferenceFunction(ONNX_NAMESPACE::propagateShapeAndTypeFromFirstInput));
+ONNX_MS_OPERATOR_SET_SCHEMA(
+    QOrderedLongformerAttention, 1,
+    OpSchema()
+        .SetDoc(R"DOC(Quantized version of Longformer Self Attention (using int8 with specific matrix Layout).)DOC")
+        .Attr("num_heads", "Number of attention heads", AttributeProto::INT)
+        .Attr("window", "One sided attention windows length W, or half of total window length", AttributeProto::INT)
+        .Attr("order_input",
+              "cublasLt order of input matrix. See the schema of QuantizeWithOrder for order definition.",
+              AttributeProto::INT)
+        .Attr("order_weight", "cublasLt order of weight matrix", AttributeProto::INT)
+        .Attr("order_global_weight", "cublasLt order of weight matrix", AttributeProto::INT)
+        .Attr("order_output", "cublasLt order of global bias", AttributeProto::INT)
+        .Input(0, "input",
+               "3D input tensor with shape (batch_size, sequence_length, hidden_size), hidden_size = num_heads * "
+               "head_size",
+               "Q")
+        .Input(1, "scale_input", "scale of the input", "S")
+        .Input(2, "weight", "2D input tensor with shape (hidden_size, 3 * hidden_size)", "Q")
+        .Input(3, "scale_weight", "scale of the weight", "S")
+        .Input(4, "bias", "1D input tensor with shape (3 * hidden_size), fp32 only currently.", "S")
+        .Input(5, "scale_bias", "reserved. (not used as add bias need float value in cublasLt for normal order.)", "S")
+        .Input(6, "scale_qkv_gemm", "scale of the output for fused kqv gemm", "S")
+        .Input(7, "mask", "Attention mask with shape (batch_size, sequence_length)", "F")
+        .Input(8, "global_weight", "2D input tensor with shape (hidden_size, 3 * hidden_size)", "Q")
+        .Input(9, "scale_global_weight", "scale of the global_weight", "S")
+        .Input(10, "global_bias", "1D input tensor with shape (3 * hidden_size)", "S")
+        .Input(11, "scale_global_gemm", "scale of the global_qkv_gemm", "S")
+        .Input(12, "global", "Global attention flags with shape (batch_size, sequence_length)", "G")
+        .Input(13, "scale_output", "scale of the output", "S")
+        .Output(0, "output", "3D output tensor with shape (batch_size, sequence_length, hidden_size)", "Q")
+        .TypeConstraint("Q", {"tensor(int8)"}, "Constrain input and output types to int8 tensors.")
+        .TypeConstraint("S", {"tensor(float)"}, "Constrain scales to float32 tensors.")
+        .TypeConstraint("G", {"tensor(int32)"}, "Constrain to integer types")
+        .TypeConstraint("F", {"tensor(float16)"}, "Be compatible with float version.")
+        .TypeAndShapeInferenceFunction(ONNX_NAMESPACE::propagateShapeAndTypeFromFirstInput));
 } // namespace contrib
 } // namespace onnxruntime
diff --git a/onnxruntime/test/contrib_ops/quantize_bfp_test.cc b/onnxruntime/test/contrib_ops/quantize_bfp_test.cc
index e13a933313..8e259f6735 100644
--- a/onnxruntime/test/contrib_ops/quantize_bfp_test.cc
+++ b/onnxruntime/test/contrib_ops/quantize_bfp_test.cc
@@ -34,11 +34,11 @@ TEST(QuantizeBFPTest, CreateQuantizeGraph) {
   bfp_type.set_i(static_cast<int64_t>(onnxruntime::contrib::BFPType::BFP_1_8_8_16));
   bfp_type.set_type(ONNX_NAMESPACE::AttributeProto_AttributeType::AttributeProto_AttributeType_INT);
   attributes["bfp_type"] = bfp_type;
-  ONNX_NAMESPACE::AttributeProto block_dims;
-  block_dims.set_name("block_dims");
-  block_dims.add_ints(1); // bounding box is over dimension 1
-  block_dims.set_type(ONNX_NAMESPACE::AttributeProto_AttributeType::AttributeProto_AttributeType_INTS);
-  attributes["block_dims"] = block_dims;
+  ONNX_NAMESPACE::AttributeProto block_dim;
+  block_dim.set_name("block_dim");
+  block_dim.set_i(1); // bounding box is over dimension 1
+  block_dim.set_type(ONNX_NAMESPACE::AttributeProto_AttributeType::AttributeProto_AttributeType_INT);
+  attributes["block_dim"] = block_dim;
   std::vector output_defs;
   ONNX_NAMESPACE::TypeProto y_byte;
@@ -91,11 +91,11 @@ TEST(DequantizeBFPTest, CreateDequantizeGraph) {
   bfp_type.set_i(static_cast<int64_t>(onnxruntime::contrib::BFPType::BFP_1_8_8_16));
   bfp_type.set_type(ONNX_NAMESPACE::AttributeProto_AttributeType::AttributeProto_AttributeType_INT);
   attributes["bfp_type"] = bfp_type;
-  ONNX_NAMESPACE::AttributeProto block_dims;
-  block_dims.set_name("block_dims");
-  block_dims.add_ints(1); // bounding box is over dimension 1
-  block_dims.set_type(ONNX_NAMESPACE::AttributeProto_AttributeType::AttributeProto_AttributeType_INTS);
-  attributes["block_dims"] = block_dims;
+  ONNX_NAMESPACE::AttributeProto block_dim;
+  block_dim.set_name("block_dim");
+  block_dim.set_i(1); // bounding box is over dimension 1
+  block_dim.set_type(ONNX_NAMESPACE::AttributeProto_AttributeType::AttributeProto_AttributeType_INT);
+  attributes["block_dim"] = block_dim;
   ONNX_NAMESPACE::AttributeProto dtype;
   dtype.set_name("dtype");
   dtype.set_i(static_cast<int64_t>(ONNX_NAMESPACE::TensorProto_DataType_FLOAT));