diff --git a/docs/ContribOperators.md b/docs/ContribOperators.md
index 00cc18e058..9fba550e0a 100644
--- a/docs/ContribOperators.md
+++ b/docs/ContribOperators.md
@@ -20,6 +20,7 @@ Do not modify directly.*
* com.microsoft.DecoderAttention
* com.microsoft.DequantizeBFP
* com.microsoft.DequantizeLinear
+ * com.microsoft.DequantizeWithOrder
* com.microsoft.DynamicQuantizeLSTM
* com.microsoft.DynamicQuantizeMatMul
* com.microsoft.EmbedLayerNormalization
@@ -57,11 +58,14 @@ Do not modify directly.*
* com.microsoft.QLinearReduceMean
* com.microsoft.QLinearSigmoid
* com.microsoft.QLinearSoftmax
+ * com.microsoft.QOrderedAttention
* com.microsoft.QOrderedGelu
* com.microsoft.QOrderedLayerNormalization
+ * com.microsoft.QOrderedLongformerAttention
* com.microsoft.QOrderedMatMul
* com.microsoft.QuantizeBFP
* com.microsoft.QuantizeLinear
+ * com.microsoft.QuantizeWithOrder
* com.microsoft.Range
* com.microsoft.ReduceSumInteger
* com.microsoft.Rfft
@@ -989,7 +993,9 @@ This version of the operator has been available since version 1 of the 'com.micr
### **com.microsoft.DequantizeBFP**
- The BFP dequantization operator. It consumes the raw BFP data and some metadata such as the shape and strides of the original tensor and computes the dequantized tensor.
+ The BFP dequantization operator.
+ It consumes the raw BFP data and some metadata such as the shape and strides of the original tensor and computes the dequantized tensor.
+ More documentation on the BFP format can be found in this paper: https://www.microsoft.com/en-us/research/publication/pushing-the-limits-of-narrow-precision-inferencing-at-cloud-scale-with-microsoft-floating-point/
#### Version
@@ -1000,8 +1006,8 @@ This version of the operator has been available since version 1 of the 'com.micr
- bfp_type : int (required)
- The type of BFP - must match with the BFPType enum
-- block_dims : list of ints
-- Numbers within a bounding box will span across these dimensions.Any dimension not in this list is the same for all numbers within a bounding box.As an example, consider a 2D tensor with shape [d0, d1] and block_dims equal to [1].Within a bounding box, all elements will be within the same row but will be from different columnns.The default is the last dimension.
+- block_dim : int
+- Each bounding box spans this dimension. Typically, the block dimension corresponds to the reduction dimension of the matrix multiplication that consumes the output of this operator. For example, for a 2D matrix multiplication A@W, QuantizeBFP(A) would use block_dim 1 and QuantizeBFP(W) would use block_dim 0. The default is the last dimension.
- dtype : int
- The datatype to dequantize to.
@@ -1081,6 +1087,53 @@ This version of the operator has been available since version 1 of the 'com.micr
+### **com.microsoft.DequantizeWithOrder**
+
+ Dequantize the input matrix to a specific layout used in cublasLt. The `to` attribute specifies the output type, float16 or float32.
+
+#### Version
+
+This version of the operator has been available since version 1 of the 'com.microsoft' operator set.
+
+#### Attributes
+
+
+- order_input : int (required)
+- cublasLt order of input matrix. See the schema of QuantizeWithOrder for order definition.
+- order_output : int (required)
+- cublasLt order of output matrix
+- to : int (required)
+- The output data type; only TensorProto_DataType_FLOAT (1) and TensorProto_DataType_FLOAT16 (10) are supported
+
+
+#### Inputs
+
+
+- input : Q
+- TODO: input tensor of shape (ROWS, COLS). If it has fewer than 2 dimensions, it is broadcast to (1, X). If 3D, it is treated as (B, ROWS, COLS)
+- scale_input : S
+- scale of the input
+
+
+#### Outputs
+
+
+- output : F
+- output tensor
+
+
+#### Type Constraints
+
+
+- Q : tensor(int8)
+- Constrain input and output types to int8 tensors.
+- F : tensor(float16), tensor(float)
+- Constrain to float types
+- S : tensor(float)
+- Constrain Scale to float32 types
+
+
+
### **com.microsoft.DynamicQuantizeLSTM**
#### Version
@@ -2916,6 +2969,105 @@ This version of the operator has been available since version 1 of the 'com.micr
+### **com.microsoft.QOrderedAttention**
+
+ Quantized version of simplified Multi-Head Self Attention (using int8 with specific matrix Layout).
+ Multi-Head Self Attention that can be either unidirectional (like GPT-2) or bidirectional (like BERT).
+ The mask_index input is optional. Besides raw attention mask with shape (batch_size, past_sequence_length + sequence_length)
+ or (batch_size, sequence_length, past_sequence_length + sequence_length) with value 0 for masked and 1 otherwise,
+ two other formats are also supported: When input has right-side padding, mask_index is one-dimensional with shape (batch_size),
+ where the value of each element is the end position, or valid length of the actual sequence excluding padding. When input has
+ left-side padding, mask_index has shape (2 * batch_size), where the values are the exclusive end positions followed by
+ the inclusive start positions. When unidirectional is 1, each token only attends to previous tokens. For GPT-2, both past
+ and present state are optional. Present state could appear in output even when past state is not in input.
+ Current version does not support past/present, extra_add and qkv_hidden_sizes.
+ TODO: Support them if needed in the future.
+
+#### Version
+
+This version of the operator has been available since version 1 of the 'com.microsoft' operator set.
+
+#### Attributes
+
+
+- num_heads : int (required)
+- Number of attention heads
+- order_input : int (required)
+- cublasLt order of input matrix. See the schema of QuantizeWithOrder for order definition.
+- order_output : int (required)
+- cublasLt order of global bias
+- order_weight : int (required)
+- cublasLt order of weight matrix
+- qkv_hidden_sizes : list of ints
+- Hidden layer sizes of Q, K, V paths in Attention
+- unidirectional : int
+- Whether every token can only attend to previous tokens. Default value is 0.
+
+
+#### Inputs (17 - 20)
+
+
+- input : Q
+- 3D input tensor with shape (batch_size, sequence_length, input_hidden_size)
+- scale_input : S
+- scale of the input, scalar value (per tensor) currently.
+- scale_Q_gemm : S
+- scale of the gemm - scalar (per-tensor quantization)
+- scale_K_gemm : S
+- scale of the gemm - scalar (per-tensor quantization)
+- scale_V_gemm : S
+- scale of the gemm - scalar (per-tensor quantization)
+- Q_weight : Q
+- 2D input tensor with shape (input_hidden_size, hidden_size), where hidden_size = num_heads * head_size
+- K_weight : Q
+- 2D input tensor with shape (input_hidden_size, hidden_size), where hidden_size = num_heads * head_size
+- V_weight : Q
+- 2D input tensor with shape (input_hidden_size, hidden_size), where hidden_size = num_heads * head_size
+- scale_Q_weight : S
+- scale of the weight (scalar for per-tensor quantization or 1-D of dims [hidden_size] for per-channel quantization)
+- scale_K_weight : S
+- scale of the weight (scalar for per-tensor quantization or 1-D of dims [hidden_size] for per-channel quantization)
+- scale_V_weight : S
+- scale of the weight (scalar for per-tensor quantization or 1-D of dims [hidden_size] for per-channel quantization)
+- Q_bias : S
+- 1D input tensor with shape (hidden_size)
+- K_bias : S
+- 1D input tensor with shape (hidden_size)
+- V_bias : S
+- 1D input tensor with shape (hidden_size)
+- scale_QKT_gemm (optional) : S
+- scale of the gemm - scalar (per-tensor quantization)
+- scale_QKT_softmax (optional) : S
+- scale of the softmax result - scalar (per-tensor quantization)
+- scale_values_gemm : S
+- scale of the gemm - scalar (per-tensor quantization). Also this is the output scale for the operator.
+- mask_index (optional) : G
+- Attention mask with shape (batch_size, 1, max_sequence_length, max_sequence_length), (batch_size, past_sequence_length + sequence_length) or (batch_size, sequence_length, past_sequence_length + sequence_length), or index with shape (batch_size) or (2 * batch_size).
+- past (optional) : Q
+- past state for key and value with shape (2, batch_size, num_heads, past_sequence_length, head_size).
+- extra_add (optional) : S
+- additional add to QxK' with shape (batch_size, num_heads, sequence_length, sequence_length).
+
+
+#### Outputs
+
+
+- output : Q
+- 3D output tensor with shape (batch_size, sequence_length, hidden_size)
+
+
+#### Type Constraints
+
+
+- Q : tensor(int8)
+- Constrain input and output types to int8 tensors.
+- S : tensor(float)
+- Constrain scales to float32 tensors.
+- G : tensor(int32)
+- Constrain to integer types
+
+
+
### **com.microsoft.QOrderedGelu**
Ordered Quantize Gelu.
@@ -2928,9 +3080,9 @@ This version of the operator has been available since version 1 of the 'com.micr
- order_X : int
-- cublasLt order of input X. Default is ROW MAJOR.
+- cublasLt order of input X. Optional. See the schema of QuantizeWithOrder for order definition.
- order_Y : int
-- cublasLt order of matrix Y, must be same as order_X. Default is ROW MAJOR.
+- cublasLt order of matrix Y, must be same as order_X if specified together. Optional.
#### Inputs
@@ -2977,7 +3129,7 @@ This version of the operator has been available since version 1 of the 'com.micr
epsilon : float
The epsilon value to use to avoid division by zero.
order_X : int
-cublasLt order of input X. Default is ROW MAJOR.
+cublasLt order of input X. Default is ROW MAJOR. See the schema of QuantizeWithOrder for order definition.
order_Y : int
cublasLt order of matrix Y, must be same as order_X. Default is ROW MAJOR.
@@ -3016,9 +3168,9 @@ This version of the operator has been available since version 1 of the 'com.micr
-### **com.microsoft.QOrderedMatMul**
+### **com.microsoft.QOrderedLongformerAttention**
- TODO
+ Quantized version of Longformer Self Attention (using int8 with specific matrix Layout).
#### Version
@@ -3027,12 +3179,100 @@ This version of the operator has been available since version 1 of the 'com.micr
#### Attributes
-- order_A : int
-- cublasLt order of matrix A. Default is ROW MAJOR.
-- order_B : int
-- cublasLt order of matrix B. Default is ROW MAJOR.
-- order_Y : int
-- cublasLt order of matrix Y and optional matrix C. Default is ROW MAJOR.
+- num_heads : int (required)
+- Number of attention heads
+- order_global_weight : int (required)
+- cublasLt order of weight matrix
+- order_input : int (required)
+- cublasLt order of input matrix. See the schema of QuantizeWithOrder for order definition.
+- order_output : int (required)
+- cublasLt order of global bias
+- order_weight : int (required)
+- cublasLt order of weight matrix
+- window : int (required)
+- One-sided attention window length W, i.e. half of the total window length
+
+
+#### Inputs
+
+
+- input : Q
+- 3D input tensor with shape (batch_size, sequence_length, hidden_size), hidden_size = num_heads * head_size
+- scale_input : S
+- scale of the input
+- weight : Q
+- 2D input tensor with shape (hidden_size, 3 * hidden_size)
+- scale_weight : S
+- scale of the weight
+- bias : S
+- 1D input tensor with shape (3 * hidden_size), fp32 only currently.
+- scale_bias : S
+- Reserved (not used, because adding the bias needs a float value in cublasLt for normal order).
+- scale_qkv_gemm : S
+- scale of the output for fused kqv gemm
+- mask : F
+- Attention mask with shape (batch_size, sequence_length)
+- global_weight : Q
+- 2D input tensor with shape (hidden_size, 3 * hidden_size)
+- scale_global_weight : S
+- scale of the global_weight
+- global_bias : S
+- 1D input tensor with shape (3 * hidden_size)
+- scale_global_gemm : S
+- scale of the global_qkv_gemm
+- global : G
+- Global attention flags with shape (batch_size, sequence_length)
+- scale_output : S
+- scale of the output
+
+
+#### Outputs
+
+
+- output : Q
+- 3D output tensor with shape (batch_size, sequence_length, hidden_size)
+
+
+#### Type Constraints
+
+
+- Q : tensor(int8)
+- Constrain input and output types to int8 tensors.
+- S : tensor(float)
+- Constrain scales to float32 tensors.
+- G : tensor(int32)
+- Constrain to integer types
+- F : tensor(float16)
+- Be compatible with float version.
+
+
+
+### **com.microsoft.QOrderedMatMul**
+
+ Quantized (int8) MatMul with order. Implements Y = alpha * A * B + bias + beta * C. Matrices A, B, C and Y are all int8.
+ Two types of order combinations are supported:
+ *) When order_B is ORDER_COL, order_A must be ORDER_ROW.
+ bias is a float32 vector of size {#cols of Y}. C should be of batch 1 or batch_A. B could be of batch 1 or batch_A.
+ Note that B is reordered to ORDER_COL, or transposed; it is not transposed first and then reordered here.
+ *) When order_B is ORDER_COL4_4R2_8C or ORDER_COL32_2R_4R4, order_A must be ORDER_COL32.
+ MatMul will be implemented using alpha(A * B) + beta * C => Y.
+ bias is not supported here. B is in fact transposed first and then reordered into ORDER_COL4_4R2_8C or ORDER_COL32_2R_4R4 here.
+ order_Y and order_C will be the same as order_A.
+ Supports per-column quantized weights, i.e., scale_B is a 1-D vector of size [#cols of matrix B].
+
+#### Version
+
+This version of the operator has been available since version 1 of the 'com.microsoft' operator set.
+
+#### Attributes
+
+
+- order_A : int (required)
+- cublasLt order of matrix A. See the schema of QuantizeWithOrder for order definition.
+- order_B : int (required)
+- cublasLt order of matrix B
+- order_Y : int (required)
+- cublasLt order of matrix Y and optional matrix C
#### Inputs (5 - 8)
@@ -3041,19 +3281,19 @@ This version of the operator has been available since version 1 of the 'com.micr
A : Q
3-dimensional matrix A
scale_A : S
-scale of the input A
+scale of the input A.
B : Q
-2-dimensional matrix B
+2-dimensional matrix B. Transposed if order_B is ORDER_COL.
scale_B : S
-scale of the input B
+scale of the input B. Scalar or 1-D float32.
scale_Y : S
-scale of the output Y
+scale of the output Y.
bias (optional) : S
-1d bias
+1d bias, not scaled with scale_Y.
C (optional) : Q
3d or 2d matrix C. if 2d expand to 3d first. Shape[0] should be 1 or same as A.shape[0]
scale_C (optional) : S
-scale of the input A
+scale of the input C.
#### Outputs
@@ -3076,6 +3316,7 @@ This version of the operator has been available since version 1 of the 'com.micr
### **com.microsoft.QuantizeBFP**
The BFP quantization operator. It consumes a full precision tensor and computes an BFP tensor.
+ More documentation on the BFP format can be found in this paper: https://www.microsoft.com/en-us/research/publication/pushing-the-limits-of-narrow-precision-inferencing-at-cloud-scale-with-microsoft-floating-point/
#### Version
@@ -3086,8 +3327,8 @@ This version of the operator has been available since version 1 of the 'com.micr
- bfp_type : int (required)
- The type of BFP - must match with the BFPType enum
-- block_dims : list of ints
-- Numbers within a bounding box will span across these dimensions.Any dimension not in this list is the same for all numbers within a bounding box.As an example, consider a 2D tensor with shape [d0, d1] and block_dims equal to [1].Within a bounding box, all elements will be within the same row but will be from different columnns.The default is the last dimension.
+- block_dim : int
+- Each bounding box spans this dimension. Typically, the block dimension corresponds to the reduction dimension of the matrix multiplication that consumes the output of this operator. For example, for a 2D matrix multiplication A@W, QuantizeBFP(A) would use block_dim 1 and QuantizeBFP(W) would use block_dim 0. The default is the last dimension.
#### Inputs
@@ -3166,6 +3407,51 @@ This version of the operator has been available since version 1 of the 'com.micr
+### **com.microsoft.QuantizeWithOrder**
+
+ Quantize input matrix to specific layout used in cublaslt.
+
+#### Version
+
+This version of the operator has been available since version 1 of the 'com.microsoft' operator set.
+
+#### Attributes
+
+
+- order_input : int (required)
+- cublasLt order of input matrix. ORDER_COL = 0, ORDER_ROW = 1, ORDER_COL32 = 2, ORDER_COL4_4R2_8C = 3, ORDER_COL32_2R_4R4 = 4. Please refer to https://docs.nvidia.com/cuda/cublas/index.html#cublasLtOrder_t for their meaning.
+- order_output : int (required)
+- cublasLt order of output matrix.
+
+
+#### Inputs
+
+
+- input : F
+- TODO: input tensor of shape (ROWS, COLS). If it has fewer than 2 dimensions, it is broadcast to (1, X). If 3D, it is treated as (B, ROWS, COLS)
+- scale_input : S
+- scale of the input
+
+
+#### Outputs
+
+
+- output : Q
+- output tensor
+
+
+#### Type Constraints
+
+
+- Q : tensor(int8)
+- Constrain input and output types to int8 tensors.
+- F : tensor(float16), tensor(float)
+- Constrain to float types
+- S : tensor(float)
+- Constrain Scale to float32 types
+
+
+
### **com.microsoft.Range**
Creates a sequence of numbers that begins at `start` and extends by increments of `delta`
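
Reviewer note on the `mask_index` layouts documented for QOrderedAttention in the hunk above: the two index formats (right-side padding and left-side padding) can be derived from a raw 0/1 attention mask. The following is a minimal NumPy sketch of that reading, not the operator's implementation. The helper names `right_padding_mask_index` and `left_padding_mask_index` are hypothetical, and the sketch assumes padding is contiguous on one side only.

```python
import numpy as np


def right_padding_mask_index(raw_mask):
    """Hypothetical helper: raw_mask has shape (batch_size, total_length),
    1 for real tokens and 0 for padding, with padding on the right.
    Returns shape (batch_size,): the valid length of each sequence,
    which equals its exclusive end position."""
    raw_mask = np.asarray(raw_mask, dtype=np.int32)
    return raw_mask.sum(axis=1).astype(np.int32)


def left_padding_mask_index(raw_mask):
    """Hypothetical helper: same raw mask layout, but padding on the left.
    Returns shape (2 * batch_size,): the exclusive end positions
    followed by the inclusive start positions."""
    raw_mask = np.asarray(raw_mask, dtype=np.int32)
    batch_size, total_length = raw_mask.shape
    # With left padding every sequence ends at total_length (exclusive).
    ends = np.full(batch_size, total_length, dtype=np.int32)
    # The first real token sits right after the padding.
    starts = (total_length - raw_mask.sum(axis=1)).astype(np.int32)
    return np.concatenate([ends, starts])


right = right_padding_mask_index([[1, 1, 1, 0], [1, 1, 0, 0]])
left = left_padding_mask_index([[0, 1, 1, 1], [0, 0, 1, 1]])
print(right.tolist())  # [3, 2]
print(left.tolist())   # [4, 4, 1, 2]
```

Both results are int32, matching the `G : tensor(int32)` constraint listed for `mask_index`.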
diff --git a/onnxruntime/core/graph/contrib_ops/quantization_defs.cc b/onnxruntime/core/graph/contrib_ops/quantization_defs.cc
index 71e6a0ee65..62c0d6da3c 100644
--- a/onnxruntime/core/graph/contrib_ops/quantization_defs.cc
+++ b/onnxruntime/core/graph/contrib_ops/quantization_defs.cc
@@ -211,19 +211,21 @@ ONNX_MS_OPERATOR_SET_SCHEMA(DequantizeLinear, 1,
}));
static const char* QuantizeBFP_ver1_doc = R"DOC(
-The BFP quantization operator. It consumes a full precision tensor and computes an BFP tensor.)DOC";
+The BFP quantization operator. It consumes a full precision tensor and computes an BFP tensor.
+More documentation on the BFP format can be found in this paper: https://www.microsoft.com/en-us/research/publication/pushing-the-limits-of-narrow-precision-inferencing-at-cloud-scale-with-microsoft-floating-point/)DOC";
ONNX_MS_OPERATOR_SET_SCHEMA(
QuantizeBFP, 1,
OpSchema()
.Attr("bfp_type", "The type of BFP - must match with the BFPType enum", AttributeProto::INT)
- .Attr("block_dims",
- "Numbers within a bounding box will span across these dimensions."
- "Any dimension not in this list is the same for all numbers within a bounding box."
- "As an example, consider a 2D tensor with shape [d0, d1] and block_dims equal to [1]."
- "Within a bounding box, all elements will be within the same row but will be from different columnns."
+ .Attr("block_dim",
+ "Each bounding box spans this dimension. "
+ "Typically, the block dimension corresponds to the reduction dimension of the matrix multiplication that "
+ "consumes the output of this operator. "
+ "For example, for a 2D matrix multiplication A@W, QuantizeBFP(A) would use block_dim 1 and "
+ "QuantizeBFP(W) would use block_dim 0. "
"The default is the last dimension.",
- AttributeProto::INTS, std::vector<int64_t>{-1})
+ AttributeProto::INT, static_cast<int64_t>(-1))
.Input(0, "x", "N-D full precision input tensor to be quantized.", "T1")
.Output(0, "y", "1-D, contiguous BFP data", "T2")
.Output(1, "shape", "Shape of x", "T3")
@@ -254,19 +256,22 @@ ONNX_MS_OPERATOR_SET_SCHEMA(
}));
static const char* DequantizeBFP_ver1_doc = R"DOC(
-The BFP dequantization operator. It consumes the raw BFP data and some metadata such as the shape and strides of the original tensor and computes the dequantized tensor.)DOC";
+The BFP dequantization operator.
+It consumes the raw BFP data and some metadata such as the shape and strides of the original tensor and computes the dequantized tensor.
+More documentation on the BFP format can be found in this paper: https://www.microsoft.com/en-us/research/publication/pushing-the-limits-of-narrow-precision-inferencing-at-cloud-scale-with-microsoft-floating-point/)DOC";
ONNX_MS_OPERATOR_SET_SCHEMA(
DequantizeBFP, 1,
OpSchema()
.Attr("bfp_type", "The type of BFP - must match with the BFPType enum", AttributeProto::INT)
- .Attr("block_dims",
- "Numbers within a bounding box will span across these dimensions."
- "Any dimension not in this list is the same for all numbers within a bounding box."
- "As an example, consider a 2D tensor with shape [d0, d1] and block_dims equal to [1]."
- "Within a bounding box, all elements will be within the same row but will be from different columnns."
+ .Attr("block_dim",
+ "Each bounding box spans this dimension. "
+ "Typically, the block dimension corresponds to the reduction dimension of the matrix multiplication that "
+ "consumes the output of this operator. "
+ "For example, for a 2D matrix multiplication A@W, QuantizeBFP(A) would use block_dim 1 and "
+ "QuantizeBFP(W) would use block_dim 0. "
"The default is the last dimension.",
- AttributeProto::INTS, std::vector<int64_t>{-1})
+ AttributeProto::INT, static_cast<int64_t>(-1))
.Attr("dtype", "The datatype to dequantize to.", AttributeProto::INT,
static_cast<int64_t>(ONNX_NAMESPACE::TensorProto_DataType::TensorProto_DataType_FLOAT)) // default
.Input(0, "x", "1-D, contiguous, raw, BFP data to be de-quantized.", "T1")
@@ -975,51 +980,61 @@ ONNX_MS_OPERATOR_SET_SCHEMA(
.TypeConstraint("T", {"tensor(float)"}, "Constrain input and output types to float32 tensors.")
.TypeAndShapeInferenceFunction(EmbedLayerNormalizationShapeInference));
- ONNX_MS_OPERATOR_SET_SCHEMA(
- QuantizeWithOrder,
- 1,
- OpSchema()
- .SetDoc(R"DOC(Quantize input matrix to specific layout used in cublaslt.)DOC")
- .Attr("order_input",
- "cublasLt order of input matrix. ORDER_COL = 0, ORDER_ROW = 1, ORDER_COL32 = 2, ORDER_COL4_4R2_8C = 3, ORDER_COL32_2R_4R4 = 4. "
- "Please refer https://docs.nvidia.com/cuda/cublas/index.html#cublasLtOrder_t for their meaning.",
- AttributeProto::INT)
- .Attr("order_output", "cublasLt order of output matrix.", AttributeProto::INT)
- .Input(0, "input", "TODO: input tensor of (ROWS, COLS). if less than 2d, will broadcast to (1, X). If 3d, it is treated as (B, ROWS, COS)", "F")
- .Input(1, "scale_input", "scale of the input", "S")
- .Output(0, "output", "output tensor", "Q")
- .TypeConstraint("Q", {"tensor(int8)"}, "Constrain input and output types to int8 tensors.")
- .TypeConstraint("F", {"tensor(float16)", "tensor(float)"}, "Constrain to float types")
- .TypeConstraint("S", {"tensor(float)"}, "Constrain Scale to float32 types")
- .TypeAndShapeInferenceFunction([](ONNX_NAMESPACE::InferenceContext& ctx) {
- propagateElemTypeFromDtypeToOutput(ctx, ONNX_NAMESPACE::TensorProto::INT8, 0);
- if (!hasInputShape(ctx, 0)) return;
- auto& input_shape = getInputShape(ctx, 0);
- updateOutputShape(ctx, 0, input_shape);
- }));
+ONNX_MS_OPERATOR_SET_SCHEMA(
+ QuantizeWithOrder, 1,
+ OpSchema()
+ .SetDoc(R"DOC(Quantize input matrix to specific layout used in cublaslt.)DOC")
+ .Attr("order_input",
+ "cublasLt order of input matrix. ORDER_COL = 0, ORDER_ROW = 1, ORDER_COL32 = 2, ORDER_COL4_4R2_8C = 3, "
+ "ORDER_COL32_2R_4R4 = 4. "
+ "Please refer to https://docs.nvidia.com/cuda/cublas/index.html#cublasLtOrder_t for their meaning.",
+ AttributeProto::INT)
+ .Attr("order_output", "cublasLt order of output matrix.", AttributeProto::INT)
+ .Input(0, "input",
+ "TODO: input tensor of shape (ROWS, COLS). If it has fewer than 2 dimensions, it is broadcast to (1, X). If 3D, it is treated as "
+ "(B, ROWS, COLS)",
+ "F")
+ .Input(1, "scale_input", "scale of the input", "S")
+ .Output(0, "output", "output tensor", "Q")
+ .TypeConstraint("Q", {"tensor(int8)"}, "Constrain input and output types to int8 tensors.")
+ .TypeConstraint("F", {"tensor(float16)", "tensor(float)"}, "Constrain to float types")
+ .TypeConstraint("S", {"tensor(float)"}, "Constrain Scale to float32 types")
+ .TypeAndShapeInferenceFunction([](ONNX_NAMESPACE::InferenceContext& ctx) {
+ propagateElemTypeFromDtypeToOutput(ctx, ONNX_NAMESPACE::TensorProto::INT8, 0);
+ if (!hasInputShape(ctx, 0)) return;
+ auto& input_shape = getInputShape(ctx, 0);
+ updateOutputShape(ctx, 0, input_shape);
+ }));
- ONNX_MS_OPERATOR_SET_SCHEMA(
- DequantizeWithOrder,
- 1,
- OpSchema()
- .SetDoc(R"DOC(Dequantize input matrix to specific layout used in cublaslt. attr to specify output type, float16 or float32)DOC")
- .Attr("order_input", "cublasLt order of input matrix. See the schema of QuantizeWithOrder for order definition.", AttributeProto::INT)
- .Attr("order_output", "cublasLt order of output matrix", AttributeProto::INT)
- .Attr("to", "The output data type, only support TensorProto_DataType_FLOAT (1) and TensorProto_DataType_FLOAT16 (10)", AttributeProto::INT)
- .Input(0, "input", "TODO: input tensor of (ROWS, COLS). if less than 2d, will broadcast to (1, X). If 3d, it is treated as (B, ROWS, COS)", "Q")
- .Input(1, "scale_input", "scale of the input", "S")
- .Output(0, "output", "output tensor", "F")
- .TypeConstraint("Q", {"tensor(int8)"}, "Constrain input and output types to int8 tensors.")
- .TypeConstraint("F", {"tensor(float16)", "tensor(float)"}, "Constrain to float types")
- .TypeConstraint("S", {"tensor(float)"}, "Constrain Scale to float32 types")
- .TypeAndShapeInferenceFunction([](ONNX_NAMESPACE::InferenceContext& ctx) {
- propagateElemTypeFromAttributeToOutput(ctx, "to", 0);
- if (!hasInputShape(ctx, 0)) return;
- auto& input_shape = getInputShape(ctx, 0);
- updateOutputShape(ctx, 0, input_shape);
- }));
+ONNX_MS_OPERATOR_SET_SCHEMA(
+ DequantizeWithOrder, 1,
+ OpSchema()
+ .SetDoc(
+ R"DOC(Dequantize the input matrix to a specific layout used in cublasLt. The `to` attribute specifies the output type, float16 or float32.)DOC")
+ .Attr("order_input",
+ "cublasLt order of input matrix. See the schema of QuantizeWithOrder for order definition.",
+ AttributeProto::INT)
+ .Attr("order_output", "cublasLt order of output matrix", AttributeProto::INT)
+ .Attr("to",
+ "The output data type; only TensorProto_DataType_FLOAT (1) and TensorProto_DataType_FLOAT16 (10) are supported",
+ AttributeProto::INT)
+ .Input(0, "input",
+ "TODO: input tensor of shape (ROWS, COLS). If it has fewer than 2 dimensions, it is broadcast to (1, X). If 3D, it is treated as "
+ "(B, ROWS, COLS)",
+ "Q")
+ .Input(1, "scale_input", "scale of the input", "S")
+ .Output(0, "output", "output tensor", "F")
+ .TypeConstraint("Q", {"tensor(int8)"}, "Constrain input and output types to int8 tensors.")
+ .TypeConstraint("F", {"tensor(float16)", "tensor(float)"}, "Constrain to float types")
+ .TypeConstraint("S", {"tensor(float)"}, "Constrain Scale to float32 types")
+ .TypeAndShapeInferenceFunction([](ONNX_NAMESPACE::InferenceContext& ctx) {
+ propagateElemTypeFromAttributeToOutput(ctx, "to", 0);
+ if (!hasInputShape(ctx, 0)) return;
+ auto& input_shape = getInputShape(ctx, 0);
+ updateOutputShape(ctx, 0, input_shape);
+ }));
- constexpr const char* QOrderedMatMul_ver1_doc = R"DOC(
+constexpr const char* QOrderedMatMul_ver1_doc = R"DOC(
Quantize (Int8) MatMul with order. Implement Y = alpha * A * B + bias + beta * C. Matrix A, B, C, Y are all int8 matrix.
Two type of order combination supported:
*) When order_B is ORDER_COL, order_A must be ORDER_ROW.
@@ -1032,31 +1047,32 @@ order_Y and order_C will be same as order_A.
Support per column quantized weight, ie, scale_B is 1-D vector of size [#cols of matrix B].
)DOC";
- ONNX_MS_OPERATOR_SET_SCHEMA(
- QOrderedMatMul,
- 1,
- OpSchema()
- .SetDoc(QOrderedMatMul_ver1_doc)
- .Attr("order_A", "cublasLt order of matrix A. See the schema of QuantizeWithOrder for order definition.", AttributeProto::INT)
- .Attr("order_B", "cublasLt order of matrix B", AttributeProto::INT)
- .Attr("order_Y", "cublasLt order of matrix Y and optional matrix C", AttributeProto::INT)
- .Input(0, "A", "3-dimensional matrix A", "Q")
- .Input(1, "scale_A", "scale of the input A.", "S")
- .Input(2, "B", "2-dimensional matrix B. Transposed if order_B is ORDER_COL.", "Q")
- .Input(3, "scale_B", "scale of the input B. Scalar or 1-D float32.", "S")
- .Input(4, "scale_Y", "scale of the output Y.", "S")
- .Input(5, "bias", "1d bias, not scaled with scale_Y.", "S", OpSchema::Optional)
- .Input(6, "C", "3d or 2d matrix C. if 2d expand to 3d first. Shape[0] should be 1 or same as A.shape[0] ", "Q", OpSchema::Optional)
- .Input(7, "scale_C", "scale of the input A.", "S", OpSchema::Optional)
- .Output(0, "Y", "Matrix multiply results from A * B", "Q")
- .TypeConstraint("Q", {"tensor(int8)"}, "Constrain input and output types to int8 tensors.")
- .TypeConstraint("S", {"tensor(float)"}, "Constrain bias and scales to float32")
- .TypeAndShapeInferenceFunction([](ONNX_NAMESPACE::InferenceContext& ctx) {
- propagateElemTypeFromInputToOutput(ctx, 0, 0);
- ONNX_NAMESPACE::matmulShapeInference(ctx, 0, 2);
- }));
+ONNX_MS_OPERATOR_SET_SCHEMA(
+ QOrderedMatMul, 1,
+ OpSchema()
+ .SetDoc(QOrderedMatMul_ver1_doc)
+ .Attr("order_A", "cublasLt order of matrix A. See the schema of QuantizeWithOrder for order definition.",
+ AttributeProto::INT)
+ .Attr("order_B", "cublasLt order of matrix B", AttributeProto::INT)
+ .Attr("order_Y", "cublasLt order of matrix Y and optional matrix C", AttributeProto::INT)
+ .Input(0, "A", "3-dimensional matrix A", "Q")
+ .Input(1, "scale_A", "scale of the input A.", "S")
+ .Input(2, "B", "2-dimensional matrix B. Transposed if order_B is ORDER_COL.", "Q")
+ .Input(3, "scale_B", "scale of the input B. Scalar or 1-D float32.", "S")
+ .Input(4, "scale_Y", "scale of the output Y.", "S")
+ .Input(5, "bias", "1d bias, not scaled with scale_Y.", "S", OpSchema::Optional)
+ .Input(6, "C", "3d or 2d matrix C. if 2d expand to 3d first. Shape[0] should be 1 or same as A.shape[0] ", "Q",
+ OpSchema::Optional)
+ .Input(7, "scale_C", "scale of the input C.", "S", OpSchema::Optional)
+ .Output(0, "Y", "Matrix multiply results from A * B", "Q")
+ .TypeConstraint("Q", {"tensor(int8)"}, "Constrain input and output types to int8 tensors.")
+ .TypeConstraint("S", {"tensor(float)"}, "Constrain bias and scales to float32")
+ .TypeAndShapeInferenceFunction([](ONNX_NAMESPACE::InferenceContext& ctx) {
+ propagateElemTypeFromInputToOutput(ctx, 0, 0);
+ ONNX_NAMESPACE::matmulShapeInference(ctx, 0, 2);
+ }));
- static const char* Attention_QOrdered_doc = R"DOC(
+static const char* Attention_QOrdered_doc = R"DOC(
Quantized version of simplified Multi-Head Self Attention(using int8 with specific matrix Layout).
Multi-Head Self Attention that can be either unidirectional (like GPT-2) or bidirectional (like BERT).
The mask_index input is optional. Besides raw attention mask with shape (batch_size, past_sequence_length + sequence_length)
@@ -1070,128 +1086,159 @@ Current version does not support past/present, extra_add and qkv_hidden_sizes.
TODO: Support them if needed in the future.
)DOC";
- ONNX_MS_OPERATOR_SET_SCHEMA(
- QOrderedAttention,
- 1,
- OpSchema()
- .SetDoc(Attention_QOrdered_doc)
- .Attr("num_heads", "Number of attention heads", AttributeProto::INT)
- .Attr("unidirectional", "Whether every token can only attend to previous tokens. Default value is 0.", AttributeProto::INT, static_cast<int64_t>(0))
- .Attr("qkv_hidden_sizes", "Hidden layer sizes of Q, K, V paths in Attention", AttributeProto::INTS, OPTIONAL_VALUE)
- .Attr("order_input", "cublasLt order of input matrix. See the schema of QuantizeWithOrder for order definition.", AttributeProto::INT)
- .Attr("order_weight", "cublasLt order of weight matrix", AttributeProto::INT)
- .Attr("order_output", "cublasLt order of global bias", AttributeProto::INT)
- .Input(0, "input", "3D input tensor with shape (batch_size, sequence_length, input_hidden_size)", "Q")
- .Input(1, "scale_input", "scale of the input, scalar value (per tensor) currently.", "S")
- .Input(2, "scale_Q_gemm", "scale of the gemm - scalar (per-tensor quantization)", "S")
- .Input(3, "scale_K_gemm", "scale of the gemm - scalar (per-tensor quantization)", "S")
- .Input(4, "scale_V_gemm", "scale of the gemm - scalar (per-tensor quantization)", "S")
- .Input(5, "Q_weight", "2D input tensor with shape (input_hidden_size, hidden_size), where hidden_size = num_heads * head_size", "Q")
- .Input(6, "K_weight", "2D input tensor with shape (input_hidden_size, hidden_size), where hidden_size = num_heads * head_size", "Q")
- .Input(7, "V_weight", "2D input tensor with shape (input_hidden_size, hidden_size), where hidden_size = num_heads * head_size", "Q")
- .Input(8, "scale_Q_weight", "scale of the weight (scalar for per-tensor quantization or 1-D of dims [hidden_size] for per-channel quantization)", "S")
- .Input(9, "scale_K_weight", "scale of the weight (scalar for per-tensor quantization or 1-D of dims [hidden_size] for per-channel quantization)", "S")
- .Input(10, "scale_V_weight", "scale of the weight (scalar for per-tensor quantization or 1-D of dims [hidden_size] for per-channel quantization)", "S")
- .Input(11, "Q_bias", "1D input tensor with shape (hidden_size)", "S")
- .Input(12, "K_bias", "1D input tensor with shape (hidden_size)", "S")
- .Input(13, "V_bias", "1D input tensor with shape (hidden_size)", "S")
- .Input(14, "scale_QKT_gemm", "scale of the gemm - scalar (per-tensor quantization)", "S", OpSchema::Optional)
- .Input(15, "scale_QKT_softmax", "scale of the softmax result - scalar (per-tensor quantization)", "S", OpSchema::Optional)
- .Input(16, "scale_values_gemm", "scale of the gemm - scalar (per-tensor quantization). Also this is the output scale for the operator.", "S")
- .Input(17, "mask_index",
- "Attention mask with shape (batch_size, 1, max_sequence_length, max_sequence_length), (batch_size, past_sequence_length + sequence_length)"
- "or (batch_size, sequence_length, past_sequence_length + sequence_length), or index with shape (batch_size) or (2 * batch_size).",
- "G", OpSchema::Optional)
- .Input(18, "past", "past state for key and value with shape (2, batch_size, num_heads, past_sequence_length, head_size).", "Q", OpSchema::Optional)
- .Input(19, "extra_add", "additional add to QxK' with shape (batch_size, num_heads, sequence_length, sequence_length).", "S", OpSchema::Optional)
- .Output(0, "output", "3D output tensor with shape (batch_size, sequence_length, hidden_size)", "Q")
- .TypeConstraint("Q", {"tensor(int8)"}, "Constrain input and output types to int8 tensors.")
- .TypeConstraint("S", {"tensor(float)"}, "Constrain scales to float32 tensors.")
- .TypeConstraint("G", {"tensor(int32)"}, "Constrain to integer types")
- .TypeAndShapeInferenceFunction(ONNX_NAMESPACE::propagateShapeAndTypeFromFirstInput));
+ONNX_MS_OPERATOR_SET_SCHEMA(
+ QOrderedAttention, 1,
+ OpSchema()
+ .SetDoc(Attention_QOrdered_doc)
+ .Attr("num_heads", "Number of attention heads", AttributeProto::INT)
+ .Attr("unidirectional", "Whether every token can only attend to previous tokens. Default value is 0.",
+ AttributeProto::INT, static_cast<int64_t>(0))
+ .Attr("qkv_hidden_sizes", "Hidden layer sizes of Q, K, V paths in Attention", AttributeProto::INTS,
+ OPTIONAL_VALUE)
+ .Attr("order_input",
+ "cublasLt order of input matrix. See the schema of QuantizeWithOrder for order definition.",
+ AttributeProto::INT)
+ .Attr("order_weight", "cublasLt order of weight matrix", AttributeProto::INT)
+ .Attr("order_output", "cublasLt order of output matrix", AttributeProto::INT)
+ .Input(0, "input", "3D input tensor with shape (batch_size, sequence_length, input_hidden_size)", "Q")
+ .Input(1, "scale_input", "scale of the input, scalar value (per tensor) currently.", "S")
+ .Input(2, "scale_Q_gemm", "scale of the gemm - scalar (per-tensor quantization)", "S")
+ .Input(3, "scale_K_gemm", "scale of the gemm - scalar (per-tensor quantization)", "S")
+ .Input(4, "scale_V_gemm", "scale of the gemm - scalar (per-tensor quantization)", "S")
+ .Input(5, "Q_weight",
+ "2D input tensor with shape (input_hidden_size, hidden_size), where hidden_size = num_heads * head_size",
+ "Q")
+ .Input(6, "K_weight",
+ "2D input tensor with shape (input_hidden_size, hidden_size), where hidden_size = num_heads * head_size",
+ "Q")
+ .Input(7, "V_weight",
+ "2D input tensor with shape (input_hidden_size, hidden_size), where hidden_size = num_heads * head_size",
+ "Q")
+ .Input(8, "scale_Q_weight",
+ "scale of the weight (scalar for per-tensor quantization or 1-D of dims [hidden_size] for per-channel "
+ "quantization)",
+ "S")
+ .Input(9, "scale_K_weight",
+ "scale of the weight (scalar for per-tensor quantization or 1-D of dims [hidden_size] for per-channel "
+ "quantization)",
+ "S")
+ .Input(10, "scale_V_weight",
+ "scale of the weight (scalar for per-tensor quantization or 1-D of dims [hidden_size] for per-channel "
+ "quantization)",
+ "S")
+ .Input(11, "Q_bias", "1D input tensor with shape (hidden_size)", "S")
+ .Input(12, "K_bias", "1D input tensor with shape (hidden_size)", "S")
+ .Input(13, "V_bias", "1D input tensor with shape (hidden_size)", "S")
+ .Input(14, "scale_QKT_gemm", "scale of the gemm - scalar (per-tensor quantization)", "S", OpSchema::Optional)
+ .Input(15, "scale_QKT_softmax", "scale of the softmax result - scalar (per-tensor quantization)", "S",
+ OpSchema::Optional)
+ .Input(16, "scale_values_gemm",
+ "scale of the gemm - scalar (per-tensor quantization). Also this is the output scale for the operator.",
+ "S")
+ .Input(17, "mask_index",
+ "Attention mask with shape (batch_size, 1, max_sequence_length, max_sequence_length), (batch_size, "
+ "past_sequence_length + sequence_length) "
+ "or (batch_size, sequence_length, past_sequence_length + sequence_length), or index with shape "
+ "(batch_size) or (2 * batch_size).",
+ "G", OpSchema::Optional)
+ .Input(18, "past",
+ "past state for key and value with shape (2, batch_size, num_heads, past_sequence_length, head_size).",
+ "Q", OpSchema::Optional)
+ .Input(19, "extra_add",
+ "additional add to QxK' with shape (batch_size, num_heads, sequence_length, sequence_length).", "S",
+ OpSchema::Optional)
+ .Output(0, "output", "3D output tensor with shape (batch_size, sequence_length, hidden_size)", "Q")
+ .TypeConstraint("Q", {"tensor(int8)"}, "Constrain input and output types to int8 tensors.")
+ .TypeConstraint("S", {"tensor(float)"}, "Constrain scales to float32 tensors.")
+ .TypeConstraint("G", {"tensor(int32)"}, "Constrain to integer types")
+ .TypeAndShapeInferenceFunction(ONNX_NAMESPACE::propagateShapeAndTypeFromFirstInput));
- ONNX_MS_OPERATOR_SET_SCHEMA(
- QOrderedLayerNormalization,
- 1,
- OpSchema()
- .SetDoc("QOrderedLayerNormalization")
- .Attr("axis",
- "The first normalization dimension: normalization "
- "will be performed along dimensions axis "
- ": rank(inputs).",
- AttributeProto::INT,
- static_cast<int64_t>(-1))
- .Attr("epsilon", "The epsilon value to use to avoid division by zero.",
- AttributeProto::FLOAT, 1e-5f)
- .Attr("order_X", "cublasLt order of input X. Default is ROW MAJOR. See the schema of QuantizeWithOrder for order definition.",
- AttributeProto::INT, static_cast<int64_t>(1))
- .Attr("order_Y", "cublasLt order of matrix Y, must be same as order_X. Default is ROW MAJOR.",
- AttributeProto::INT, static_cast<int64_t>(1))
- .AllowUncheckedAttributes()
- .Input(0, "X", "Input data tensor from the previous layer.", "Q")
- .Input(1, "scale_X", "scale of the quantized X", "S")
- .Input(2, "scale", "Scale tensor, i.e., gamma vector.", "F")
- .Input(3, "B", "Bias tensor.", "F", OpSchema::Optional)
- .Input(4, "scale_Y", "scale of the quantized X", "S")
- .Output(0, "Y", "Output data tensor.", "Q")
- .TypeConstraint("F", {"tensor(float16)", "tensor(float)"},
- "Constrain input gamma and bias could be float16/float tensors. "
- "float may get better precision, float16 runs faster.")
- .TypeConstraint("S", {"tensor(float)"}, "quantization scale must be float tensors.")
- .TypeConstraint("Q", {"tensor(int8)"}, "quantization tensor must be int8 tensors.")
- .TypeAndShapeInferenceFunction([](ONNX_NAMESPACE::InferenceContext& ctx) {
- propagateShapeAndTypeFromFirstInput(ctx);
- propagateElemTypeFromInputToOutput(ctx, 0, 0);
- }));
+ONNX_MS_OPERATOR_SET_SCHEMA(QOrderedLayerNormalization, 1,
+ OpSchema()
+ .SetDoc("QOrderedLayerNormalization")
+ .Attr("axis",
+ "The first normalization dimension: normalization "
+ "will be performed along dimensions axis "
+ ": rank(inputs).",
+ AttributeProto::INT, static_cast<int64_t>(-1))
+ .Attr("epsilon", "The epsilon value to use to avoid division by zero.",
+ AttributeProto::FLOAT, 1e-5f)
+ .Attr("order_X",
+ "cublasLt order of input X. Default is ROW MAJOR. See the schema of "
+ "QuantizeWithOrder for order definition.",
+ AttributeProto::INT, static_cast<int64_t>(1))
+ .Attr("order_Y",
+ "cublasLt order of matrix Y, must be same as order_X. Default is ROW MAJOR.",
+ AttributeProto::INT, static_cast<int64_t>(1))
+ .AllowUncheckedAttributes()
+ .Input(0, "X", "Input data tensor from the previous layer.", "Q")
+ .Input(1, "scale_X", "scale of the quantized X", "S")
+ .Input(2, "scale", "Scale tensor, i.e., gamma vector.", "F")
+ .Input(3, "B", "Bias tensor.", "F", OpSchema::Optional)
+ .Input(4, "scale_Y", "scale of the quantized Y", "S")
+ .Output(0, "Y", "Output data tensor.", "Q")
+ .TypeConstraint("F", {"tensor(float16)", "tensor(float)"},
+ "Constrain input gamma and bias to float16/float tensors. "
+ "float may get better precision, float16 runs faster.")
+ .TypeConstraint("S", {"tensor(float)"}, "quantization scale must be float tensors.")
+ .TypeConstraint("Q", {"tensor(int8)"}, "quantization tensor must be int8 tensors.")
+ .TypeAndShapeInferenceFunction([](ONNX_NAMESPACE::InferenceContext& ctx) {
+ propagateShapeAndTypeFromFirstInput(ctx);
+ propagateElemTypeFromInputToOutput(ctx, 0, 0);
+ }));
- ONNX_MS_OPERATOR_SET_SCHEMA(
- QOrderedGelu,
- 1,
- OpSchema()
- .SetDoc(R"DOC(Ordered Quantize Gelu.)DOC")
- .Attr("order_X", "cublasLt order of input X. Optional. See the schema of QuantizeWithOrder for order definition.",
- AttributeProto::INT, OPTIONAL_VALUE)
- .Attr("order_Y", "cublasLt order of matrix Y, must be same as order_X if specified together. Optional.",
- AttributeProto::INT, OPTIONAL_VALUE)
- .Input(0, "X", "N-dimensional input A", "Q")
- .Input(1, "scale_X", "scale of the input A", "S")
- .Input(2, "scale_Y", "scale of the output Y", "S")
- .Output(0, "Y", "Output of the Gelu", "Q")
- .TypeConstraint("Q", {"tensor(int8)"}, "Constrain input and output types to int8 tensors.")
- .TypeConstraint("S", {"tensor(float)"}, "Constrain scales to float32")
- .TypeAndShapeInferenceFunction(ONNX_NAMESPACE::propagateShapeAndTypeFromFirstInput));
+ONNX_MS_OPERATOR_SET_SCHEMA(
+ QOrderedGelu, 1,
+ OpSchema()
+ .SetDoc(R"DOC(Ordered Quantize Gelu.)DOC")
+ .Attr("order_X",
+ "cublasLt order of input X. Optional. See the schema of QuantizeWithOrder for order definition.",
+ AttributeProto::INT, OPTIONAL_VALUE)
+ .Attr("order_Y", "cublasLt order of matrix Y, must be same as order_X if specified together. Optional.",
+ AttributeProto::INT, OPTIONAL_VALUE)
+ .Input(0, "X", "N-dimensional input X", "Q")
+ .Input(1, "scale_X", "scale of the input X", "S")
+ .Input(2, "scale_Y", "scale of the output Y", "S")
+ .Output(0, "Y", "Output of the Gelu", "Q")
+ .TypeConstraint("Q", {"tensor(int8)"}, "Constrain input and output types to int8 tensors.")
+ .TypeConstraint("S", {"tensor(float)"}, "Constrain scales to float32")
+ .TypeAndShapeInferenceFunction(ONNX_NAMESPACE::propagateShapeAndTypeFromFirstInput));
- ONNX_MS_OPERATOR_SET_SCHEMA(
- QOrderedLongformerAttention,
- 1,
- OpSchema()
- .SetDoc(R"DOC(Quantized version of Longformer Self Attention (using int8 with specific matrix Layout).)DOC")
- .Attr("num_heads", "Number of attention heads", AttributeProto::INT)
- .Attr("window", "One sided attention windows length W, or half of total window length", AttributeProto::INT)
- .Attr("order_input", "cublasLt order of input matrix. See the schema of QuantizeWithOrder for order definition.", AttributeProto::INT)
- .Attr("order_weight", "cublasLt order of weight matrix", AttributeProto::INT)
- .Attr("order_global_weight", "cublasLt order of weight matrix", AttributeProto::INT)
- .Attr("order_output", "cublasLt order of global bias", AttributeProto::INT)
- .Input(0, "input", "3D input tensor with shape (batch_size, sequence_length, hidden_size), hidden_size = num_heads * head_size", "Q")
- .Input(1, "scale_input", "scale of the input", "S")
- .Input(2, "weight", "2D input tensor with shape (hidden_size, 3 * hidden_size)", "Q")
- .Input(3, "scale_weight", "scale of the weight", "S")
- .Input(4, "bias", "1D input tensor with shape (3 * hidden_size), fp32 only currently.", "S")
- .Input(5, "scale_bias", "reserved. (not used as add bias need float value in cublasLt for normal order.)", "S")
- .Input(6, "scale_qkv_gemm", "scale of the output for fused kqv gemm", "S")
- .Input(7, "mask", "Attention mask with shape (batch_size, sequence_length)", "F")
- .Input(8, "global_weight", "2D input tensor with shape (hidden_size, 3 * hidden_size)", "Q")
- .Input(9, "scale_global_weight", "scale of the global_weight", "S")
- .Input(10, "global_bias", "1D input tensor with shape (3 * hidden_size)", "S")
- .Input(11, "scale_global_gemm", "scale of the global_qkv_gemm", "S")
- .Input(12, "global", "Global attention flags with shape (batch_size, sequence_length)", "G")
- .Input(13, "scale_output", "scale of the output", "S")
- .Output(0, "output", "3D output tensor with shape (batch_size, sequence_length, hidden_size)", "Q")
- .TypeConstraint("Q", {"tensor(int8)"}, "Constrain input and output types to int8 tensors.")
- .TypeConstraint("S", {"tensor(float)"}, "Constrain scales to float32 tensors.")
- .TypeConstraint("G", {"tensor(int32)"}, "Constrain to integer types")
- .TypeConstraint("F", {"tensor(float16)"}, "Be compatible with float version.")
- .TypeAndShapeInferenceFunction(ONNX_NAMESPACE::propagateShapeAndTypeFromFirstInput));
+ONNX_MS_OPERATOR_SET_SCHEMA(
+ QOrderedLongformerAttention, 1,
+ OpSchema()
+ .SetDoc(R"DOC(Quantized version of Longformer Self Attention (using int8 with specific matrix Layout).)DOC")
+ .Attr("num_heads", "Number of attention heads", AttributeProto::INT)
+ .Attr("window", "One-sided attention window length W, or half of total window length", AttributeProto::INT)
+ .Attr("order_input",
+ "cublasLt order of input matrix. See the schema of QuantizeWithOrder for order definition.",
+ AttributeProto::INT)
+ .Attr("order_weight", "cublasLt order of weight matrix", AttributeProto::INT)
+ .Attr("order_global_weight", "cublasLt order of global weight matrix", AttributeProto::INT)
+ .Attr("order_output", "cublasLt order of output matrix", AttributeProto::INT)
+ .Input(0, "input",
+ "3D input tensor with shape (batch_size, sequence_length, hidden_size), hidden_size = num_heads * "
+ "head_size",
+ "Q")
+ .Input(1, "scale_input", "scale of the input", "S")
+ .Input(2, "weight", "2D input tensor with shape (hidden_size, 3 * hidden_size)", "Q")
+ .Input(3, "scale_weight", "scale of the weight", "S")
+ .Input(4, "bias", "1D input tensor with shape (3 * hidden_size), fp32 only currently.", "S")
+ .Input(5, "scale_bias", "Reserved (not used, as adding the bias needs a float value in cublasLt for normal order).", "S")
+ .Input(6, "scale_qkv_gemm", "scale of the output for fused qkv gemm", "S")
+ .Input(7, "mask", "Attention mask with shape (batch_size, sequence_length)", "F")
+ .Input(8, "global_weight", "2D input tensor with shape (hidden_size, 3 * hidden_size)", "Q")
+ .Input(9, "scale_global_weight", "scale of the global_weight", "S")
+ .Input(10, "global_bias", "1D input tensor with shape (3 * hidden_size)", "S")
+ .Input(11, "scale_global_gemm", "scale of the global_qkv_gemm", "S")
+ .Input(12, "global", "Global attention flags with shape (batch_size, sequence_length)", "G")
+ .Input(13, "scale_output", "scale of the output", "S")
+ .Output(0, "output", "3D output tensor with shape (batch_size, sequence_length, hidden_size)", "Q")
+ .TypeConstraint("Q", {"tensor(int8)"}, "Constrain input and output types to int8 tensors.")
+ .TypeConstraint("S", {"tensor(float)"}, "Constrain scales to float32 tensors.")
+ .TypeConstraint("G", {"tensor(int32)"}, "Constrain to integer types")
+ .TypeConstraint("F", {"tensor(float16)"}, "Be compatible with float version.")
+ .TypeAndShapeInferenceFunction(ONNX_NAMESPACE::propagateShapeAndTypeFromFirstInput));
} // namespace contrib
} // namespace onnxruntime
diff --git a/onnxruntime/test/contrib_ops/quantize_bfp_test.cc b/onnxruntime/test/contrib_ops/quantize_bfp_test.cc
index e13a933313..8e259f6735 100644
--- a/onnxruntime/test/contrib_ops/quantize_bfp_test.cc
+++ b/onnxruntime/test/contrib_ops/quantize_bfp_test.cc
@@ -34,11 +34,11 @@ TEST(QuantizeBFPTest, CreateQuantizeGraph) {
 bfp_type.set_i(static_cast<int64_t>(onnxruntime::contrib::BFPType::BFP_1_8_8_16));
bfp_type.set_type(ONNX_NAMESPACE::AttributeProto_AttributeType::AttributeProto_AttributeType_INT);
attributes["bfp_type"] = bfp_type;
- ONNX_NAMESPACE::AttributeProto block_dims;
- block_dims.set_name("block_dims");
- block_dims.add_ints(1); // bounding box is over dimension 1
- block_dims.set_type(ONNX_NAMESPACE::AttributeProto_AttributeType::AttributeProto_AttributeType_INTS);
- attributes["block_dims"] = block_dims;
+ ONNX_NAMESPACE::AttributeProto block_dim;
+ block_dim.set_name("block_dim");
+ block_dim.set_i(1); // bounding box is over dimension 1
+ block_dim.set_type(ONNX_NAMESPACE::AttributeProto_AttributeType::AttributeProto_AttributeType_INT);
+ attributes["block_dim"] = block_dim;
std::vector output_defs;
ONNX_NAMESPACE::TypeProto y_byte;
@@ -91,11 +91,11 @@ TEST(DequantizeBFPTest, CreateDequantizeGraph) {
 bfp_type.set_i(static_cast<int64_t>(onnxruntime::contrib::BFPType::BFP_1_8_8_16));
bfp_type.set_type(ONNX_NAMESPACE::AttributeProto_AttributeType::AttributeProto_AttributeType_INT);
attributes["bfp_type"] = bfp_type;
- ONNX_NAMESPACE::AttributeProto block_dims;
- block_dims.set_name("block_dims");
- block_dims.add_ints(1); // bounding box is over dimension 1
- block_dims.set_type(ONNX_NAMESPACE::AttributeProto_AttributeType::AttributeProto_AttributeType_INTS);
- attributes["block_dims"] = block_dims;
+ ONNX_NAMESPACE::AttributeProto block_dim;
+ block_dim.set_name("block_dim");
+ block_dim.set_i(1); // bounding box is over dimension 1
+ block_dim.set_type(ONNX_NAMESPACE::AttributeProto_AttributeType::AttributeProto_AttributeType_INT);
+ attributes["block_dim"] = block_dim;
ONNX_NAMESPACE::AttributeProto dtype;
dtype.set_name("dtype");
 dtype.set_i(static_cast<int64_t>(ONNX_NAMESPACE::TensorProto_DataType_FLOAT));