*This file is automatically generated from the registered contrib operator schemas by [this script](https://github.com/microsoft/onnxruntime/blob/main/tools/python/gen_contrib_doc.py).
<dd>Attention mask with shape (batch_size, 1, max_sequence_length, max_sequence_length), (batch_size, total_sequence_length) or (batch_size, sequence_length, total_sequence_length), or index with shape (batch_size) or (2 * batch_size)</dd>
<dd>past state for key and value with shape (2, batch_size, num_heads, past_sequence_length, head_size)When past_present_share_buffer is set, its shape is (2, batch_size, num_heads, max_sequence_length, head_size)</dd>
<dd>past state for key and value with shape (2, batch_size, num_heads, total_sequence_length, head_size). If past_present_share_buffer is set, its shape is (2, batch_size, num_heads, max_sequence_length, head_size), while effective_seq_length = (past_sequence_length + kv_sequence_length).</dd>
- ot = f(Xt*(Wo^T) + Ht-1*(Ro^T) + Po (.) Ct + Wbo + Rbo)
- Ht = ot (.) h(Ct)
AttentionWrapp Notations:
`lstm()' - wrapped inner cell.
Ht, Ct = lstm(concat(Xt, ATTNt-1), Ct-1)
`am()` - attention mechanism the wrapper used.
CONTEXTt, ALIGNt = am(Ht, ALIGNt-1)
`AW` - attention layer weights, optional.
`ATTN` - attention state, initial is zero. If `AW` provided, it is the output of the attention layer,
ATTNt = concat(Ht, CONTEXTt) * AW
otherwise,
ATTNt = CONTEXTt
RNN layer output:
`Y` - if needed is the sequence of Ht from lstm cell.
`Y_h` - is the last valid H from lstm cell.
`Y_c` - is the last valid C from lstm cell.
#### Version
This version of the operator has been available since version 1 of the 'com.microsoft' operator set.
#### Attributes
<dl>
<dt><tt>activation_alpha</tt> : list of floats</dt>
<dd>Optional scaling values used by some activation functions. The values are consumed in the order of activation functions, for example (f, g, h) in LSTM. Default values are the same as of corresponding ONNX operators.For example with LeakyRelu, the default alpha is 0.01.</dd>
<dt><tt>activation_beta</tt> : list of floats</dt>
<dd>Optional scaling values used by some activation functions. The values are consumed in the order of activation functions, for example (f, g, h) in LSTM. Default values are the same as of corresponding ONNX operators.</dd>
<dt><tt>activations</tt> : list of strings</dt>
<dd>A list of 3 (or 6 if bidirectional) activation functions for input, output, forget, cell, and hidden. The activation functions must be one of the activation functions specified above. Optional: See the equations for default if not specified.</dd>
<dt><tt>clip</tt> : float</dt>
<dd>Cell clip threshold. Clipping bounds the elements of a tensor in the range of [-threshold, +threshold] and is applied to the input of activations. No clip if not specified.</dd>
<dt><tt>direction</tt> : string</dt>
<dd>Specify if the RNN is forward, reverse, or bidirectional. Must be one of forward (default), reverse, or bidirectional.</dd>
<dt><tt>hidden_size</tt> : int</dt>
<dd>Number of neurons in the hidden layer.</dd>
<dt><tt>input_forget</tt> : int</dt>
<dd>Couple the input and forget gates if 1, default 0.</dd>
<dd>The input sequences packed (and potentially padded) into one 3-D tensor with the shape of `[seq_length, batch_size, input_size]`</dd>
<dt><tt>W</tt> : T</dt>
<dd>The weight tensor for the gates. Concatenation of `W[iofc]` and `WB[iofc]` (if bidirectional) along dimension 0. The tensor has shape `[num_directions, 4*hidden_size, input_size]`.</dd>
<dt><tt>R</tt> : T</dt>
<dd>The recurrence weight tensor. Concatenation of `R[iofc]` and `RB[iofc]` (if bidirectional) along dimension 0. This tensor has shape `[num_directions, 4*hidden_size, hidden_size]`.</dd>
<dt><tt>B</tt> (optional) : T</dt>
<dd>The bias tensor for input gate. Concatenation of `[Wb[iofc], Rb[iofc]]`, and `[WBb[iofc], RBb[iofc]]` (if bidirectional) along dimension 0. This tensor has shape `[num_directions, 8*hidden_size]`. Optional: If not specified - assumed to be 0.</dd>
<dt><tt>sequence_lens</tt> (optional) : T1</dt>
<dd>Optional tensor specifying lengths of the sequences in a batch. If not specified - assumed all sequences in the batch to have length `seq_length`. It has shape `[batch_size]`</dd>
<dt><tt>initial_h</tt> (optional) : T</dt>
<dd>Optional initial value of the hidden. If not specified - assumed to be 0. It has shape `[num_directions, batch_size, hidden_size]`.</dd>
<dt><tt>initial_c</tt> (optional) : T</dt>
<dd>Optional initial value of the cell. If not specified - assumed to be 0. It has shape `[num_directions, batch_size, hidden_size]`.</dd>
<dt><tt>P</tt> (optional) : T</dt>
<dd>The weight tensor for peepholes. Concatenation of `P[iof]` and `PB[iof]` (if bidirectional) along dimension 0. It has shape `[num_directions, 3*hidde_size]`. Optional: If not specified - assumed to be 0.</dd>
<dt><tt>QW</tt> (optional) : T</dt>
<dd>The weight tensor of the query layer in the attention mechanism. Should be of shape `[num_directions, am_query_depth(hidden_size of lstm), am_attn_size]`</dd>
<dt><tt>MW</tt> (optional) : T</dt>
<dd>The weight tensor of the memory layer in the attention mechanism. Should be of shape `[num_directions, memory_depth, am_attn_size]`</dd>
<dt><tt>V</tt> (optional) : T</dt>
<dd>The attention_v tensor in the attention mechanism. Should be of shape `[num_directions, am_attn_size]`</dd>
<dt><tt>M</tt> (optional) : T</dt>
<dd>The sequence of the memory (input) for attention mechanism. Should be of `[batch_size, max_memory_step, memory_depth]`</dd>
<dt><tt>memory_seq_lens</tt> (optional) : T1</dt>
<dd>The sequence length of the input memory for the attention mechanism. Should be of `[batch_size]`</dd>
<dt><tt>AW</tt> (optional) : T</dt>
<dd>The weights of attention layer in the attention wrapper. If exists, should be of shape `[num_directions, memory_depth+hidden_size, aw_attn_size]. Please note that attention mechanism context depth is also memory_depth in the attention mechanism.`</dd>
<dd>The subgraph for the first decoding run. It will be called once before `decoder` subgraph. This is relevant only for the GPT2 model. If this attribute is missing, the `decoder` subgraph will be used for all decoding runs</dd>
<dd>The sequence used as a prompt for the generation. Shape is (batch_size, sequence_length)</dd>
<dt><tt>max_length</tt> : I</dt>
<dd>The maximum length of the sequence to be generated. Shape is (1)</dd>
<dt><tt>min_length</tt> (optional) : I</dt>
<dd>The minimum length below which the score of eos_token_id is set to -Inf. Shape is (1)</dd>
<dt><tt>num_beams</tt> : I</dt>
<dd>Number of beams for beam search. 1 means no beam search. Shape is (1)</dd>
<dt><tt>num_return_sequences</tt> : I</dt>
<dd>The number of returned sequences in the batch. Shape is (1)</dd>
<dt><tt>length_penalty</tt> (optional) : T</dt>
<dd>Exponential penalty to the length. Default value 1.0 means no penalty.Value > 1.0 encourages longer sequences, while values <1.0producesshortersequences.Shapeis(1,)</dd>
<dd>Mask of vocabulary for first step. Words that masked with 0 are not allowed to be generated, and 1 is allowed. Shape is (batch_size, vocab_size)</dd>
<dd>Word IDs of generated sequences. Shape is (batch_size, num_return_sequences, max_sequence_length)</dd>
<dt><tt>sequences_scores</tt> (optional) : T</dt>
<dd>Final beam score of the generated sequences. Shape is (batch_size, num_return_sequences)</dd>
<dt><tt>scores</tt> (optional) : T</dt>
<dd>Processed beam scores for each vocabulary token at each generation step.Beam scores consisting of log softmax scores for each vocabulary token and sum of log softmax of previously generated tokens in this beam.Shape is (max_length - sequence_length, batch_size, num_beams, vocab_size)</dd>
<dd>The ratio of random dropout, with value in [0, 1). If this input was not set, or if it was set to 0, the output would be a simple copy of the input. If it's non-zero, output will be a random dropout of the scaled input, which is typically the case during training. It is an optional value, if not specified it will default to 0.5.</dd>
<dd>If set to true then it indicates dropout is being used for training. It is an optional value hence unless specified explicitly, it is false. If it is false, ratio is ignored and the operation mimics inference mode where nothing will be dropped from the input data and if mask is requested as output it will contain all ones.</dd>
Y = softmax(scores + bias)) with simple broadcast on bias. Intended to specialize softmax(scores + additive_mask) commonly found in transformer models.
#### Version
This version of the operator has been available since version 1 of the 'com.microsoft' operator set.
<dd>apply softmax to elements for dimensions axis or higher</dd>
<dt><tt>is_inner_broadcast</tt> : int (required)</dt>
<dd>true if broadcast bias across input for dimensions broadcast_axis to axis-1, otherwise broadcast bias across input for dimensions 0 to broadcast_axis - 1</dd>
output, dropout_bitmask = Dropout(data + bias, ratio) + residual, Intended to specialize the dropout pattern commonly found in transformer models.
#### Version
This version of the operator has been available since version 1 of the 'com.microsoft' operator set.
#### Attributes
<dl>
<dt><tt>seed</tt> : int</dt>
<dd>(Optional) Seed to the random generator, if not specified we will auto generate one.</dd>
</dl>
#### Inputs (2 - 5)
<dl>
<dt><tt>data</tt> : T</dt>
<dd>The input data as Tensor.</dd>
<dt><tt>bias</tt> : T</dt>
<dd>The bias input, a vector with the same shape as last dim of data OR same shape with data</dd>
<dt><tt>residual</tt> (optional) : T</dt>
<dd>The residual input, must have the same shape as data</dd>
<dt><tt>ratio</tt> (optional) : T1</dt>
<dd>The ratio of random dropout, with value in [0, 1). If this input was not set, or if it was set to 0, the output would be a simple copy of the input. If it's non-zero, output will be a random dropout of the scaled input, which is typically the case during training. It is an optional value, if not specified it will default to 0.5.</dd>
<dt><tt>training_mode</tt> (optional) : T2</dt>
<dd>If set to true then it indicates dropout is being used for training. It is an optional value hence unless specified explicitly, it is false. If it is false, ratio is ignored and the operation mimics inference mode where nothing will be dropped from the input data and if mask is requested as output it will contain all ones.</dd>
BitmaskDropout takes an input floating-point tensor, an optional input ratio (floating-point scalar) and an optional input training_mode (boolean scalar).
It produces two tensor outputs: output (floating-point tensor) and mask (optional `Tensor<uint32>`). If `training_mode` is true then the output Y will be a random dropout.
Note that this Dropout scales the masked input data by the following equation, so to convert the trained model into inference mode, the user can simply not pass `training_mode` input or set it to false.
```
output = scale * data * mask,
```
where
```
scale = 1. / (1. - ratio).
```
This op functions in much the same was as Dropout-11 and Dropout-13 do, execpt that the mask is output as a bit-packed uint32 tensor, instead of a boolean tensor.
#### Version
This version of the operator has been available since version 1 of the 'com.microsoft' operator set.
#### Attributes
<dl>
<dt><tt>seed</tt> : int</dt>
<dd>(Optional) Seed to the random generator, if not specified we will auto generate one.</dd>
</dl>
#### Inputs (1 - 3)
<dl>
<dt><tt>data</tt> : T</dt>
<dd>The input data as Tensor.</dd>
<dt><tt>ratio</tt> (optional) : T1</dt>
<dd>The ratio of random dropout, with value in [0, 1). If this input was not set, or if it was set to 0, the output would be a simple copy of the input. If it's non-zero, output will be a random dropout of the scaled input, which is typically the case during training. It is an optional value, if not specified it will default to 0.5.</dd>
<dt><tt>training_mode</tt> (optional) : T2</dt>
<dd>If set to true then it indicates dropout is being used for training. It is an optional value hence unless specified explicitly, it is false. If it is false, ratio is ignored and the operation mimics inference mode where nothing will be dropped from the input data and if mask is requested as output it will contain all ones.</dd>
Extracts crops from the input image tensor and resizes them using bilinear sampling or nearest neighbor sampling
(possibly with aspect ratio change) to a common output size specified by crop_height and crop_width.
Returns a tensor with crops from the input image at positions defined at the bounding box locations in boxes.
The cropped boxes are all resized (with bilinear or nearest neighbor interpolation) to
a fixed size = [crop_height, crop_width]. The result is a 4-D tensor [num_boxes, crop_height, crop_width, depth].
The resizing is corner aligned.
#### Version
This version of the operator has been available since version 1 of the 'com.microsoft' operator set.
#### Attributes
<dl>
<dt><tt>extrapolation_value</tt> : float</dt>
<dd>Value used for extrapolation, when applicable. Default is 0.0f. </dd>
<dt><tt>mode</tt> : string</dt>
<dd>The pooling method. Two modes are supported: 'bilinear' and 'nearest'. Default is 'bilinear'.</dd>
</dl>
#### Inputs
<dl>
<dt><tt>X</tt> : T1</dt>
<dd>Input data tensor from the previous operator; 4-D feature map of shape (N, C, H, W), where N is the batch size, C is the number of channels, and H and W are the height and the width of the data.</dd>
<dt><tt>rois</tt> : T1</dt>
<dd>RoIs (Regions of Interest) to pool over; rois is 2-D input of shape (num_rois, 4) given as [[y1, x1, y2, x2], ...]. The RoIs' coordinates are normalized in the coordinate system of the input image. Each coordinate set has a 1:1 correspondence with the 'batch_indices' input.</dd>
<dt><tt>batch_indices</tt> : T2</dt>
<dd>1-D tensor of shape (num_rois,) with each element denoting the index of the corresponding image in the batch.</dd>
<dt><tt>crop_size</tt> : T2</dt>
<dd>1-D tensor of 2 elements: [crop_height, crop_width]. All cropped image patches are resized to this size. Both crop_height and crop_width need to be positive.</dd>
</dl>
#### Outputs
<dl>
<dt><tt>Y</tt> : T1</dt>
<dd>RoI pooled output, 4-D tensor of shape (num_rois, C, crop_height, crop_width). The r-th batch element Y[r-1] is a pooled feature map corresponding to the r-th RoI X[r-1].</dd>
This DecoderAttention supports self attention and cross attention, key and value cache, and key_padding_mask. The attention mask is not support at the moment.
Some boolean parameters are passed by runtime input for generic purpose
#### Version
This version of the operator has been available since version 1 of the 'com.microsoft' operator set.
It consumes the raw BFP data and some metadata such as the shape and strides of the original tensor and computes the dequantized tensor.
More documentation on the BFP format can be found in this paper: https://www.microsoft.com/en-us/research/publication/pushing-the-limits-of-narrow-precision-inferencing-at-cloud-scale-with-microsoft-floating-point/
<dd>Each bounding box spans this dimension.Typically, the block dimension corresponds to the reduction dimension of the matrix multipication that consumes the output of this operator.For example, for a 2D matrix multiplication A@W, QuantizeBFP(A) would use block_dim 1 and QuantizeBFP(W) would use block_dim 0.The default is the last dimension.</dd>
The linear dequantization operator. It consumes a quantized data, a scale, a zero point and computes the full precision data.
The dequantization formula is y = (x - x_zero_point) * x_scale.
Scale and zero point must have same shape. They must be either scalar (per tensor) or 1-D tensor (per 'axis').
#### Version
This version of the operator has been available since version 1 of the 'com.microsoft' operator set.
#### Attributes
<dl>
<dt><tt>axis</tt> : int</dt>
<dd>The axis along which same quantization parameters are applied. It's optional.If it's not specified, it means per-tensor quantization and input 'x_scale' and 'x_zero_point' must be scalars.If it's specified, it means per 'axis' quantization and input 'x_scale' and 'x_zero_point' must be 1-D tensors.</dd>
<dd>Scale for input 'x'. It could be a scalar or a 1-D tensor, which means a per-tensor or per-axis quantization.If it's a 1-D tensor, its number of elements should be equal to the dimension value of 'axis' dimension of input 'x'.</dd>
<dd>Zero point for input 'x'. It could be a scalar or a 1-D tensor, which means a per-tensor or per-axis quantization.If it's a 1-D tensor, its number of elements should be equal to the dimension value of 'axis' dimension of input 'x'.</dd>
This version of the operator has been available since version 1 of the 'com.microsoft' operator set.
#### Attributes
<dl>
<dt><tt>activation_alpha</tt> : list of floats</dt>
<dd>Optional scaling values used by some activation functions. The values are consumed in the order of activation functions, for example (f, g, h) in LSTM. Default values are the same as of corresponding ONNX operators.For example with LeakyRelu, the default alpha is 0.01.</dd>
<dt><tt>activation_beta</tt> : list of floats</dt>
<dd>Optional scaling values used by some activation functions. The values are consumed in the order of activation functions, for example (f, g, h) in LSTM. Default values are the same as of corresponding ONNX operators.</dd>
<dt><tt>activations</tt> : list of strings</dt>
<dd>A list of 3 (or 6 if bidirectional) activation functions for input, output, forget, cell, and hidden. The activation functions must be one of the activation functions specified above. Optional: See the equations for default if not specified.</dd>
<dt><tt>clip</tt> : float</dt>
<dd>Cell clip threshold. Clipping bounds the elements of a tensor in the range of [-threshold, +threshold] and is applied to the input of activations. No clip if not specified.</dd>
<dt><tt>direction</tt> : string</dt>
<dd>Specify if the RNN is forward, reverse, or bidirectional. Must be one of forward (default), reverse, or bidirectional.</dd>
<dt><tt>hidden_size</tt> : int</dt>
<dd>Number of neurons in the hidden layer</dd>
<dt><tt>input_forget</tt> : int</dt>
<dd>Couple the input and forget gates if 1.</dd>
</dl>
#### Inputs
<dl>
<dt><tt>X</tt> : T</dt>
<dd>The input sequences packed (and potentially padded) into one 3-D tensor with the shape of `[seq_length, batch_size, input_size]`.</dd>
<dt><tt>W</tt> : T2</dt>
<dd>The weight tensor for the gates. Concatenation of `W[iofc]` and `WB[iofc]` (if bidirectional) along dimension 0. The tensor has shape `[num_directions, input_size, 4*hidden_size]`.</dd>
<dt><tt>R</tt> : T2</dt>
<dd>The recurrence weight tensor. Concatenation of `R[iofc]` and `RB[iofc]` (if bidirectional) along dimension 0. This tensor has shape `[num_directions, hidden_size, 4*hidden_size]`.</dd>
<dt><tt>B</tt> (optional) : T</dt>
<dd>The bias tensor for input gate. Concatenation of `[Wb[iofc], Rb[iofc]]`, and `[WBb[iofc], RBb[iofc]]` (if bidirectional) along dimension 0. This tensor has shape `[num_directions, 8*hidden_size]`. Optional: If not specified - assumed to be 0.</dd>
<dt><tt>sequence_lens</tt> (optional) : T1</dt>
<dd>Optional tensor specifying lengths of the sequences in a batch. If not specified - assumed all sequences in the batch to have length `seq_length`. It has shape `[batch_size]`.</dd>
<dt><tt>initial_h</tt> (optional) : T</dt>
<dd>Optional initial value of the hidden. If not specified - assumed to be 0. It has shape `[num_directions, batch_size, hidden_size]`.</dd>
<dt><tt>initial_c</tt> (optional) : T</dt>
<dd>Optional initial value of the cell. If not specified - assumed to be 0. It has shape `[num_directions, batch_size, hidden_size]`.</dd>
<dt><tt>P</tt> (optional) : T</dt>
<dd>The weight tensor for peepholes. Concatenation of `P[iof]` and `PB[iof]` (if bidirectional) along dimension 0. It has shape `[num_directions, 3*hidde_size]`. Optional: If not specified - assumed to be 0.</dd>
<dt><tt>W_scale</tt> : T</dt>
<dd>W's scale. Its size is [num_directions] for per-tensor/layer quantization, or [num_directions, 4*hidden_size] for per-channel quantization on the axis input_size.</dd>
<dt><tt>W_zero_point</tt> : T2</dt>
<dd>W's zero point. Its size is [num_directions] for per-tensor/layer quantization, or [num_directions, 4*hidden_size] for per-channel quantization on the axis input_size.</dd>
<dt><tt>R_scale</tt> : T</dt>
<dd>R's scale. Its size is [num_directions] for per-tensor/layer quantization, or [num_directions, 4*hidden_size] for per-channel quantization on the axis input_size.</dd>
<dt><tt>R_zero_point</tt> : T2</dt>
<dd>R's zero point. Its size is [num_directions] for per-tensor/layer quantization, or [num_directions, 4*hidden_size] for per-channel quantization on the axis input_size.</dd>
</dl>
#### Outputs (0 - 3)
<dl>
<dt><tt>Y</tt> (optional) : T</dt>
<dd>A tensor that concats all the intermediate output values of the hidden. It has shape `[seq_length, num_directions, batch_size, hidden_size]`. </dd>
<dt><tt>Y_h</tt> (optional) : T</dt>
<dd>The last output value of the hidden. It has shape `[num_directions, batch_size, hidden_size]`.</dd>
<dt><tt>Y_c</tt> (optional) : T</dt>
<dd>The last output value of the cell. It has shape `[num_directions, batch_size, hidden_size]`.</dd>
</dl>
#### Type Constraints
<dl>
<dt><tt>T</tt> : tensor(float)</dt>
<dd>Constrain input and output types to float tensors.</dd>
<dd>Scale of quantized input 'B'. It could be a scalar or a 1-D tensor, which means a per-tensor or per-column quantization. If it's a 1-D tensor, its number of elements should be equal to the number of columns of input 'B'.</dd>
<dt><tt>b_zero_point</tt> (optional) : T2</dt>
<dd>Zero point tensor for input 'B'. It's optional and default value is 0. It could be a scalar or a 1-D tensor, which means a per-tensor or per-column quantization. If it's a 1-D tensor, its number of elements should be equal to the number of columns of input 'B'.</dd>
GELU (Gaussian Error Linear Unit) approximation: Y=0.5*X*(1+tanh(0.797885*X+0.035677*X*X*X)) with an optional input of bias that will be added to X before GELU.
#### Version
This version of the operator has been available since version 1 of the 'com.microsoft' operator set.
<dd>The subgraph for the first decoding run. It will be called once before `decoder` subgraph. This is relevant only for the GPT2 model. If this attribute is missing, the `decoder` subgraph will be used for all decoding runs</dd>
<dd>Mask of vocabulary for first step. Words that masked with 0 are not allowed to be generated, and 1 is allowed. Shape is (batch_size, vocab_size)</dd>
Given an `input` and a flow-field `grid`, computes the `output` using `input` values and pixel locations from `grid`.
Currently, only spatial (4-D) inputs are supported. For `input` with shape (N, C, H, W) and `grid` with shape (N, H_out, W_out, 2),
the `output` will have shape (N, C, H_out, W_out).
For each output location `output[n, :, h, w]`, the size-2 vector `grid[n, h, w]` specifies `input` pixel locations `x` and `y`,
which are used to interpolate the output value `output[n, :, h, w]`.
The GridSample operator is often used in doing grid generator and sampler in the [Spatial Transformer Networks](https://arxiv.org/abs/1506.02025).
See also in [torch.nn.functional.grid_sample](https://pytorch.org/docs/master/generated/torch.nn.functional.grid_sample.html#torch-nn-functional-grid-sample).
#### Version
This version of the operator has been available since version 1 of the 'com.microsoft' operator set.
#### Attributes
<dl>
<dt><tt>align_corners</tt> : int</dt>
<dd>If align_corners=1, the extrema (-1 and 1) are considered as referring to the center points of the input's corner pixels. If align_corners=0, they are instead considered as referring to the corner points of the input's corner pixels, making the sampling more resolution agnostic.</dd>
<dt><tt>mode</tt> : string</dt>
<dd>Three interpolation modes: bilinear (default), nearest and bicubic.</dd>
<dt><tt>padding_mode</tt> : string</dt>
<dd>Support padding modes for outside grid values: `zeros`(default), `border`, `reflection`. zeros: use 0 for out-of-bound grid locations, border: use border values for out-of-bound grid locations, reflection: use values at locations reflected by the border for out-of-bound grid locations.</dd>
</dl>
#### Inputs
<dl>
<dt><tt>X</tt> : T1</dt>
<dd>4-D tensor of shape (N, C, H, W), where N is the batch size, C is the numbers of channels, H and W are the height and width of the input data.</dd>
<dt><tt>Grid</tt> : T1</dt>
<dd>Input offset, 4-D tensor of shape (N, H_out, W_out, 2), where H_out and W_out are the height and width of grid and output, Grid specifies the sampling pixel locations normalized by the input spatial dimensions. Therefore, it should have most values in the range of [-1, 1]. If grid has values outside the range of [-1, 1], the corresponding outputs will be handled as defined by padding_mode.</dd>
</dl>
#### Outputs
<dl>
<dt><tt>Y</tt> : T2</dt>
<dd>4-D tensor of shape (N, C, H_out, W_out).</dd>
<dd>Constrain output Y data types as 32-bit integer tensor.T3 must be tensor(uint32) when both T1 and T2 are tensor(uint16),or must be tensor(int32) when either T1 or T2 is tensor(int16).</dd>
This version of the operator has been available since version 1 of the 'com.microsoft' operator set.
#### Inputs (4 - 7)
<dl>
<dt><tt>A</tt> : T1</dt>
<dd>N-dimensional matrix A</dd>
<dt><tt>B</tt> : T2</dt>
<dd>N-dimensional matrix B</dd>
<dt><tt>a_scale</tt> : T3</dt>
<dd>Scale of quantized input 'A'. It could be a scalar or a 1-D tensor, which means a per-tensor or per-column quantization. If it's a 1-D tensor, its number of elements should be equal to the number of columns of input 'A'.</dd>
<dt><tt>b_scale</tt> : T3</dt>
<dd>Scale of quantized input 'B'. It could be a scalar or a 1-D tensor, which means a per-tensor or per-column quantization. If it's a 1-D tensor, its number of elements should be equal to the number of columns of input 'B'.</dd>
<dt><tt>a_zero_point</tt> (optional) : T1</dt>
<dd>Zero point tensor for input 'A'. It's optional and default value is 0. It could be a scalar or a 1-D tensor, which means a per-tensor or per-column quantization. If it's a 1-D tensor, its number of elements should be equal to the number of columns of input 'A'.</dd>
<dt><tt>b_zero_point</tt> (optional) : T2</dt>
<dd>Zero point tensor for input 'B'. It's optional and default value is 0. It could be a scalar or a 1-D tensor, which means a per-tensor or per-column quantization. If it's a 1-D tensor, its number of elements should be equal to the number of columns of input 'B'.</dd>
<dt><tt>bias</tt> (optional) : T3</dt>
<dd>1D input tensor, whose dimension is same as B's last dimension</dd>
The underlying implementation is MurmurHash3_x86_32 generating low latency 32bits hash suitable for implementing lookup tables, Bloom filters, count min sketch or feature hashing.
#### Version
This version of the operator has been available since version 1 of the 'com.microsoft' operator set.
#### Attributes
<dl>
<dt><tt>positive</tt> : int</dt>
<dd>If value is 1, output type is uint32_t, else int32_t. Default value is 1.</dd>
<dt><tt>seed</tt> : int</dt>
<dd>Seed for the hashing algorithm, unsigned 32-bit integer, default to 0.</dd>
This version of the operator has been available since version 1 of the 'com.microsoft' operator set.
#### Attributes
<dl>
<dt><tt>auto_pad</tt> : string</dt>
<dd></dd>
<dt><tt>dilations</tt> : list of ints</dt>
<dd>dilation value along each spatial axis of the filter. If not present, the dilation defaults is 1 along each spatial axis.</dd>
<dt><tt>group</tt> : int</dt>
<dd>number of groups input channels and output channels are divided into.</dd>
<dt><tt>kernel_shape</tt> : list of ints</dt>
<dd>The shape of the convolution kernel. If not present, should be inferred from input W.</dd>
<dt><tt>pads</tt> : list of ints</dt>
<dd></dd>
<dt><tt>strides</tt> : list of ints</dt>
<dd>Stride along each spatial axis. If not present, the stride defaults is 1 along each spatial axis.</dd>
</dl>
#### Inputs (2 - 3)
<dl>
<dt><tt>X</tt> : T</dt>
<dd>Input data tensor from previous layer; has size (N x C x H x W), where N is the batch size, C is the number of channels, and H and W are the height and width. Note that this is for the 2D image. Otherwise the size is (N x C x D1 x D2 ... x Dn). Optionally, if dimension denotation is in effect, the operation expects input data tensor to arrive with the dimension denotation of [DATA_BATCH, DATA_CHANNEL, DATA_FEATURE, DATA_FEATURE ...].</dd>
<dt><tt>W</tt> : T</dt>
<dd>The weight tensor that will be used in the convolutions; has size (M x C/group x kH x kW), where C is the number of channels, and kH and kW are the height and width of the kernel, and M is the number of feature maps. For more than 2 dimensions, the kernel shape will be (M x C/group x k1 x k2 x ... x kn), where (k1 x k2 x ... kn) is the dimension of the kernel. Optionally, if dimension denotation is in effect, the operation expects the weight tensor to arrive with the dimension denotation of [FILTER_OUT_CHANNEL, FILTER_IN_CHANNEL, FILTER_SPATIAL, FILTER_SPATIAL ...]. Assuming zero based indices for the shape array, X.shape[1] == (W.shape[1] * group) == C and W.shape[0] mod G == 0. Or in other words FILTER_IN_CHANNEL multiplied by the number of groups should be equal to DATA_CHANNEL and the number of feature maps M should be a multiple of the number of groups G.</dd>
<dt><tt>B</tt> (optional) : T</dt>
<dd>Optional 1D bias to be added to the convolution, has size of M.</dd>
</dl>
#### Outputs
<dl>
<dt><tt>Y</tt> : T</dt>
<dd>Output data tensor that contains the result of the convolution. The output dimensions are functions of the kernel size, stride size, and pad lengths.</dd>
This version of the operator has been available since version 1 of the 'com.microsoft' operator set.
#### Attributes
<dl>
<dt><tt>mode</tt> : string</dt>
<dd>Three modes: `constant`(default) - pads with a given constant value, `reflect` - pads with the reflection of the vector mirrored on the first and last values of the vector along each axis, `edge` - pads with the edge values of array</dd>
</dl>
#### Inputs (2 - 3)
<dl>
<dt><tt>data</tt> : T</dt>
<dd>Input tensor.</dd>
<dt><tt>pads</tt> : tensor(int64)</dt>
<dd>Tensor of integers indicating the number of padding elements to add or remove (if negative) at the beginning and end of each axis. For 2D input tensor, it is the number of pixels. `pads` should be a 1D tensor of shape [2 * input_rank] or a 2D tensor of shape [1, 2 * input_rank]. `pads` format (1D example) should be as follow [x1_begin, x2_begin,...,x1_end, x2_end,...], where xi_begin is the number of pixels added at the beginning of axis `i` and xi_end, the number of pixels added at the end of axis `i`.</dd>
<dt><tt>value</tt> (optional) : T</dt>
<dd>(Optional) A scalar or rank 1 tensor containing a single value to be filled if the mode chosen is `constant` (by default it is 0.0).</dd>
<dd>scale of weight scale. It's a scalar or a 1D tensor, which means a per-tensor/per-column quantization.Its size should be 3 * hidden_size if it is per-column quantization</dd>
<dd>zero point of quantized weight tensor. It's a scalar or a 1D tensor, which means a per-tensor/per-column quantization.Its size should be 3 * hidden_size if it is per-column quantization</dd>
This version of the operator has been available since version 1 of the 'com.microsoft' operator set.
#### Attributes
<dl>
<dt><tt>alpha</tt> : float</dt>
<dd>Scalar multiplier for the product of input tensors A * B.</dd>
<dt><tt>transA</tt> : int</dt>
<dd>Whether A should be transposed</dd>
<dt><tt>transB</tt> : int</dt>
<dd>Whether B should be transposed</dd>
</dl>
#### Inputs (6 - 9)
<dl>
<dt><tt>A</tt> : TA</dt>
<dd>Input tensor A. The shape of A should be (M, K) if transA is 0, or (K, M) if transA is non-zero.</dd>
<dt><tt>a_scale</tt> : T</dt>
<dd>Scale of quantized input 'A'. It is a scalar,which means a per-tensor quantization.</dd>
<dt><tt>a_zero_point</tt> : TA</dt>
<dd>Zero point tensor for input 'A'. It is a scalar.</dd>
<dt><tt>B</tt> : TB</dt>
<dd>Input tensor B. The shape of B should be (K, N) if transB is 0, or (N, K) if transB is non-zero.</dd>
<dt><tt>b_scale</tt> : T</dt>
<dd>Scale of quantized input 'B'. It could be a scalar or a 1-D tensor, which means a per-tensor or per-column quantization. If it's a 1-D tensor, its number of elements should be equal to the number of columns of input 'B'.</dd>
<dt><tt>b_zero_point</tt> : TB</dt>
<dd>Zero point tensor for input 'B'. It's optional and default value is 0. It could be a scalar or a 1-D tensor, which means a per-tensor or per-column quantization. If it's a 1-D tensor, its number of elements should be equal to the number of columns of input 'B'.</dd>
<dt><tt>C</tt> (optional) : TC</dt>
<dd>Optional input tensor C. If not specified, the computation is done as if C is a scalar 0. The shape of C should be unidirectional broadcastable to (M, N). Its type is int32_t and must be quantized with zero_point = 0 and scale = alpha / beta * a_scale * b_scale.</dd>
<dt><tt>y_scale</tt> (optional) : T</dt>
<dd>Scale of output 'Y'. It is a scalar, which means a per-tensor quantization. It is optional. The output is full precision(float32) if it is not provided. Or the output is quantized.</dd>
<dt><tt>y_zero_point</tt> (optional) : TYZ</dt>
<dd>Zero point tensor for output 'Y'. It is a scalar, which means a per-tensor quantization. It is optional. The output is full precision(float32) if it is not provided. Or the output is quantized.</dd>
The output of each pooling window is divided by the number of elements (exclude pad when attribute count_include_pad is zero).
Input and output scales and zero points are used to convert the output to a new quantization range.
Output = Dequantize(Input) -> AveragePool on fp32 data -> Quantize(output)
#### Version
This version of the operator has been available since version 1 of the 'com.microsoft' operator set.
#### Attributes
<dl>
<dt><tt>auto_pad</tt> : string</dt>
<dd>auto_pad must be either NOTSET, SAME_UPPER, SAME_LOWER or VALID. Where default value is NOTSET, which means explicit padding is used. SAME_UPPER or SAME_LOWER mean pad the input so that the output spatial size match the input.In case of odd number add the extra padding at the end for SAME_UPPER and at the beginning for SAME_LOWER. VALID mean no padding.</dd>
<dt><tt>ceil_mode</tt> : int</dt>
<dd>Whether to use ceil or floor (default) to compute the output shape.</dd>
<dd>Whether include pad pixels when calculating values for the edges. Default is 0, doesn't count include pad.</dd>
<dt><tt>kernel_shape</tt> : list of ints (required)</dt>
<dd>The size of the kernel along each axis.</dd>
<dt><tt>pads</tt> : list of ints</dt>
<dd>Padding for the beginning and ending along each spatial axis, it can take any value greater than or equal to 0. The value represent the number of pixels added to the beginning and end part of the corresponding axis. `pads` format should be as follow [x1_begin, x2_begin...x1_end, x2_end,...], where xi_begin the number of pixels added at the beginning of axis `i` and xi_end, the number of pixels added at the end of axis `i`. This attribute cannot be used simultaneously with auto_pad attribute. If not present, the padding defaults to 0 along start and end of each spatial axis.</dd>
<dt><tt>strides</tt> : list of ints</dt>
<dd>Stride along each spatial axis. If not present, the stride defaults to 1 along each spatial axis.</dd>
</dl>
#### Inputs (4 - 5)
<dl>
<dt><tt>X</tt> : T</dt>
<dd>Input data tensor from the previous operator; dimensions for image case are (N x C x H x W), where N is the batch size, C is the number of channels, and H and W are the height and the width of the data. For non image case, the dimensions are in the form of (N x C x D1 x D2 ... Dn), where N is the batch size. Optionally, if dimension denotation is in effect, the operation expects the input data tensor to arrive with the dimension denotation of [DATA_BATCH, DATA_CHANNEL, DATA_FEATURE, DATA_FEATURE ...].</dd>
<dt><tt>x_scale</tt> : tensor(float)</dt>
<dd>Input scale. It's a scalar, which means a per-tensor/layer quantization.</dd>
<dt><tt>x_zero_point</tt> (optional) : T</dt>
<dd>Input zero point. Default value is 0 if it's not specified. It's a scalar, which means a per-tensor/layer quantization.</dd>
<dt><tt>y_scale</tt> : tensor(float)</dt>
<dd>Output scale. It's a scalar, which means a per-tensor/layer quantization.</dd>
<dt><tt>y_zero_point</tt> (optional) : T</dt>
<dd>Output zero point. Default value is 0 if it's not specified. It's a scalar, which means a per-tensor/layer quantization.</dd>
</dl>
#### Outputs
<dl>
<dt><tt>Y</tt> : T</dt>
<dd>Output data tensor from average or max pooling across the input tensor. Dimensions will vary based on various kernel, stride, and pad sizes. Floor value of the dimension is used</dd>
</dl>
#### Type Constraints
<dl>
<dt><tt>T</tt> : tensor(uint8), tensor(int8)</dt>
<dd>Constrain input and output types to 8 bit tensors.</dd>
Concatenate a list of tensors into a single tensor.All input tensors must have the same shape, except for the dimension size of the axis to concatenate on.
#### Version
This version of the operator has been available since version 1 of the 'com.microsoft' operator set.
QLinearGlobalAveragePool consumes an input tensor X and applies Average pooling across
the values in the same channel. This is equivalent to AveragePool with kernel size
equal to the spatial dimension of input tensor. Input is of type uint8_t or int8_t.
#### Version
This version of the operator has been available since version 1 of the 'com.microsoft' operator set.
#### Attributes
<dl>
<dt><tt>channels_last</tt> : int</dt>
<dd></dd>
</dl>
#### Inputs
<dl>
<dt><tt>X</tt> : T</dt>
<dd>Input data tensor from the previous operator; According to channels_last, dimensions for image case are (N x C x H x W), or (N x H x W x C) where N is the batch size, C is the number of channels, and H and W are the height and the width of the data. For non image case, the dimensions are in the form of (N x C x D1 x D2 ... Dn), or (N x D1 X D2 ... Dn x C) where N is the batch size.</dd>
<dt><tt>x_scale</tt> : tensor(float)</dt>
<dd>Scale of quantized input 'X'. It must be a scalar.</dd>
<dt><tt>x_zero_point</tt> : T</dt>
<dd>Zero point tensor for input 'X'. It must be a scalar.</dd>
<dt><tt>y_scale</tt> : tensor(float)</dt>
<dd>Scale of quantized output 'Y'. It must be a scalar.</dd>
<dt><tt>y_zero_point</tt> : T</dt>
<dd>Zero point tensor for output 'Y'. It must be a scalar.</dd>
</dl>
#### Outputs
<dl>
<dt><tt>Y</tt> : T</dt>
<dd>Output data tensor from pooling across the input tensor. The output tensor has the same rank as the input. with the N and C value keep it value, while the otherdimensions are all 1.</dd>
More documentation on the BFP format can be found in this paper: https://www.microsoft.com/en-us/research/publication/pushing-the-limits-of-narrow-precision-inferencing-at-cloud-scale-with-microsoft-floating-point/
<dd>Each bounding box spans this dimension.Typically, the block dimension corresponds to the reduction dimension of the matrix multipication that consumes the output of this operator.For example, for a 2D matrix multiplication A@W, QuantizeBFP(A) would use block_dim 1 and QuantizeBFP(W) would use block_dim 0.The default is the last dimension.</dd>
The linear quantization operator. It consumes a full precision data, a scale, a zero point to compute the low precision / quantized tensor.
The quantization formula is y = saturate ((x / y_scale) + y_zero_point).For saturation, it saturates to [0, 255] if it's uint8, or [-128, 127] if it's int8.
For (x / y_scale), it's rounding to nearest ties to even. Refer to https://en.wikipedia.org/wiki/Rounding for details.
Scale and zero point must have same shape. They must be either scalar (per tensor) or 1-D tensor (per 'axis').
This version of the operator has been available since version 1 of the 'com.microsoft' operator set.
#### Attributes
<dl>
<dt><tt>axis</tt> : int</dt>
<dd>The axis along which same quantization parameters are applied. It's optional.If it's not specified, it means per-tensor quantization and input 'x_scale' and 'x_zero_point' must be scalars.If it's specified, it means per 'axis' quantization and input 'x_scale' and 'x_zero_point' must be 1-D tensors.</dd>
</dl>
#### Inputs
<dl>
<dt><tt>x</tt> : T1</dt>
<dd>N-D full precision Input tensor to be quantized.</dd>
<dt><tt>y_scale</tt> : T1</dt>
<dd>Scale for doing quantization to get 'y'. It could be a scalar or a 1-D tensor,which means a per-tensor or per-axis quantization. If it's a 1-D tensor, its number of elements should be equal to the dimension value of 'axis' dimension of input 'x'.</dd>
<dt><tt>y_zero_point</tt> : T2</dt>
<dd>Zero point for doing quantization to get 'y'. It could be a scalar or a 1-D tensor, which means a per-tensoror per-axis quantization. If it's a 1-D tensor, its number of elements should be equal to the dimension value of 'axis' dimension of input 'x'.</dd>
</dl>
#### Outputs
<dl>
<dt><tt>y</tt> : T2</dt>
<dd>N-D quantized output tensor. It has same shape as input 'x'.</dd>
<dd>Constrain output data type to 32-bit integer tensor.T2 must be tensor(uint32) when T1 is tensor(uint8),or must be tensor(int32) when T1 is tensor(int8).</dd>
This version of the operator has been available since version 1 of the 'com.microsoft' operator set.
#### Attributes
<dl>
<dt><tt>custom</tt> : int</dt>
<dd>If 1 custom sampling logic</dd>
<dt><tt>decoder</tt> : graph (required)</dt>
<dd>Decoder subgraph to execute in a loop.</dd>
<dt><tt>decoder_start_token_id</tt> : int</dt>
<dd>The id of the token that indicates decoding starts.</dd>
<dt><tt>encoder</tt> : graph</dt>
<dd>The subgraph for initialization of encoder and decoder. It will be called once before decoder subgraph.</dd>
<dt><tt>eos_token_id</tt> : int (required)</dt>
<dd>The id of the end-of-sequence token</dd>
<dt><tt>filter_value</tt> : float</dt>
<dd>All filtered values will be set to this float value.</dd>
<dt><tt>init_decoder</tt> : graph</dt>
<dd>The subgraph for the first decoding run. It will be called once before `decoder` subgraph. This is relevant only for the GPT2 model. If this attribute is missing, the `decoder` subgraph will be used for all decoding runs</dd>
<dt><tt>min_tokens_to_keep</tt> : int</dt>
<dd>Minimumber of tokens we keep per batch example in the output.</dd>
<dt><tt>model_type</tt> : int</dt>
<dd>Model type: 0 for decoder only like GPT-2; 1 for encoder decoder like Bart</dd>
<dt><tt>no_repeat_ngram_size</tt> : int</dt>
<dd>no repeat ngrams size</dd>
<dt><tt>pad_token_id</tt> : int (required)</dt>
<dd>The id of the padding token</dd>
<dt><tt>presence_penalty</tt> : float</dt>
<dd>Presence penalty for custom sampling</dd>
<dt><tt>temperature</tt> : float</dt>
<dd>The value used to module the next token probabilities.</dd>
<dt><tt>top_p</tt> : float</dt>
<dd>If set to float <1,onlythesmallestsetofmostprobabletokenswithprobabilitiesthataddupto`top_p`orhigherarekeptforgeneration.</dd>
<dt><tt>vocab_size</tt> : int</dt>
<dd>Size of the vocabulary. If not provided, it will be inferred from the decoder subgraph's output shape</dd>
<dd>Mask of vocabulary for first step. Words that masked with 0 are not allowed to be generated, and 1 is allowed. Shape is (batch_size, vocab_size)</dd>
<dt><tt>attention_mask</tt> (optional) : I</dt>
<dd>Custom attention mask. Shape is (batch_size, sequence_length)</dd>
<dt><tt>presence_mask</tt> (optional) : I</dt>
<dd>Presence penalty mask. Shape is (batch_size, vocab_size)</dd>
whose shape is [2, 5] because you can find at most 5 tokens per input string.
Note that the input at most can have two axes, so 3-D and higher dimension are not supported.
If "separators" contains a single empty string, the Tokenizer will enter into character tokenezation mode. This means all strings
will be broken part into individual characters.
For each input string, the second mode searches matches of "tokenexp" and each match will be a token in Y.
The matching of "tokenexp" is conducted greedily (i.e., a match should be as long as possible).
This operator searches for the first match starting from the beginning of the considered string,
and then launches another search starting from the first remained character after the first matched token.
If no match found, this operator will remove the first character from the remained string and do another search.
This procedure will be repeated until reaching the end of the considered string.
Let's consider another example to illustrate the effect of setting "mark" to true.
If input is ["Hello", "World"],
then the corresponding output would be [0x02, "Hello", "World", 0x03].
This implies that if mark is true, [C]/[N, C] - input's output shape becomes [C, D+2]/[N, C, D+2].
If tokenizer removes the entire content of [C]-input, it will produce [[]].
I.e. the output shape should be [C][0] or [N][C][0] if input shape was [N][C].
If the tokenizer receives empty input of [0] then the output is [0] if empty input
of [N, 0] then [N, 0].
#### Version
This version of the operator has been available since version 1 of the 'com.microsoft' operator set.
#### Attributes
<dl>
<dt><tt>mark</tt> : int (required)</dt>
<dd>Boolean whether to mark the beginning/end character with start of text character (0x02)/end of text character (0x03).</dd>
<dt><tt>mincharnum</tt> : int (required)</dt>
<dd>Minimum number of characters allowed in the output. For example, if mincharnum is 2, tokens such as "A" and "B" would be ignored</dd>
<dt><tt>pad_value</tt> : string (required)</dt>
<dd>The string used to pad output tensors when the tokens extracted doesn't match the maximum number of tokens found. If start/end markers are needed, padding will appear outside the markers.</dd>
<dt><tt>separators</tt> : list of strings</dt>
<dd>an optional list of strings attribute that contains a list of separators - regular expressions to match separators Two consecutive segments in X connected by a separator would be divided into two tokens. For example, if the input is "Hello World!" and this attribute contains only one space character, the corresponding output would be ["Hello", "World!"]. To achieve character-level tokenization, one should set the 'separators' to [""], which contains an empty string.</dd>
<dd>An optional string. Token's regular expression in basic POSIX format (pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03). If set, tokenizer may produce tokens matching the specified pattern. Note that one and only of 'tokenexp' and 'separators' should be set.</dd>
<dd>A 0-D scalar tensor. If specified, the entries at `padding_idx` do not contribute to the gradient; therefore, the embedding vector at `padding_idx` is not updated during training, i.e. it remains as a fixed pad.</dd>
<dd>A 0-D bool tensor. If given, this will scale gradients by the inverse of frequency of the indices (words) in the mini-batch. Default is ``False``</dd>
</dl>
#### Outputs
<dl>
<dt><tt>Y</tt> : T</dt>
<dd>Output tensor of the same type as the input tensor. Shape of the output is * x M, where '*' is the shape of input indices, and 'M' is the embedding size.</dd>
<dd>A 0-D tensor containing a single value corresponding to the number diagonals above or the main diagonal to exclude or include.Default value is 0 if it's not specified.</dd>
<dd>This operator applies convolution to word from left to right with window equal to conv_window_size and stride to 1.Take word 'example' for example, with conv_window_size equal to 2, conv is applied to [ex],[xa], [am], [mp]...If not provide, use the first dimension of conv kernel shape.</dd>