onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-05-17 21:10:43 +00:00

Author	SHA1	Message	Date
stevenlix	270c09a37f	Add timestamp logits processor for whisper (#15853 ) Enable timestamp estimation and logits processing for Whisper model.	2023-05-16 21:40:00 -07:00
kunal-vaishnavi	5b663d6797	Whisper Multitask and Multilingual (#15936 )	2023-05-15 14:36:33 -07:00

### Description

This PR enables Whisper's multitask format and allows a user to use Whisper for multiple tasks (e.g. transcription, translation) and for multilingual purposes (e.g. English, Spanish). This PR also removes `attention_mask` as a required input for Whisper with beam search.

### Usage

Here is an example of how you can use Whisper for English transcription.

```
import numpy as np
import onnxruntime as ort

from datasets import load_dataset
from transformers import AutoConfig, AutoProcessor

model = "openai/whisper-tiny"
config = AutoConfig.from_pretrained(model)
processor = AutoProcessor.from_pretrained(model)

forced_decoder_ids = processor.get_decoder_prompt_ids(language="english", task="transcribe")
# forced_decoder_ids is of the format [(1, 50259), (2, 50359), (3, 50363)] and needs to be
# of the format [50258, 50259, 50359, 50363] where 50258 is the start token id
forced_decoder_ids = [config.decoder_start_token_id] + list(map(lambda token: token[1], forced_decoder_ids))

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
input_features = processor(ds[0]["audio"]["array"], return_tensors="np").input_features

inputs = {
    "input_features": np.float32(input_features),
    "max_length": np.array([26], dtype=np.int32),
    "min_length": np.array([1], dtype=np.int32),
    "num_beams": np.array([2], dtype=np.int32),
    "num_return_sequences": np.array([1], dtype=np.int32),
    "length_penalty": np.array([1.0], dtype=np.float32),
    "repetition_penalty": np.array([1.0], dtype=np.float32),
    "decoder_input_ids": np.array([forced_decoder_ids], dtype=np.int32),
}

sess = ort.InferenceSession("whisper-tiny_beamsearch.onnx", providers=["CPUExecutionProvider"])
outputs = sess.run(None, inputs)

# Print tokens and decoded output
print(outputs[0][0][0])
print(processor.decode(outputs[0][0][0]))
```

If you don't want to provide specific decoder input ids or you want Whisper to predict the output language and task, you can set `forced_decoder_ids = [config.decoder_start_token_id]` instead.

### Motivation and Context

As seen in the figure below from the [OpenAI Whisper paper](https://cdn.openai.com/papers/whisper.pdf), Whisper can be used for multiple tasks and languages.

![Screenshot 2023-05-12 165215](https://github.com/microsoft/onnxruntime/assets/115581922/49335e39-a79c-4f78-92e9-89b034405f65)
Ye Wang	3418ca28a8	pack qkv in t5 decoder (#15801 ) ### Description V100, b_4_s_128, max_output_len=64, beam=4 before: t5_small: 101.28ms t5_base: 200.07ms after: t5_small: 87.65ms t5_base: 174.44ms --------- Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2023-05-15 13:45:39 -07:00
kunal-vaishnavi	39d6d7050d	Change EmbedLayerNormalization mask index output to optional (#15526 ) ### Description This PR changes an EmbedLayerNormalization node's mask index output to be an optional output if a mask input is not provided. ### Motivation and Context The documentation for EmbedLayerNormalization states ``` The last input mask is optional. If mask is provided, mask index (that is position of first 0 in mask, or number of words) will be calculated. ``` However, if the mask input is not provided, the mask index output is still calculated and required.	2023-04-27 16:32:42 -07:00
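The mask index quoted from the EmbedLayerNormalization documentation can be illustrated with a small sketch. The helper name `mask_index` is hypothetical and is not an onnxruntime API; the sketch assumes a single 1-D mask of 0/1 values padded on the right.

```python
def mask_index(mask):
    """Return the position of the first 0 in the mask, or the mask length if there is none."""
    for position, value in enumerate(mask):
        if value == 0:
            return position
    # No padding found: every entry is a real word.
    return len(mask)


if __name__ == "__main__":
    print(mask_index([1, 1, 1, 0, 0]))
    print(mask_index([1, 1, 1, 1]))
```

When no mask is supplied there is nothing to scan, which is why making the output optional in that case is a natural fit.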
Patrice Vignola	3be5bfe363	[DML EP] Add MatMul + SoftMax fusion (#15240 )	2023-04-11 08:31:04 -07:00
stevenlix	6d126f8996	Add FP16 support for Whisper model (#15427 ) Current ORT can only run inference for Whisper FP32 model. This PR adds FP16 support.	2023-04-08 21:36:10 -07:00
Chen Fu	8dce83a818	Fuse 'Add' operator into FP16 Conv (#15213 ) ### Description Add 'Add' functionality to the FP16 Conv operator. It takes a tensor that has the same shape as the output tensor and adds it to the result tensor. ### Motivation and Context Needed to run ResNet-50	2023-04-07 09:51:03 -07:00
petermcaughan	1251964f96	Petermca/beamsearch whisper (#15339 ) ### Description Adjust various code paths to allow Whisper model to function with BeamSearch op. Approach: Add a new kModelType enum value in IGenerationParameters as so: #### Old: 0 = GPT2, 1 = T5 #### New: 0 = GPT2, 1 = T5, 2 = Whisper When the user assigns this attribute value to 2, various shape and type checks are changed to accommodate Whisper inputs. ### Motivation and Context BeamSearch is currently designed to function with BERT-based models with inputs as vocab tokens, and needs changes to function with Whisper inputs (3-D float values processed from audio data). --------- Co-authored-by: Peter McAughan <petermca@microsoft.com>	2023-04-04 09:09:10 -07:00
Ye Wang	fbfe92f66a	DecoderMaskedMultiHeadAttention enhancement (#15292 )	2023-04-02 21:53:03 -07:00
Yufeng Li	c08d6b42e8	Add tool to support packing mode for BERT model (#15283 ) ### Description Add a tool to convert a fused BERT-like model to packing mode	2023-03-31 08:46:47 -07:00
Ye Wang	44ba23e0f5	Rename DecoderMaskedMHA to DecoderMaskedSelfAttn (#15166 ) ### Description As synced offline, rename this op and will create another op for mha that supports both self and cross attention. --------- Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2023-03-23 12:31:38 -07:00
Ye Wang	2ee822d483	Extend memory efficient attention coverage in Attention/MHA cuda op (#15064 ) ### Description <!-- Describe your changes. --> 1. upgrade cutlass to 3.0 that containing attn_bias support. 2. extend Attention/MHA to use memory efficient attention when rel_pos_bias with [1, num_head, s, s] and 1d mask with [2 batch_size + 1] are present. new mask format introduction: MASK_1D_KEY_SEQ_LEN_START, [3 * batch_size + 2] with [key_len[0], ..., key_len[batch_size - 1], query_start[0], ..., query_start[batch_size - 1], query_end[batch_size - 1], key_start[0], ..., key_start[batch_size - 1], key_end[batch_size - 1]] e.g 2D mask with [[1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 0]] converts to this 1D mask is [3, 5, 0, 6, 12, 0, 6, 12] ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> It potentially benefits tnlrv6 and t5(encoder) --------- Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net> Co-authored-by: Kunal Vaishnavi <kvaishnavi@microsoft.com> Co-authored-by: Kunal Vaishnavi <kvaishnavi@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>	2023-03-23 11:05:17 -07:00
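The 2D to 1D mask conversion in the example above can be sketched as follows. The function name is hypothetical and not part of onnxruntime; the sketch assumes a right-padded 0/1 mask and self-attention, so queries and keys share the same padded length and the start offsets are multiples of that length.

```python
def to_key_seq_len_start_mask(mask_2d):
    """Build [key_len..., query_start..., query_end, key_start..., key_end]
    (3 * batch_size + 2 entries) from a right-padded 2D 0/1 mask."""
    batch_size = len(mask_2d)
    padded_len = len(mask_2d[0])
    # Number of valid (non-padded) keys in each sequence.
    key_len = [sum(row) for row in mask_2d]
    # Offset of each sequence in the flattened [batch_size * padded_len] layout.
    starts = [i * padded_len for i in range(batch_size)]
    end = batch_size * padded_len
    return key_len + starts + [end] + starts + [end]


if __name__ == "__main__":
    mask = [[1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 0]]
    print(to_key_seq_len_start_mask(mask))
```

For the mask in the commit message this reproduces the documented result `[3, 5, 0, 6, 12, 0, 6, 12]`.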
Hariharan Seshadri	7033346605	Support mask_filter_value attribute in DecoderMaskedMultiheadAttention (#15158 )	2023-03-23 11:00:09 -07:00
Yufeng Li	c7ced7a5e9	Add PackedAttention for packing mode (#14858 ) ### Description Transformer models can handle a batch of inputs at once. However, sequences in a batch usually have different lengths, so the shorter ones must be padded to the length of the longest. This is inefficient, especially for large batches with high length variance. This PR introduces a PackedAttention operator that takes in packed sequences (no padding) and also produces output in packing mode. There will be another PR to use PackedAttention to implement the encoder in packing mode.	2023-03-21 12:59:29 -07:00
Hariharan Seshadri	ed7ab1660d	[CUDA] Add option to use DecoderMaskedMultiheadAttention in BeamSearch (#14990 )	2023-03-15 17:16:32 -07:00
Ye Wang	538d64891a	[t5 optimization] kernel changes to t5 (#14928 ) ### Description 1. support optional bias in Attention op (used in T5 encoder) 2. support broadcasting rel_pos_bias in attention_softmax.h 3. add scale in MHA op's attributes 4. support past_key/past_value and present_key/present_value in MHA 5. UT and parity tests are added 6. fix an issue: https://github.com/microsoft/onnxruntime/issues/14920 note: the fusions will be in another PR since mt5 needs to be tested and an issue from github will be investigated. Future works: 1. support shared buffer for past/present 2. enable trt kernels when possible and investigate (trt/cutlass) kernels with rel_pos_bias 3. support KV/QKV packing with past/present --------- Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2023-03-13 14:29:16 -07:00
Hariharan Seshadri	112a4d215a	[CUDA] Support decoding multihead self-attention implementation (#14848 )	2023-03-08 09:17:54 -08:00
Ye Wang	58da3cacdf	support NeoX-style rotary embedding (#14785 ) Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2023-02-22 18:21:34 -08:00
Tianlei Wu	eb2ac72fa9	Stable Diffusion CUDA Optimizations Part 4 (#14680 ) (1) Support packed QKV format in MultiHeadAttention. This format could avoid add bias transpose when the TRT fused kernel is used. (2) Add cache for cumulative sequence length computation. For SD, it only needs to be computed once since the sequence length is fixed. (3) Do not allocate qkv workspace to save memory for packed KV or QKV. (4) Add unit tests for packed kv and packed qkv format in MultiHeadAttention (5) Mark some fusion options for SD only Performance tests show a slight improvement on T4. Average latency is reduced by 0.15 seconds (from 5.25s to 5.10s) for 512x512 in 50 steps for SD 1.5 models. Memory usage drops from 5.1GB to 4.8GB.	2023-02-15 14:55:42 -08:00
Tianlei Wu	f638c5a2ae	Stable Diffusion CUDA Optimizations Part 3 (#14646 ) The third part for stable diffusion CUDA optimizations (1) Add BiasAdd operator to replace two Add (bias and residual); Add fusion for BiasAdd (2) Add Attention fusion for VAE decoder. (3) Update float16 conversion to handle Resize and GroupNorm. This could reduce two Cast nodes for each Resize op in fp16 model. (4) Force inputs and outputs to be float16 to avoid data casts in the pipeline. (5) Add options --force_fp32_ops, --inspect etc in optimize script so that user could force some operator to run in float32 to potentially get better image quality (with cost of performance). Performance tests show slight improvement in T4. Average latency reduced 0.1 seconds (from 5.35s to 5.25s) for 512x512 in 50 steps.	2023-02-14 12:46:50 -08:00
Ye Wang	b539c364ee	Some kernel changes for TULR (#14517 ) ### Description 1. fix a bug in relative position bias kernel where seq_len > 32 2. rename extra_add_qk to relative_position_bias 3. support relative_position_bias in multihead attention (B, N, S, S*) 4. gru_gate support by Lei --------- Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net> Co-authored-by: Lei Zhang <zhang.huanning@hotmail.com>	2023-02-07 11:51:06 -08:00
Tianlei Wu	a6c5ba0185	Stable Diffusion CUDA Optimizations (#14428 ) ### Description Add stable diffusion CUDA kernel optimizations. The following are included: (1) GroupNorm operator. This kernel is from TensorRT 8.5. (2) BiasSplitGelu operator. This kernel is modified from SplitGelu of TensorRT 8.5. We added bias to the SplitGelu. (3) NhwcConv operator. This adds support of NHWC format (ONNX Conv operator uses NCHW format). (4) Update MultiHeadAttention (packed kv and no bias) for cross attention. This could avoid transpose of kv for TRT fused cross attention kernel. (5) Optimization and benchmark script Not included: (1) Script to convert Conv to NhwcConv in onnx graph. (2) Update symbolic shape inference for NhwcConv. (3) Add SeqLen2Spatial operator (4) Documents Limitations: GroupNorm, BiasSplitGelu and NhwcConv kernels are implemented based on stable diffusion usage. They might not be applicable to any input size or dimensions. For example, BiasSplitGelu requires hidden size to be 2560, 5120 or 10240, and NhwcConv assumes 4D input/weight. There is a minor increase in binary size. For SM=75 only, python package wheel size adds (33757K - 33640K) = 117 KB. It is possible to move NHWC from template parameter to constructor to reduce binary size (with slight cost of performance). Note: for RTX 4090/4080/4070 Ti, a build with CUDA 11.8 and the latest cuDNN is needed to get the best performance.	2023-02-02 23:43:51 -08:00
Ye Wang	de7a868d5f	Update quantization_defs.cc (#14380 )	2023-01-20 15:03:50 -08:00
Ye Wang	668586e8f8	Support muP in Attention (#14348 ) Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2023-01-19 20:36:55 -08:00
Tianlei Wu	477cad3051	[CUDA] Add trt cross attention kernels (#14328 ) Add TRT cross attention kernels for stable diffusion optimization.	2023-01-17 17:55:45 -08:00
Ye Wang	2db57a53a3	Add mask_filter in Attention related ops' attribute (#14274 ) ### Motivation and Context https://github.com/microsoft/onnxruntime/issues/12843 Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2023-01-17 12:28:11 -08:00
Zhang Lei	15141a40b4	Add present_past_share_buff to QAttention Defs to enable QAttention related tests. (#14297 )	2023-01-14 09:19:06 -08:00
Ye Wang	c9a53c9255	Some changes to Sampling Op (#14218 ) ### Description 1. add an optional input to pass in seed 2. two UTs. one for top_p=0.5, another for top_p=0.01 (create greedy search result, in convert_generation.py) 3. fix a bug in cpu kernel Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2023-01-12 14:15:26 -08:00
Ye Wang	a01bf8dbb1	rename CrossAttention to MultiHeadAttention (#14201 ) ### Description Rename CrossAttention to MultiHeadAttention since this op can also be used as self attention. Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2023-01-10 10:18:39 -08:00
Ye Wang	5eac2c1f41	relational attention bias cuda op (#14149 ) ### Description This cuda op implements the compute_bias() method in T5 Attention including the permutation. note: 1. bias_table needs to be saved in col-major. be careful when implementing fusion script 2. second input(sequence length) is placed on cpu. (using Shape node's output should be good) 3. the first dimension of output is 1, so extra_add_qk in attention should support broadcasting 4. compute_bias() only used in self-attn in t5 TODO: docs change will be applied later ### Motivation and Context It's part of the process of optimizing t5 attention as well as t5 based generation model Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2023-01-06 17:32:58 -08:00
Tianlei Wu	2cacb24cb0	Add CrossAttention operator (#14146 ) Move separated Q, K and V (without input projection) from Attention to a new operator CrossAttention. The Attention operator is hard to maintain when we need support with and without input projection in one class. Add a new operator according to feedback. Some change might need in the future, but not in this PR: (1) bias could be optional (We will not proceed that route unless experiments show that fusing Add bias with MatMul instead of this op could improve performance). (2) support packed KV. There are two ways to support it: when key and value are same Tensor, they are packed; or we can make value as optional, and use packed mode when value is empty and the key has packed K/V. (3) support cached key and value, and other (like relative position bias), or more attention mask format. They can be added easily without breaking backward compatible. (4) ROCm/CPU implementation of this op.	2023-01-06 14:27:40 -08:00
Hariharan Seshadri	d0c5ffd5f7	Misc transformer fixes - 2 (#14156 ) ### Description 1. The graph pattern search introduced in https://github.com/microsoft/onnxruntime/pull/13914/ needs to be enhanced so that SkipLayerNormalization is supported 2. Fix fp32 parity for GPT-2 while using `SkipLayerNormalization` fusion. The optional output of SLN needs to also include the bias (if present) and the added output should be a sum of `input + skip + (bias)` ### Motivation and Context Fix some breaking tests	2023-01-06 07:27:10 -08:00
Ye Wang	ae148ebc05	T5 skip_layer_norm cuda op (#14093 ) ### Description T5 uses a layer_norm which only scales and doesn't shift, which is also known as Root Mean Square Layer Normalization. ORT already has simplified_layer_norm, which is the RMS layer_norm. This PR extends this T5 layer_norm with support for skip/bias and the residual output. This new op is named SkipSimplifiedLayerNorm and has a similar interface to SkipLayerNorm but removes beta as an input. Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2023-01-04 13:31:53 -08:00
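A minimal reference sketch of the computation described above, for a single hidden vector. The function name, argument order, `epsilon` default, and the choice to return the residual sum as a second value are assumptions for illustration and may differ from the actual SkipSimplifiedLayerNorm kernel.

```python
import math


def skip_simplified_layer_norm(x, skip, gamma, bias=None, epsilon=1e-6):
    """RMS layer norm over one hidden vector with skip and optional bias.

    Returns (normalized output, residual sum). There is no beta (shift) term.
    """
    if bias is None:
        bias = [0.0] * len(x)
    # Residual output: input + skip + bias.
    residual = [a + s + b for a, s, b in zip(x, skip, bias)]
    # Scale by the reciprocal root mean square; no mean subtraction.
    mean_square = sum(v * v for v in residual) / len(residual)
    inv_rms = 1.0 / math.sqrt(mean_square + epsilon)
    output = [g * v * inv_rms for g, v in zip(gamma, residual)]
    return output, residual


if __name__ == "__main__":
    out, res = skip_simplified_layer_norm([3.0, 0.0], [0.0, 4.0], [1.0, 1.0])
    print(out, res)
```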
Ye Wang	68518a1b72	Sampling op (#13426 ) ### Description Sampling op for cpu and cuda support huggingface case and custom case Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2022-12-22 17:34:12 -08:00
Hariharan Seshadri	7ed8bd4f95	Support (Bias)SkipLayerNormalization fusion in GPT2 (#13988 )	2022-12-21 23:04:44 -08:00
Zhang Lei	fba09faf5b	Implement reuse past and present tensor in Attention Ops. (#13791 ) Implement reuse kv_cache past and present tensor in Attention Ops. Unit test for the above feature. Utilize the reuse kv_cache for past and present tensor in Greedy Search. Correctness test for it. Co-authored-by: Zhang Lei <phill.zhang@gmail.com>	2022-12-18 10:03:53 -08:00
Hariharan Seshadri	abc5c25a85	Updates to GreedySearch/BeamSearch (#13943 )	2022-12-13 20:25:26 -08:00
Jian Chen	d7d932c1c2	Cjian/where python operator (#12795 ) Description: This PR will enable the python tool to run QWhere and QDQWhere operation Limitation: s8s8 Where is still not supported.	2022-12-12 13:27:47 -08:00
Hariharan Seshadri	004a1538d3	Extend vocab padding for logits MatMul for fp16 GPT2 GreedySearch (#13842 )	2022-12-06 19:39:20 -08:00
Tianlei Wu	8b0e0f4927	Add RemovePadding and RestorePadding for BERT model (#13701 ) Add two operators, RemovePadding and RestorePadding, based on the idea of effective transformer (https://github.com/bytedance/effective_transformer) to improve large batch size inference for BERT model.	2022-11-22 10:00:23 -08:00
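The idea behind removing and restoring padding can be sketched in plain Python. The function names, the `(batch, position)` offset list, and the cumulative length list are hypothetical and chosen for illustration; they are not the inputs and outputs of the actual RemovePadding and RestorePadding operators.

```python
def remove_padding(batch, mask):
    """Pack valid tokens contiguously.

    Returns (packed tokens, original (batch, position) of each token,
    cumulative sequence lengths).
    """
    packed, token_offset, cumulative = [], [], [0]
    for b, (row, row_mask) in enumerate(zip(batch, mask)):
        for s, (token, keep) in enumerate(zip(row, row_mask)):
            if keep:
                packed.append(token)
                token_offset.append((b, s))
        cumulative.append(len(packed))
    return packed, token_offset, cumulative


def restore_padding(packed, token_offset, batch_size, seq_len, pad=0):
    """Scatter packed tokens back into a padded [batch_size, seq_len] layout."""
    restored = [[pad] * seq_len for _ in range(batch_size)]
    for token, (b, s) in zip(packed, token_offset):
        restored[b][s] = token
    return restored


if __name__ == "__main__":
    batch = [[5, 6, 0], [7, 8, 9]]
    mask = [[1, 1, 0], [1, 1, 1]]
    packed, offsets, cumulative = remove_padding(batch, mask)
    print(packed, cumulative)
    print(restore_padding(packed, offsets, batch_size=2, seq_len=3))
```

Work done between the two steps touches only the packed tokens, which is where the saving for large batches with many padded positions would come from.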
Hariharan Seshadri	c7329e004d	Improve fp16 performance of GPT-2's logits MatMul while using BeamSearch (#13686 )	2022-11-18 18:50:19 -08:00
Ye Wang	38a74af45d	Support position_ids broadcasting in EmbedLayerNorm (#13677 ) ### Motivation and Context fix https://github.com/microsoft/onnxruntime/issues/13508	2022-11-17 17:56:27 -08:00
Vincent Wang	8b0669bf63	QuickGelu Fusion (#12417 ) Some models have QuickGelu(x)=x*sigmoid(1.702x), which has 3 Ops for forward and 5 Ops for backward. The PR is to fuse this to a single Op named QuickGelu and its gradient QuickGeluGrad. For CUDA, tested in V100 using input tensor with shape [64,128,2048] and float16 type: Before, FW takes 335us, BW takes 614us ![image](https://user-images.githubusercontent.com/11661208/182291335-15188709-ffe7-44d1-9d14-0b544cbe5e55.png) After, FW takes 115us, BW takes 139us, which is much faster. ![image](https://user-images.githubusercontent.com/11661208/182291502-f0b5161c-b95c-45fc-90f8-ad0c592d2433.png) For CPU kernel, using same shape and float type: Before, FW takes 10us, BW takes 49us Mul: 3480[µs] Sigmoid: 1996[µs] Mul: 4789[µs] Mul: 4642[µs] Mul: 4195[µs] SigmoidGrad: 18328[µs] Mul: 2988[µs] Sum: 18576[µs] After, FW takes 4us, BW takes 5us, which is also much faster. QuickGelu: 3939[µs] QuickGeluGrad: 5089[µs] Co-authored-by: Vincent Wang <weicwang@microsoft.com>	2022-10-28 18:12:07 +08:00
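The formula QuickGelu(x)=x*sigmoid(1.702x) and its gradient can be written out as a scalar sketch. This is a reference computation only, not the fused kernel; the parameter name `alpha` and the helper `sigmoid` are assumptions for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def quick_gelu(x, alpha=1.702):
    """Forward: x * sigmoid(alpha * x)."""
    return x * sigmoid(alpha * x)


def quick_gelu_grad(dy, x, alpha=1.702):
    """Backward: dy * d/dx [x * sigmoid(alpha * x)]."""
    s = sigmoid(alpha * x)
    return dy * (s + alpha * x * s * (1.0 - s))


if __name__ == "__main__":
    for value in (-2.0, 0.0, 0.7, 3.0):
        print(value, quick_gelu(value), quick_gelu_grad(1.0, value))
```

Written this way, the forward pass is the Mul, Sigmoid, Mul chain the commit message counts as three ops, collapsed into one function.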
Tianlei Wu	7aafd86229	Update Attention operator to support separated Q/K/V inputs (#13410 ) ### Description Allow separated Q, K and V inputs to support cross attention: * Q: [batch_size, sequence_length, hidden_size] * K: [batch_size, kv_sequence_length, hidden_size] * V: [batch_size, kv_sequence_length, v_hidden_size] * Output: [batch_size, sequence_length, v_hidden_size] To use separated Q/K/V inputs, the input tensor is for query, and two optional inputs are added for key and value. Weights for input projection is not included for now, so the MatMul of input projection shall be done out of Attention operator, but Add bias is included for performance consideration.	2022-10-25 11:51:06 -07:00
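The shape contract above can be checked with a small sketch of scaled dot-product attention over separate Q, K and V. It is deliberately simplified: one batch item, a single head, no bias and no mask, whereas the real operator splits the hidden size across heads. The function name is hypothetical.

```python
import math


def attention(q, k, v):
    """q: [S, H], k: [S_kv, H], v: [S_kv, H_v] -> output: [S, H_v]."""
    scale = 1.0 / math.sqrt(len(q[0]))
    output = []
    for q_row in q:
        # Scaled dot product of this query against every key.
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        # Softmax with max subtraction for numerical stability.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Weighted sum of value rows.
        output.append([
            sum(w * v_row[j] for w, v_row in zip(weights, v)) / total
            for j in range(len(v[0]))
        ])
    return output


if __name__ == "__main__":
    q = [[0.1, 0.2, 0.3, 0.4]] * 2          # sequence_length = 2, hidden_size = 4
    k = [[1.0, 0.0, 0.0, 0.0]] * 3          # kv_sequence_length = 3
    v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # v_hidden_size = 2
    out = attention(q, k, v)
    print(len(out), len(out[0]))
```

The output has `sequence_length` rows from Q and `v_hidden_size` columns from V, while `kv_sequence_length` only appears as the dimension being summed over.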
Ye Wang	928c9889a3	A few fixes for generative model ops (#13363 ) ### Description Fix a bug in GreedySearch Op when batch > 1. Support custom attention mask in GreedySearch and BeamSearch with GPT2.	2022-10-21 15:00:18 -07:00
garanews	38906625a3	fix some typo in docs (#13212 ) ### Description <!-- Describe your changes. --> fix some typo in docs ### Motivation and Context singed vs signed succeding vs succeeding fileter vs filter kernal vs kernel libary vs library	2022-10-07 15:58:18 -07:00
ashari4	b09dd11ece	BFP schemas: Change block dimension type to Int (#13169 ) * Change block dimension type to Int from Ints. * In response to feedback that the block dimension corresponds to the reduction dimension of the consuming matrix multiplication. There is always only 1 reduction dimension.	2022-10-06 11:11:43 -07:00
ashari4	c4a7e88fc8	QuantizeBFP and DequantizeBFP (#12833 ) * `QuantizeBFP` and `DequantizeBFP` schemas - similar to `QuantizeLinear` and `DeQuantizeLinear`. * BFP datatype is represented as a `uint8` tensor with shape and stride metadata. This is preferable to adding a new datatype for BFP, which is more disruptive and [discouraged by PyTorch](https://discuss.pytorch.org/t/training-with-custom-quantized-datatype/152132/2). Context: The Microsoft Floating Point (BFP) datatype shares an exponent for every n numbers called a “bounding box.” Each number still has its own mantissa and sign bits. BFP has been shown to incur 3-4x lower cost (energy and area) than BFloat16 and INT8 counterparts without reductions in accuracy for the ImageNet benchmark as described in [Rouhani 2020](https://proceedings.neurips.cc/paper/2020/file/747e32ab0fea7fbd2ad9ec03daa3f840-Paper.pdf). Requirements: * There are many variants of BFP (number of mantissa bits, number of shared exponent bits, size of bounding box, custom bit fields, etc.) * The size and layout of a BFP variant varies across hardware * bounding box can be over arbitrary dimensions; for example, for the channel "C" dimension in a N x C x H x W tensor for convolution Goals of this PR: * Add initial versions of QuantizeBFP and DequantizeBFP operators to enable QDQ-style quantization with BFP. Once the schemas stabilize, we can consider upstreaming to ONNX. * Add some basic type and shape inferencing tests; tests that run on an EP will be a follow-up.	2022-09-22 14:02:55 -07:00
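The shared-exponent idea can be illustrated with a generic block floating point sketch over one bounding box. This is not the QuantizeBFP schema or any hardware layout: the function names, the 4-bit mantissa default, the rounding mode, and the choice of the largest magnitude to pick the exponent are all assumptions for illustration.

```python
import math


def quantize_block(values, mantissa_bits=4):
    """Return (shared_exponent, signed integer mantissas) for one bounding box."""
    largest = max(abs(v) for v in values)
    if largest == 0.0:
        return 0, [0] * len(values)
    # frexp gives largest = m * 2**exponent with 0.5 <= m < 1.
    shared_exponent = math.frexp(largest)[1]
    scale = 2.0 ** (shared_exponent - mantissa_bits)
    limit = 2 ** mantissa_bits - 1
    # Each value keeps its own sign and mantissa; only the exponent is shared.
    mantissas = [max(-limit, min(limit, round(v / scale))) for v in values]
    return shared_exponent, mantissas


def dequantize_block(shared_exponent, mantissas, mantissa_bits=4):
    """Rebuild approximate values from the shared exponent and mantissas."""
    scale = 2.0 ** (shared_exponent - mantissa_bits)
    return [m * scale for m in mantissas]


if __name__ == "__main__":
    exponent, mantissas = quantize_block([1.0, 0.5, -0.25, 0.01])
    print(exponent, mantissas)
    print(dequantize_block(exponent, mantissas))
```

Values much smaller than the largest one in the box lose precision or round to zero, which is one reason the choice of bounding box size and dimension matters.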
Hariharan Seshadri	ad69aac491	Introduce ordered quantization ops for the CUDA EP [1/n] (#12582 ) Initial core small set for the ordered quantization ops for cuda EP.	2022-09-07 11:58:15 -07:00
Yulong Wang	c144acc534	Replace 'master' branch ref to 'main' in the code (#12547 )	2022-08-22 10:48:12 -07:00
Wei-Sheng Chin	dc486d146b	Make ORT callable from various Pytorch compilers (LazyTensor, TorchDynamo, etc) (#10460 ) * Make ORT as Pytorch JIT backend LORT likely doesn't work with aten fallback so we only test LORT in its own CI. * Revert changes to enable external CUDA allocator. Will add it later. Revert "Revert changes to enable external CUDA allocator. Will add it later." This reverts commit d5487f2e193014c805505afae8fb577c53667658. Fix external allocator * Relax tolerance and remove commented code * Print more information in CI * Fix pointer * Address comments. 1. Reuse ORT-eager mode's environment. 2. Remove unused ctor. * Use Pytorch master branch as all PRs are merged Fix * Refine based on cpplint feedbacks * Revert changes to allow custom CUDA allocator in public APIs * Use torch.testing.assert_close * Use unittest framework * Switch docker repo * Rename .cpp to .cc * Address comments * Add comment * Use same pipeline file for eager and lort pipelines * Address comments * Add yaml comment * Fix cmake files * Address comments * Rename flags, remove printing code, remove dead comment	2022-08-22 09:40:40 -07:00
Cheng	64e991a9fc	[Qlinearsoftmax] contrib cpu (#12177 ) * [Qlinearsoftmax] contrib cpu * int8 implementation * contrib operator md * qdq transformer test * new attribute: opset * doc * quantized tool * remove template to reduce Binary size * doc of contribe operators * enforce x_shape is valid * fix reduce_size if input-shape is dynamic * add UT * register one op for reducing binarysize * kernel hash update * docs/ContribOperators.md	2022-08-10 10:52:02 +08:00
Vincent Wang	37995a7245	[CUDA] BiasSoftmax Supporting New Pattern (#12361 )	2022-08-05 06:59:24 +08:00
Ye Wang	b622e5fa9b	Support vocab_mask/prefix_vocab_mask/no_repeat_number in greedysearch op (#12327 ) * support more inputs for greedy search * fix docs * refactor test * lint * review comments	2022-08-03 10:10:08 -07:00
Ye Wang	89ac61f4d4	support gpt2 model with greedy search (#12068 ) * greedy search gpt2 cpu checkin * add cuda support * add test * provider * update * fix some bugs * refactor impl class * refactor test * remove unused func * refactor parameters class * simplify padding * fix lint warnings * python format * Revert "python format" This reverts commit f25fe1017fa33d960b2418ebbb5dba6a4bd043cf. * python format * fix pipelines * fix pipeline * move bufferallocater to generate_impl_base * review comments(alignment, filename/namespace change) * rebase2 * python reformat * reformat * fix rocm build * review comment * review comments * review comments * fix a bug * rebase test files * python format * format import order * review comments * fix build	2022-07-22 15:45:16 -07:00
PeixuanZuo	5579d81fc8	[add] Add operator gemmfastgelu for ROCM (#12101 ) * [ADD] add gemm fast gelu * [UPDATE] refunction matmul_impl * [Update] delete tuning_ in this pr * [FIX] code format * [FIX] compiler warning * [Update] update doc	2022-07-13 15:40:16 +08:00
Ye Wang	859ef277a0	apply zcode changes to the beam search op (#11880 ) * apply zcode changes to the beam search op * fix pipeline failure * add doc * workaround for C# * update * update * use name zcode * review comment * review comments * fix cpplint * review comments	2022-06-20 18:39:07 -07:00
Tianlei Wu	6ee2c1b5fc	Remove temperature input from BeamSearch operator (#11896 ) * remove temperature input * update index of remaining inputs	2022-06-20 09:50:45 -07:00
Tianlei Wu	def78a1b81	Support T5 in BeamSearch operator (#11450 ) (1) Support T5 in BeamSearch operator, and add both CPU and CUDA implementation. (2) Change BeamSearch op: rename encoder_decoder_init attribute to encoder, and add decoder_start_token_id attribute (3) Update convert_to_onnx for T5 to use int32 instead of int64 inputs as default. (4) Add more tests in best_beam_search.py (5) fix ORT_ENFORCE of hypothesis_buffer_offset_ (6) Improve ONNX conversion: (a) Change encoder some dynamic axes to fixed dim value (b) add --separate_encoder_and_decoder_init (c) correct name t5-3B => t5-3b, t5-11B => t5-11b (d) Add --use_int32_inputs in convert t5 to onnx (e) Allow t5 beam search conversion in one step	2022-06-10 15:06:57 -07:00
Hector Li	95a16c1ffe	Snpe ep (#11665 ) * Initiate Ort SNPE EP * fix snpe ep windows build which is caused by the utility method (ToUTF8String) name change on master * correct the source path for libonnxruntime.so while building for android package * add AdditionalDependencies for arm64 * On MS-Windows, the patchfile must be a text file, i.e. CR-LF must be used as line endings. A file with LF may give the error: "Assertion failed, hunk, file patch.c, line 343," unless the option '--binary' is given. * fix build failure if snpe is not enabled * update doc for contrib op * separate out snpe ep settings to onnxruntime_snpe_provider.cmake * renaming according to review comments * update according to review comments	2022-06-03 14:10:02 -07:00
Vincent Wang	02724c54ff	[CUDA] Implement BitmaskDropout, BitmaskBiasDropout and BitmaskDropoutGrad (#11534 ) * Implement BitmaskDropout and associated unit tests. * Implement BitmaskDropoutGrad and associated unit tests. * Implement Dropout -> BitmaskDropout rewrite rule and associated unit tests. * Implement (Dropout,DropoutGrad) -> (BitmaskDropout,BitmaskDropoutGrad) rewrite rule. This commit does not yet include unit tests for this rewrite rule. This commit also introduces improved documentation for all changes which will be grouped into this PR. * bitmask dropout * fix win build * bugfix for rocm * bugfix * fix code format * fix ut * fix build break * fix ut in win * resolve comments * fix ut in trt * resolve comments * fix rocm build error * fix typo Co-authored-by: Aidan Beggs <aidanbeggs@microsoft.com>	2022-05-27 17:24:47 +08:00
Tianlei Wu	0e335aba37	Update BeamSearch operator spec to support t5 (#10777 ) * change BeamSearch op to support encoder decoder model * check model_type and decoder attribute * fix * update comments * warn shape inference issue with onnx v1.11 or T5 * skip parity test when temperature != 1.0 * fix build	2022-03-04 21:52:45 -08:00
Tianlei Wu	36c3271546	BeamSearch op cuda (#10556 ) Add BeamSearch cuda implementation with support of fp16 GPT-2 subgraph	2022-02-25 13:08:55 -08:00
Changming Sun	3185680b6c	Add NHWC CONV contrib op (#10506 )	2022-02-10 15:47:49 -08:00
Viswanath Boga	ad9d2e2e89	Prefix match in first iteration of beam search OP (#10231 ) * Add BeamSearch op schema * Add ONNX conversion for beams search * remove attention_mask and change input order * add option to run baseline * add check data type NULL * applies VerifyNodeAndOpMatch to subgraph * update input_ids shape * Add node name for Cast node * expose API for topk * parse parameters * Add beam search scorer * output results * fix typo * use c++ template and format python * fix build pipeline errors * symbolic shape infer of input onnx * output scores * add kernel def hash * Handle vocab_mask; move CheckSubgraph * undo insert_cast_transformer.cc and fusion_utils.py * fix typo * fix merge * update doc * add repetition penalty * refactoring: add GptSubgraph class * move BeamSearchState from .h to .cc file * adjust logits processor order * add batch generation example * fix repetition penalty for dup words in sequence * Add test * Add no repeat ngram processor * refactoring: move logits processor to classes * fix build warning * show latency * use allocator in beam state * use allocator in sequences * fix build error * move next_positions to beam state * Changes for prefix matching * removing debugs * removing more debugs * clean up * clean up * cpu doc updated * Updated docs * updated prefix_vocab_mask dimension in convert script * changes to support bxs prefix_vocab_mask in beamsearchop kernel * doc update * OperatorKernels.md updated * matching docs from artifacts * minor change in logits processor * Addressing comments * Updated the prefix vocab mask usage properly Co-authored-by: Tianlei Wu <tlwu@microsoft.com>	2022-02-03 00:14:39 +05:30
Vincent Wang	44e2db9397	CUDA BFloat16 Refactor (#10085 )	2022-01-14 19:38:56 +08:00
Vincent Wang	ceb17f82ff	Use FusedMatMul When Transpose is Between First Dim and Contiguous Batch Dims (#9734 ) * fusedmatmul support transpose batches * fix win build * fix contrib op md * more comments	2021-12-27 10:49:46 +08:00
Tianlei Wu	ef36488df0	Add BeamSearch operator for GPT-2 decoding (#9680 ) * Add BeamSearch operator and CPU implementation * Add ONNX conversion script	2021-12-16 16:08:05 -08:00
Ye Wang	6856619b18	Decoder Attention CUDA Op (#9792 ) * add kernel interface * register kernel * add self/cross qkv projection without cache * add LaunchTransQkv2 for (S,B,X,N,H) -> (X,B,N,S,H) * refactor ConcatPastToPresent * DecoderQkvToContext interface * q,k,v buffer and cache as output * qk, pv and transctx * fix compiler error on linux machine * key_padding_mask * add test_parity file. However not runnable * add partial unittest * made partial attributes to inputs * --gen_doc * change kernel interface, add more tests * more parity tests * fix test * fix typo * transpose optimizer has bug. remove it temporarily * add input shape checks * add type/shape inference * fix cache shape check * fix rocm build failure * fix rocm build error * review comments * review comments	2021-11-19 19:25:36 -08:00
Hariharan Seshadri	bbeceb7541	Support optional type in ORT (#8339 )	2021-11-04 15:01:42 -07:00
Viswanath Boga	85874bb315	embed layer fusion gpt2 (#9336 ) * Changes to fuse embed layer for gpt2, kernel changes pending * verified add output and regular add match * Test added for additional output embedlayernorm, working on CUDA * Test passing on CPU * updated convert_to_onnx tool to check parity correctly * removed some debugs * couple of TODO left as in optimizer.py * removed changes to optimizer.py * fixing build * fixing build * updated order of initialization * added a test case for float16 * updating the docs * updating tests failing due to embed layer fusion * update unit tests * updating CUDA documentation in operatorkernels.md * addressing comments * OperatorKernels.md updated with CUDA * adding TODO to qembed_layer * minor edit * updated docs * addressing comments * adding position ids to embed layer gpt2 * updating fused gpt2 model * added extra test * remove comments * addressing comments * contrib_defs.cc updated * all tests passing * fixing a typo * minor edit * trigger build * qembedlayernorm checkinputs updated * fixing build error * fixing build error * fixing build error	2021-10-28 11:06:26 -07:00
Bowen Bao	e983f37121	Bifurcation detector for aggressive decoding (#9432 ) ``` Component for aggressive decoding. Find the bifurcation index of predicted tokens, between source tokens, starting from previous suffix match index, and predicted tokens. Concat predicted tokens, starting from bifurcation index, to the back of current tokens. This forms the output tokens. Detect suffix match index in source tokens, between source tokens and output tokens. Detection is based on finding the appearances of last n-gram in output tokens in source tokens. A match is considered found if source tokens contain a single matching n-gram. Return the index of the start of the n-gram in source tokens. No match is found if source tokens contain multiple or zero matching n-grams. Return -1. ```	2021-10-19 19:53:56 -07:00
Hariharan Seshadri	4698b73725	Fix output shape description of Attention op's schema (#9406 )	2021-10-19 15:56:35 -07:00
mindest	f9cf62912a	Add same_shape case for BiasDropout (#9188 ) * bias dropout improvement * add transform case for same shape case * combine kernel * merge with vectorized kernel * use "has_same_shape_bias" * minor: a "N % 4 != 0" case * add op UT for has_same_shape_bias * address comments; add param case for 1d bias; add param case tests for 1d and same-shape bias * rewrite logic condition Co-authored-by: Peng Wang <pengwa@microsoft.com>	2021-10-12 19:57:38 +08:00
Yufeng Li	ceeb1a65d6	Add quantization support of GEMM directly with QGemm (#8447 ) QGemm takes in quantized A, B, C, and quantization parameters of output Y, in which C and quantization parameters of Y are optional. Its output can be quantized or full precision, depending on whether quantization parameters of Y exist. If quant params of Y are provided, the output will be requantized; otherwise it is full precision. Compared with QLinearMatMul and MatMulInteger, QGemm supports the transpose, alpha and beta attributes. The formula for quantized GEMM is: Y = alpha * scale_a * scale_b * ((A_int8 - zp_a) * (B_int8 - zp_b) + C_int32), in which C_int32 is quantized with the formula: C_int32 = (beta * C) / (alpha * scale_a * scale_b)	2021-07-27 21:21:49 -07:00
Dmitri Smirnov	950fe5e28b	Implement SparseTensor and infrastructure support and advance ONNX commit (#8038 ) SparseTensor support Implement Builder pattern Fix support for 1-D and 2-D COO indices Implement and test CSR support. Handle shape inference for SparseTensors Implement conversion for COO, CSR and tests. Address the case where constant sparse initializer is the output. Implement test infra for SparseTensors Implement SparseDenseMatMul for Csr and COO and tested it. Add hash for SparseToDenseMatMul Finish shared provider refactor Refactor GetOrCreate to Create Working on py interface Expose OrtDevice and use it in allocate_numpy Adjust Sparse interfaces, add support for string SparseTensor. Add tests. Add and test to_cuda() Add accessors to format specific indices Test values and indices views, read-only flag, after GC access Add sparse related methods to OrtValue Re-work SparseTensor wrapper, add OrtValue methods Rework numpy_array_to_cuda/to_cpu Add run_with_ort_values Add models and test sparse_mat_mul with run_with_ort_values Refactor sparse tensor to use a single buffer Ifdef x86 Eigen CSR sparse matmul implementation Exclude broken test, check for string type when copying cross device Split pybind schema, regenerate docs, add exclusion Conditionally exclude schema module Update docs fix cuda build Add test to a filter and regenerate JS docs Add conversion and test string support for sparse tensors Exclude conversion utils from minimal build Add CUDA Memcpy and adjust provider interfaces	2021-07-22 15:24:36 -07:00
DeyuHuang	4275055868	Add Gridsampler contrib op (#8372 ) * add Gridsampler contrib op * fix gridsampler_paddingmode_border test * disable the tests until the kernel added * fix CI failure * change GridSampler to GridSample	2021-07-22 15:39:28 +08:00
Viswanath Boga	afce0e2543	Attention kernel update to handle different Q,K,V hidden sizes (#8039 ) * changes working to convert akv nodes * changes to replace nodes * changes to accommodate qkv hidden sizes as attributes * kernel to accept qkv_hidden_size attributes * Working till compute for varied dimension, todo applyattention() * changes to make all regression tests work * inference running successfully without prepack * success inference with pre-pack weights * add test for diff sizes * bias shape need not be a mul of 3 * get the output_hidden_size from input * infer output shape from input * merge with master * cleaning up files that got merged wrong * accuracy at accepted level * added unit test case for different dimensions * all unit tests passing * packed weights working for attention * prepacked weights working * added test case for newly added extra qk input * updated unit test to test only extra add qk * fixing build error * removing few debugs * reverting test changes * all python test passing * cleaning up * new unit test added, major clean up of code * removed extra code * minor * minor fix to tests * prepack weights code cleaned up * compacted compute() in attention.cc * reformat compute() * making a parameter T * adding 3 q,k,v buffers in all cases * fixing build * running tests only on cpu * Updating docs * trigger ci builds * Addressing comments in PR * addressing some more comments * get add_qk_str from add_qk node directly * updating docs, added extra check to verify attn inputs * Optimized the extra add by parallelizing * added attention_shape to symbolic_shape_infer.py * minor refactoring to address comments	2021-07-19 12:21:33 -07:00
Nick Kreeger	800b62a139	Create a quantized EmbedLayerNorm for ORT. (#8124 ) Create a quantized EmbedLayerNorm Op for ORT	2021-06-25 17:51:43 -05:00
Negin Raoof	80b7b134bf	Adding optional ops in contrib ops (#7946 ) * Added optional const spec	2021-06-24 13:16:31 -07:00
Bowen Bao	51c12a715b	Add NGramRepeatBlock contrib op (#8078 ) Description: Enforce no repetition of n-grams. Scores are set to `-inf` for tokens that form a repeated n-gram if added to the back of the input_ids. Motivation and Context Needed by transformer models in sequence generation algorithms (greedy search and beam search). This module has heavy impact on performance, and can be highly parallelized.	2021-06-21 10:21:48 -07:00
Scott McKay	0fbec1b9c1	Update the operator documentation generation (#7787 ) * Update the operator documentation generation - Make layout a little nicer - Update to latest supported operators including training - Fix some links that are broken when the docs content is copied to github-pages - Fix incorrect usage of 'onnx.ai.ml' as the default domain - ML ops are now separated from the real default domain of 'onnx.ai' - Include CPU, CUDA and training kernels - exclude DNNL as it's not an EP we own * There are separate paths for CUDA and CUDNN as they are not guaranteed to be in the same location on a Windows machine. Use the CUDNN path when looking for the CUDNN library. * Enable validation of both contrib ops and operator kernels in build Filter generation so it's deterministic Add ability for CI to publish the md files as build artifacts if they differ so a developer can download and add to their PR to resolve any diffs. Remove workarounds for github-pages as that will now link to the github docs which display correctly	2021-06-02 17:47:40 +10:00
Yufeng Li	a74e41e47d	Add non-zero zp support for quant matmul and attention (#7570 ) * add non-zero zp support * support A and B scale with any dimensions	2021-05-14 16:50:31 -07:00
Zhang Lei	50c5edcf13	Add nhwc support for QLinearAveragePool operator (#7656 ) * Add nhwc support for QLinearAveragePool operator * Update ContribOperators.md * Update OperatorKernels.md with cpu,dnnl and cuda enabled.	2021-05-13 22:05:30 -07:00
Tracy Sharpe	16297a8e61	Implement NCHWc Upsample linear mode (#7623 ) Extend the existing NCHWc Upsample operator to support linear modes too.	2021-05-10 12:16:16 -07:00
Ye Wang	803837df63	Add 4dmask support for attention cuda kernel (#7591 ) * checkin * add 4dmask support in attention cuda op * trim * add comments * fix build/test error * review comments and add tests * sync doc * review comments * minor change	2021-05-07 20:17:29 -07:00
Tracy Sharpe	d13e5b2fd9	NCHWc: ReorderInput improvements (#7442 ) Implement various improvements related to reordering a tensor for use by NCHWc operations: Relax the requirement that the input channel count must be a multiple of the NCHWc block size (either 8 or 16 depending on ISA). The requirement now is that the channel count must be a multiple of 4. The implementation of MlasReorderInputNchw would need further work to support relaxing this further, but I don't have any models where I've observed this to be necessary yet. Support fusing a Transpose(NHWC->NCHW) into a following ReorderInput. ReorderInput now has a channels_last attribute as was done in the past for ReorderOutput. This helps with models converted from TF where the converter is unable to remove all Transpose operations. Add threading support to ReorderInput to accelerate performance (ReorderOutput will come later).	2021-04-26 19:16:39 -07:00
Zhang Lei	ada0fbbd2d	Implement qlinear concat and unit test. (#7341 ) * Implement qlinear concat and unit test. Add quantization tools for QLinearConcat and its quantization tests. * Add kernel def hash for QLinearConcat. * Change according to PR. Add qdq transformer support for QLinearConcat. * Add QDQ Transformer unittest. Fix typo on domain. * remove dup logic of no use. * fix x86 build error. * Update operator docs.	2021-04-26 13:38:40 -07:00
Changming Sun	afa7b23609	Update docs/ContribOperators.md and the script that generates it. (#7399 )	2021-04-21 16:20:56 -07:00
Changming Sun	5bd192c439	Update ContribOperators.md (#7246 )	2021-04-05 17:11:33 -07:00
Ashwini Khade	2a018cc235	revert contrib op version bump and deprecation of TransposeMatMul (#5424 ) * revert contrib op version bump and deprecation of TransposeMatMul * update documentation	2020-10-12 13:02:15 -07:00
Ashwini Khade	3f00b8db8f	move all experimental ops to version 1 of ms domain (#5287 ) * move all experimental ops to version 1 of ms domain * deprecate TransposeMatMul in favor of FusedMatMul * update documentation	2020-09-30 14:50:18 -07:00
Nat Kershaw (MSFT)	8a03b6e5c7	Render Operator documentation as compliant markdown (#3658 )	2020-09-02 15:07:50 -07:00
Hariharan Seshadri	1599562016	Fix BatchNorm CUDA kernel definition	2020-04-18 17:21:29 -07:00
Hariharan Seshadri	b4457ecb7a	Fix `gen_doc` build option and refresh documentation (#3545 ) * Support listing keys in custom metadata map via C/C++ API * nit * PR feedback * Nit * Initial commit * More changes * Support listing keys in custom metadata map via C/C++ API * nit * PR feedback * Nit * Initial commit * More changes * Add md files * Doc changes * Update * revert cmake changes * Update * Doc change * Update * Update	2020-04-17 14:41:04 -07:00
David Fan	c9d83a52a8	Implement contrib op CropAndResize (#1277 ) * Implement contrib op CropAndResize * Implement contrib op CropAndResize	2019-06-24 18:34:35 -07:00
Hariharan Seshadri	c69dff7928	Implement contrib kernels for Pad (changed interface) and Unique (new ONNX op) (#1006 ) * Initial commit * Rename DynamicPad to Pad * More changes * Add Unique operator * Revert accidental check-in * Fix CUDA Pad to align with changes * More changes * Fix more CUDA pad source files * More fixes * More changes * More changes * Avoid vector copy * Update vector validation logic * Fix build failures * Fix build * Fix build failure * Fix tensorrt build	2019-05-13 13:10:18 -07:00
shahasad	306453f9d6	fix the link to the script in the doc. fix some error messages (#960 )	2019-05-02 19:21:41 -07:00
shahasad	2c46fff69a	Enable gen-doc on windows CI (#716 ) * add --gen_doc to ci_build * make gen-doc conditional to build/test step * some fix in the git diff check * some more trick on doc diff * updated for input/output * updated the contrib operator doc * fix on missing input output descriptions * fixed the problem of missing doc string, due to protobuf optimization * fix * revert last change * moved gen_doc.py to /tools/python * fixed typo	2019-05-01 14:58:21 -07:00
shahasad	83ae641425	add documentation for custom ops (#708 ) * added tools for doc gen, added doc * doc updated * some fixes * hooked up with build.py * hooked up with build.py and fail on nonupdated doc * update	2019-03-26 21:58:01 -07:00

150 commits