onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-05-17 21:10:43 +00:00

Author	SHA1	Message	Date
kunal-vaishnavi	901c2bc384	Whisper Model Optimization (#15473 )

### Description

This PR contains fusion-level and kernel-level optimizations for [OpenAI's Whisper](https://github.com/openai/whisper).

Some of the added optimizations include:

- Pruning of duplicate/unnecessary inputs and outputs
- Fusion support for Whisper models with or without these inputs/outputs (e.g. with these inputs/outputs if exporting with an older official Optimum version, without these inputs/outputs if exporting with Optimum from source)
- Attention fusions
  - For Whisper's encoder and decoder
  - Modified symbolic shape inference for present output when no past input exists (for decoder)
- Multi-head attention fusions
  - For Whisper's decoder and decoder with past
  - Packed MatMul for the 3 MatMuls excluded in multi-head attention fusion
- Attention kernel changes
  - CPU:
    - Different Q and KV sequence lengths
    - Parallel memset for large sequence lengths
    - Convert broadcast add after MatMul of Q and K (add_qk) to element-wise add
    - Separate present key-value output into present key and present value (for multi-head attention spec)
  - CUDA:
    - Use memory efficient attention compute kernel with present state (for decoder)
- Multi-head attention kernel changes
  - CPU:
    - Introduction of multi-head attention CPU kernel (previously did not exist)
    - Use AddBiasReshape instead of AddBiasTranspose when sequence length = 1 (for decoder with past)
    - Different Q, K, V input shapes
    - Pass past key and past value directly as key and value
  - CUDA:
    - Use memory efficient attention compute kernel with past and/or present state (for decoder with past)

### Usage

To use the optimizations, run the ORT transformer optimizer script as follows:

```
$ cd onnxruntime/onnxruntime/python/tools/transformers/
$ python3 optimizer.py --input <filename>.onnx --output <filename>.onnx --model_type bart --num_heads <number of attention heads, depends on the size of the whisper model used> --hidden_size <attention hidden size, depends on the size of the whisper model used> --use_external_data_format --use_multi_head_attention
```

Once optimized, here's an example of how to run Whisper with [Hugging Face's Optimum](https://github.com/huggingface/optimum):

```
from transformers.onnx.utils import get_preprocessor
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from optimum.pipelines import pipeline as ort_pipeline

import whisper  # Installed from OpenAI's repo - setup instructions at https://github.com/openai/whisper/

directory = './whisper_opt'  # Where the optimized ONNX models are located
model_name = 'openai/whisper-tiny'
device = 'cpu'

# Get pipeline
processor = get_preprocessor(model_name)
model = ORTModelForSpeechSeq2Seq.from_pretrained(
    directory,
    use_io_binding=(device == 'cuda'),
    provider='CPUExecutionProvider',
).to(device)

pipe = ort_pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=(-1 if device == 'cpu' else 0),
)

# Load audio file and run pipeline
audio = whisper.load_audio('tests/jfk.flac')
audio = whisper.pad_or_trim(audio)
outputs = pipe([audio])
print(outputs)
```

Note: In order to use these changes with Optimum, it is recommended to use Optimum from source to have the following changes:

- https://github.com/huggingface/optimum/pull/872
- https://github.com/huggingface/optimum/pull/920

### Motivation and Context

This PR helps the following issues:

- https://github.com/microsoft/onnxruntime/issues/15100
- https://github.com/microsoft/onnxruntime/issues/15235
- https://github.com/huggingface/optimum/issues/869 (work in progress)

This PR can be used with the other currently merged Whisper PRs:

- https://github.com/microsoft/onnxruntime/pull/15247
- https://github.com/microsoft/onnxruntime/pull/15339
- https://github.com/microsoft/onnxruntime/pull/15362
- https://github.com/microsoft/onnxruntime/pull/15365
- https://github.com/microsoft/onnxruntime/pull/15427

This PR uses changes from the following merged PRs:

- https://github.com/microsoft/onnxruntime/pull/14198
- https://github.com/microsoft/onnxruntime/pull/14146
- https://github.com/microsoft/onnxruntime/pull/14201
- https://github.com/microsoft/onnxruntime/pull/14928 (this introduced the new multi-head attention spec)	2023-04-18 17:13:54 -07:00
liqun Fu	919d8f2660	update with onnx main (#14929 )	2023-04-18 08:42:51 -07:00
Patrice Vignola	3be5bfe363	[DML EP] Add MatMul + SoftMax fusion (#15240 )	2023-04-11 08:31:04 -07:00
Patrice Vignola	7c927bb95c	[DML EP] Add BiasSplitGelu (#15197 )	2023-04-11 08:30:37 -07:00
Patrice Vignola	c5b6ee1a99	[DML EP] Add NhwcConv (#15194 )	2023-04-10 23:16:09 -07:00
Patrice Vignola	4a676b011a	[DML EP] Add BiasAdd (#15211 )	2023-04-10 14:46:33 -07:00
Patrice Vignola	9191e04259	[DML EP] Add QuickGelu (#15220 )	2023-04-05 10:49:34 -07:00
Aditya Goel	a4e9a48345	Reduce operators support for int64 type (#15358 )	2023-04-05 09:19:43 -07:00
Aditya Goel	1c1d386561	Adds int32_t and uint32_t clip kernels (#15306 )	2023-04-04 13:44:50 -07:00
petermcaughan	1251964f96	Petermca/beamsearch whisper (#15339 ) ### Description Adjust various code paths to allow Whisper model to function with BeamSearch op. Approach: Add a new kModelType enum value in IGenerationParameters as so: #### Old: 0 = GPT2, 1 = T5 #### New: 0 = GPT2, 1 = T5, 2 = Whisper When the user assigns this attribute value to 2, various shape and type checks are changed to accommodate Whisper inputs. ### Motivation and Context BeamSearch is currently designed to function with BERT-based models with inputs as vocab tokens, and needs changes to function with Whisper inputs (3-D float values processed from audio data). --------- Co-authored-by: Peter McAughan <petermca@microsoft.com>	2023-04-04 09:09:10 -07:00
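The model-type attribute described in the commit above can be pictured as a small enumeration. This is only an illustrative sketch: the Python names `ModelType` and `expected_input` are hypothetical and not part of ONNX Runtime. Only the integer values (0 = GPT2, 1 = T5, 2 = Whisper) and the 3-D float input for Whisper come from the commit message; the 2-D integer shape for token inputs is an assumption.

```python
from enum import IntEnum


class ModelType(IntEnum):
    """Hypothetical mirror of the model type values listed in the commit."""
    GPT2 = 0
    T5 = 1
    WHISPER = 2


def expected_input(model_type: int) -> tuple:
    """Return (rank, dtype family) of the main model input.

    Assumption: token-based models take 2-D integer token ids. The commit
    only states that Whisper takes 3-D float values derived from audio.
    """
    if ModelType(model_type) is ModelType.WHISPER:
        return (3, "float")
    return (2, "int")


print(expected_input(ModelType.WHISPER))
print(expected_input(ModelType.GPT2))
```

Passing an integer outside 0..2 raises `ValueError` from the enum constructor, which is one simple way to reject unknown model types early.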
Ye Wang	fbfe92f66a	DecoderMaskedMultiHeadAttention enhancement (#15292 )	2023-04-02 21:53:03 -07:00
Patrice Vignola	67a6022c03	[DML EP] Add GroupNorm (#15189 ) Comparison between the different normalization operations: ![](https://user-images.githubusercontent.com/1041752/106491728-73d40680-64b7-11eb-8769-3f758996e959.png)	2023-03-27 12:52:53 -07:00
Ye Wang	44ba23e0f5	Rename DecoderMaskedMHA to DecoderMaskedSelfAttn (#15166 ) ### Description <!-- Describe your changes. --> As synced offline, rename this op and will create another op for mha that supports both self and cross attention. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> --------- Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2023-03-23 12:31:38 -07:00
Yufeng Li	c7ced7a5e9	Add PackedAttention for packing mode (#14858 ) ### Description <!-- Describe your changes. --> Transformer models can handle batch of inputs at once. However, sequences in a batch usually have different length. Then we have to pad the short one to have same length as the longest. This is not efficient especially for large batch with high variance. This PR introduces a PackedAttention operator which can take in packed sequences (no padding) and also produces output in packing mode. There will be another PR to use the PackedAttention to implement the encoder in packing mode. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-03-21 12:59:29 -07:00
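The padding overhead that motivates the commit above can be illustrated with a few lines of plain Python. This is a sketch of the general idea of packing, not of the PackedAttention operator itself: the function name `pack_sequences` and the cumulative-offset layout are assumptions, since the commit message does not describe the operator's inputs.

```python
from itertools import accumulate


def pack_sequences(sequences):
    """Concatenate variable-length sequences without padding.

    Returns the packed token list plus cumulative offsets, so that
    sequence i occupies packed[offsets[i]:offsets[i + 1]].
    """
    packed = [token for seq in sequences for token in seq]
    offsets = [0, *accumulate(len(seq) for seq in sequences)]
    return packed, offsets


sequences = [[11, 12, 13, 14, 15], [21, 22], [31]]
packed, offsets = pack_sequences(sequences)

# A padded batch stores batch_size * max_length tokens.
padded_tokens = len(sequences) * max(len(seq) for seq in sequences)

print(offsets)                     # [0, 5, 7, 8]
print(padded_tokens, len(packed))  # 15 8
```

In this toy batch the padded layout holds 15 token slots while the packed layout holds 8, which is the kind of waste that grows with batch size and length variance.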
Hariharan Seshadri	ed7ab1660d	[CUDA] Add option to use DecoderMaskedMultiheadAttention in BeamSearch (#14990 )	2023-03-15 17:16:32 -07:00
Ye Wang	538d64891a	[t5 optimization] kernel changes to t5 (#14928 ) ### Description <!-- Describe your changes. --> 1. support optional bias in Attention op (used in T5 encoder) 2. support broadcasting rel_pos_bias in attention_softmax.h 3. add scale in MHA op's attributes 4. support past_key/past_value and present_key/present_value in MHA 5. UT and parity tests are added 6. fix an issue: https://github.com/microsoft/onnxruntime/issues/14920 note: the fusions will be in another PR since mt5 needs to be tested and an issue from github will be investigated. Future works: 1. support shared buffer for past/present 2. enable trt kernels when possible and investigate (trt/cutlass)kernels with rel_pos_bias) 3. support KV/QKV packing with past/present ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> --------- Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2023-03-13 14:29:16 -07:00
Hariharan Seshadri	112a4d215a	[CUDA] Support decoding multihead self-attention implementation (#14848 )	2023-03-08 09:17:54 -08:00
Justin Stoecker	928289c414	STFT for DML EP (#14736 ) ### Description Implements the STFT operator for the DirectML execution provider. This is implemented as a custom op, just like the DFT kernel, because it's implemented as a composite of two operators (DML Mul/Identity + DFT). As such, this inherits the same restrictions as the existing DFT kernel (requires power-of-two window sizes for now). This change also adds a native FP16 shader to DFT so that both DFT/STFT kernels support float16 tensors. There is no typed UAV fallback or emulation path, so the HW _needs_ to support native float16. It also appears the stockham shader was compiled with all optimizations disabled and debug symbols (tsk tsk, Sheil), and this has been fixed. This is passing all existing STFT tests (i.e. all of 1). I'm adding some additional collateral in the Windows AI conformance tests in parallel to check some extra cases. --------- Co-authored-by: Patrice Vignola <vignola.patrice@gmail.com>	2023-02-23 21:12:22 -08:00
Sheil Kumar	1b7f65437e	Enable Opset11 Sequence Ops on DirectML, and make the CPU implementations agnostic to backend EP (#14442 )

Enable Opset11 Sequence Ops on DirectML, and make the CPU implementations agnostic to backend EP

Opset 11 introduced the following sequence related operators:

- SequenceAt
- SequenceConstruct
- SequenceEmpty
- SequenceLength
- SequenceErase
- SequenceInsert
- ConcatFromSequence

With the exception of ConcatFromSequence, all of the above operators were implemented with CPU kernels that a) required all of the contained tensors to also be on CPU, and b) would clone each tensor into a new sequence as a side effect of each operator.

The implementation of sequences is backend agnostic, as sequences don't affect actual tensor layout or manipulate the contents of the tensors. In addition, with the exception of SequenceAt, the other operators need not make copies of the underlying referenced tensors.

Consequently, this change does the following:

1) Sequence* operators (except SequenceAt) no longer copy the contents of a sequence of tensors on every kernel execution.
2) SequenceAt uses the DataTransferManager to copy tensors agnostic to backend.
3) The internal container implemented by TensorSeq has changed from onnxruntime::Tensor to OrtValue. This is because onnxruntime::Tensor does not support copy or assignment construction, so it must have a singular owner. However, if the same tensor participates in multiple containers it would have multiple container "owners" and this would not be possible.
4) Other code that accessed values from TensorSeq has associated changes to extract Tensors from OrtValues now.

In addition, DirectML execution was very slow when the above Sequence operators were added to a graph, as this caused MemcpyToHost and MemcpyFromHost kernels to be inserted between the graph and the sequence operators. To optimize DirectML,

1) The CPU implementations for the Sequence* ops were registered as DML implementations. Since the above changes also include making the CPU kernel implementations EP agnostic, the CPU kernels can be added as is.
2) The ConcatFromSequence operator needed to be implemented on DirectML. However, there was little DirectML EP operator framework support for operators that accept/output sequences of tensors. This change has modified the internal COM interfaces to include new APIs to interrogate for sequence shapes, and extract the needed tensors from TensorSeq.

---------

Co-authored-by: Patrice Vignola <vignola.patrice@gmail.com>	2023-02-21 18:08:28 -08:00
Ryan Hill	892f59b31a	Add string support to tile op (#14686 ) ### Description Add std::string tensor type support to Tile operator ### Motivation and Context Multiple users are hitting this missing feature: https://github.com/microsoft/onnxruntime/issues/14511	2023-02-16 14:59:44 -08:00
Tianlei Wu	f638c5a2ae	Stable Diffusion CUDA Optimizations Part 3 (#14646 ) The third part for stable diffusion CUDA optimizations (1) Add BiasAdd operator to replace two Add (bias and residual); Add fusion for BiasAdd (2) Add Attention fusion for VAE decoder. (3) Update float16 conversion to handle Resize and GroupNorm. This could reduce two Cast nodes for each Resize op in fp16 model. (4) Force inputs and outputs to be float16 to avoid data casts in the pipeline. (5) Add options --force_fp32_ops, --inspect etc in optimize script so that user could force some operator to run in float32 to potentially get better image quality (with cost of performance). Performance tests show slight improvement in T4. Average latency reduced 0.1 seconds (from 5.35s to 5.25s) for 512x512 in 50 steps.	2023-02-14 12:46:50 -08:00
Ye Wang	b539c364ee	Some kernel changes for TULR (#14517 ) ### Description <!-- Describe your changes. --> 1. fix a bug in relative position bias kernel where seq_len > 32 2. rename extra_add_qk to relative_position_bias 3. support relative_position_bias in multihead attention (B, N, S, S*) 4. gru_gate support by Lei ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> --------- Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net> Co-authored-by: Lei Zhang <zhang.huanning@hotmail.com>	2023-02-07 11:51:06 -08:00
Yufeng Li	8de885fdb1	reduce cuda library binary size (#14555 ) ### Description Reduce the cuda library size by: 1. refactoring beam_search_top_k to reduce template instantiation. It saves ~56MB 2. opt out TopK for type uint*, int8_t and int16_t. It saves ~50MB. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2023-02-07 09:03:14 -08:00
Patrice Vignola	b8fb9320ac	[DML EP] Fix ScatterElements registration (#14560 )	2023-02-06 10:01:02 -08:00
Tianlei Wu	a6c5ba0185	Stable Diffusion CUDA Optimizations (#14428 ) ### Description Add stable diffusion CUDA kernel optimizations. The following are included: (1) GroupNorm operator. This kernel is from TensorRT 8.5. (2) BiasSplitGelu operator. This kernel is modified from SplitGelu of TensorRT 8.5. We added bias to the SplitGelu. (3) NhwcConv operator. This adds support of NHWC format (ONNX Conv operator uses NCHW format). (4) Update MultiHeadAttention (packed kv and no bias) for cross attention. This could avoid transpose of kv for TRT fused cross attention kernel. (5) Optimization and benchmark script Not included: (1) Script to convert Conv to NhwcConv in onnx graph. (2) Update symbolic shape inference for NhwcConv. (3) Add SeqLen2Spatial operator (4) Documents Limitations: GroupNorm, BiasSplitGelu and NhwcConv kernels are implemented based on stable diffusion usage. They might not be applicable to any input size or dimensions. For example, BiasSplitGelu requires hidden size to be 2560 \| 5120 \| 10240, and NhwcConv assumes 4D input/weight. There is a minor increase in binary size. For SM=75 only, python package wheel size adds (33757K - 33640K) = 117 KB. It is possible to move NHWC from template parameter to constructor to reduce binary size (with slight cost of performance). Note: for RTX 4090/4080/4070 Ti, build with CUDA 11.8 and the latest cuDNN to get best performance.	2023-02-02 23:43:51 -08:00
Numfor Tiapo	3cc81460e0	Register ScatterElements-16 (#14425 ) This PR registers ScatterElements-16 to the DML EP - CPU fallback is added if the reduction attribute is in use, as this is not yet supported by DML. --------- Co-authored-by: Numfor Mbiziwo-Tiapo <numform@microsoft.com>	2023-02-01 09:46:37 -08:00
liqun Fu	2b1a59f01a	cpu support of LpPool(18) (#14205 ) Signed-off-by: Liqun Fu <liqfu@microsoft.com> ### Description To support LpPool (18) ### Motivation and Context for Ort 1.14 release Signed-off-by: Liqun Fu <liqfu@microsoft.com>	2023-01-25 23:14:56 -08:00
Thiago Crepaldi	32c05fcdd1	Add Col2Im CPU op (#12311 ) Description This PR implements N-dimensional Col2Im as a contrib CPU Op as specified by ONNX's https://github.com/onnx/onnx/pull/3948 Motivation and Context - Col2Im enables models such as: - [SS-DCNet](https://github.com/xhp-hust-2018-2011/SS-DCNet) - [DSTT](https://github.com/ruiliu-ai/DSTT) - It also serves to document the ORT's obscure `math::Col2ImNd` utility Signed-off-by: Liqun Fu <liqfu@microsoft.com> Co-authored-by: Liqun Fu <liqfu@microsoft.com>	2023-01-25 12:23:00 -08:00
liqun Fu	7b6d880b28	cpu to support bitwise ops (#14197 )	2023-01-23 16:42:18 -08:00
liqun Fu	05915d8393	support Pad(18) (#14219 )	2023-01-23 12:14:35 -08:00
liqun Fu	5d6a049141	support ScatterND(18) and ScatterElement(18) (#14224 )	2023-01-19 13:54:20 -08:00
Ye Wang	c9a53c9255	Some changes to Sampling Op (#14218 ) ### Description <!-- Describe your changes. --> 1. add an optional input to pass in seed 2. two UTs. one for top_p=0.5, another for top_p=0.01(create greedy search result, in convert_generation.py) 3. fix a bug in cpu kernel ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2023-01-12 14:15:26 -08:00
Numfor Tiapo	dee36f8ade	DML EP Register ScatterND-16 (#14240 ) This PR registers ScatterND-16 to the DML EP - CPU fallback is added if the reduction attribute is in use, as this is not yet supported by DML. Co-authored-by: Numfor Mbiziwo-Tiapo <numform@microsoft.com>	2023-01-12 10:39:25 -08:00
Scott McKay	dd2df460b3	Split(18) (#14015 ) ### Description <!-- Describe your changes. --> Opset 18 Split changes. Adds ability to specify num_outputs which also allows uneven splitting. https://github.com/onnx/onnx/releases/tag/v1.13.0 ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Support ONNX opset 18.	2023-01-12 08:14:10 +10:00
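The uneven splitting mentioned in the commit above can be sketched as follows. The helper name `split_sizes` is hypothetical, and the rule used here (every chunk has size ceil(dim / num_outputs) except a smaller last chunk) is my reading of the opset 18 Split description, not code taken from ONNX Runtime.

```python
def split_sizes(dim, num_outputs):
    """Sizes of the output chunks when only num_outputs is given.

    Assumption: all chunks use the ceiling of dim / num_outputs and the
    last chunk absorbs the shortfall when dim is not evenly divisible.
    """
    chunk = -(-dim // num_outputs)  # integer ceiling division
    sizes = [chunk] * (num_outputs - 1)
    sizes.append(dim - chunk * (num_outputs - 1))
    return sizes


print(split_sizes(7, 3))  # [3, 3, 1]
print(split_sizes(6, 3))  # [2, 2, 2]
```

The sketch does not validate inputs; combinations where the last chunk would come out empty or negative are left out of scope here.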
Ye Wang	a01bf8dbb1	rename CrossAttention to MultiHeadAttention (#14201 ) ### Description <!-- Describe your changes. --> rename the CrossAttention to MultiheadAttention since this op can also be used as self attention ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2023-01-10 10:18:39 -08:00
Numfor Tiapo	f4ea781b81	DML EP Register Identity-16 (#14053 ) This PR Registers Identity-16 to the DML EP. ONNX Backend tests and optional type tests were skipped pending future additions. Co-authored-by: Numfor Mbiziwo-Tiapo <numform@microsoft.com>	2023-01-10 09:16:09 -08:00
liqun Fu	1be36913cc	to work with onnx 1.13 rc, implement ver 18 reduce and optional ops, … (#13765 )	2023-01-09 10:26:16 -08:00
Ye Wang	5eac2c1f41	relational attention bias cuda op (#14149 ) ### Description This cuda op implements the compute_bias() method in T5 Attention including the permutation. note: 1. bias_table needs to be saved in col-major. be careful when implementing fusion script 2. second input(sequence length) is placed on cpu. (using Shape node's output should be good) 3. the first dimension of output is 1, so extra_add_qk in attention should support broadcasting 4. compute_bias() only used in self-attn in t5 TODO: docs change will be applied later ### Motivation and Context It's part of the process of optimizing t5 attention as well as t5 based generation model Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2023-01-06 17:32:58 -08:00
Tianlei Wu	2cacb24cb0	Add CrossAttention operator (#14146 ) Move separated Q, K and V (without input projection) from Attention to a new operator CrossAttention. The Attention operator is hard to maintain when we need support with and without input projection in one class. Add a new operator according to feedback. Some change might need in the future, but not in this PR: (1) bias could be optional (We will not proceed that route unless experiments show that fusing Add bias with MatMul instead of this op could improve performance). (2) support packed KV. There are two ways to support it: when key and value are same Tensor, they are packed; or we can make value as optional, and use packed mode when value is empty and the key has packed K/V. (3) support cached key and value, and other (like relative position bias), or more attention mask format. They can be added easily without breaking backward compatible. (4) ROCm/CPU implementation of this op.	2023-01-06 14:27:40 -08:00
Hariharan Seshadri	d0c5ffd5f7	Misc transformer fixes - 2 (#14156 ) ### Description 1. The graph pattern search introduced in https://github.com/microsoft/onnxruntime/pull/13914/ needs to be enhanced so that SkipLayerNormalization is supported 2. Fix fp32 parity for GPT-2 while using `SkipLayerNormalization` fusion. The optional output of SLN needs to also include the bias (if present) and the added output should be a sum of `input + skip + (bias)` ### Motivation and Context Fix some breaking tests	2023-01-06 07:27:10 -08:00
Ye Wang	ae148ebc05	T5 skip_layer_norm cuda op (#14093 ) ### Description T5 uses a layer_norm which only scales and doesn't shift, which is also known as Root Mean Square Layer Normalization. ORT already have the simplified_layer_norm which is the RMS layer_norm. This PR extends this T5 layer_norm with support of skip/bias and the residual output. This new op is named SkipSimplifiedLayerNorm and has similar interface as SkipLayerNorm but removes the beta as input ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2023-01-04 13:31:53 -08:00
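A scale-only (RMS) layer norm with skip, bias and a residual output, as described in the commit above, might look like this in plain Python for a single hidden vector. This is a minimal sketch, not the CUDA kernel: the function name, the argument order and the `epsilon` default are assumptions made for illustration.

```python
import math


def skip_simplified_layer_norm(x, skip, gamma, bias=None, epsilon=1e-6):
    """RMS layer norm over input + skip (+ bias), with no beta shift.

    Returns (normalized, residual), where residual is the pre-norm sum
    that later layers can reuse. epsilon is an assumed default.
    """
    if bias is None:
        bias = [0.0] * len(x)
    residual = [a + b + c for a, b, c in zip(x, skip, bias)]
    mean_square = sum(h * h for h in residual) / len(residual)
    rms = math.sqrt(mean_square + epsilon)
    normalized = [g * h / rms for g, h in zip(gamma, residual)]
    return normalized, residual


out, residual = skip_simplified_layer_norm([1.0, 2.0], [2.0, 2.0], [1.0, 1.0])
print(residual)  # [3.0, 4.0]
print(out)
```

With `gamma` set to ones, the normalized vector has a mean square close to 1, which is the defining property of RMS normalization.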
Ye Wang	68518a1b72	Sampling op (#13426 ) ### Description <!-- Describe your changes. --> Sampling op for cpu and cuda support huggingface case and custom case ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2022-12-22 17:34:12 -08:00
Hariharan Seshadri	7ed8bd4f95	Support (Bias)SkipLayerNormalization fusion in GPT2 (#13988 )	2022-12-21 23:04:44 -08:00
Edward Chen	df8ff34f25	Update CUDA ArgMin/ArgMax op kernels to have end version 11 since opset 12+ is not supported yet. (#13983 ) ### Description Update CUDA ArgMin/ArgMax op kernels to have end version 11 since opset 12+ is not supported yet. With the way these kernels are currently registered, the documentation shows support for opset 11+. This is not accurate. ### Motivation and Context Fix #13781	2022-12-21 19:01:00 -05:00
Numfor Tiapo	8943d623a4	DML EP Register operators for Opset 16 (#14034 ) This PR Registers the following operators for opset 16 to the DML EP: - LeakyRelu-16 - PRelu-16 - Where-16 - GreaterOrEqual-16 - LessOrEqual-16 Identity-16 was not added in this PR due to pipeline failures Co-authored-by: Numfor Mbiziwo-Tiapo <numform@microsoft.com>	2022-12-21 09:05:12 -08:00
Zhang Lei	fba09faf5b	Implement reuse past and present tensor in Attention Ops. (#13791 ) Implement reuse kv_cache past and present tensor in Attention Ops. Unit test for above feature. Utilize the reuse kv_cache for past and present tensor in Greedy Search. Correctness test for it. Co-authored-by: Zhang Lei <phill.zhang@gmail.com>	2022-12-18 10:03:53 -08:00
Jakub Bachurski	3b17ab7c65	Add float64 kernels for Floor, Ceil, IsNaN (#13906 ) ### Description This PR adds support for `float64` kernels in the latest versions of operators: Floor, Ceil and IsNaN. ### Motivation and Context The lack of these kernels is non-trivial to work around and can easily lead to performance losses when a workaround is attempted. When equivalence with an existing implementation is required, precision is easily lost when casting to `float32` instead. IsNaN is common when cleaning up data in an ML pipeline. Floor and Ceil have uses for discretising values and single-precision floats are insufficient to round well when values get larger than a few million. According to my measurement this only increases the binary size by a few kilobytes (on the Python wheel of RelWithDebInfo). Closes #13673 (Round already has float64 support) Partially solves #8791 (Looks like there are parallel issues/PRs open for Split, but it is also hard to work around and hence useful) Signed-off-by: jbachurski <kbachurski@gmail.com>	2022-12-14 14:57:14 -08:00
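The precision argument in the commit above can be checked with the standard library alone. The sketch below rounds a Python float (a float64) to float32 with `struct` and compares `ceil` on both; the helper name `to_float32` and the sample value are chosen for illustration only.

```python
import math
import struct


def to_float32(value):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", value))[0]


# Just above 2**23, neighbouring float32 values are 1.0 apart, so the
# fractional part cannot be represented.
x = 8388608.25

print(math.ceil(x))              # 8388609
print(math.ceil(to_float32(x)))  # 8388608
```

Casting to float32 before calling Ceil silently drops the fraction here, so the two results differ by one.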
Patrice Vignola	8246ff015a	[DML EP] Add EmbedLayerNorm (#13868 ) ### Description Add EmbedLayerNorm to the DML EP	2022-12-13 13:23:53 -08:00
Jian Chen	d7d932c1c2	Cjian/where python operator (#12795 ) Description: This PR will enable the python tool to run QWhere and QDQWhere operation Limitation: s8s8 Where is still not supported.	2022-12-12 13:27:47 -08:00
Edward Chen	8cfbc4fe91	Add support for other data types to Split CPU kernel. (#13900 ) Split copies data - we can add support for all data types without too much binary size impact by using data type size-based implementations. The DispatchStridedCopy() function used here does this.	2022-12-12 09:29:15 -08:00
Patrice Vignola	96d8d2c278	[DML EP] Add SkipLayerNormalization (#13849 ) ### Description Add SkipLayerNormalization for the DML EP	2022-12-07 01:49:14 -08:00
Patrice Vignola	b53bbe7370	[DML EP] Add an implementation for NonZero (#13768 ) ### Description Add the NonZero op for DML ### Motivation and Context NonZero is used in a few transformer models, so having a DML implementation will stop large tensors from being transferred to the CPU and back to the GPU	2022-12-02 18:39:21 -08:00
Patrice Vignola	a0b470bc35	[DML EP] Add mixed datatype support for DML's LayerNorm contrib op (#13734 ) ### Description Add mixed datatype support for DML's LayerNorm contrib op. ### Motivation and Context The fusion logic removes casts around LayerNorm in the graph because the contrib version of the op supports mixed datatypes. Scale, Bias and Output's datatypes must match, but input's datatype can be different.	2022-12-01 14:08:18 -08:00
Patrice Vignola	e9b92fdf33	[DML EP] Add DML implementation for BiasGelu (#13795 ) ### Description Add DML implementation for BiasGelu	2022-12-01 09:23:19 -08:00
Tianlei Wu	8b0e0f4927	Add RemovePadding and RestorePadding for BERT model (#13701 ) Add two operators RemovePadding and RestorePadding based on the idea of effective transformer (https://github.com/bytedance/effective_transformer) to improve large batch size inference for BERT model.	2022-11-22 10:00:23 -08:00
Patrice Vignola	3482180ec2	DML EP add a registration for Shape and Size (#13442 ) ### Description Add a DML registration for Shape to avoid copying back to the CPU just to get the shape of a GPU tensor. ### Motivation and Context When using free dimensions, many Transformers models extensively use the `Shape` operator. This causes hundreds of GPU->CPU copy that should be completely avoidable. Note that this change also uses the same heuristics as other providers (e.g. CUDA) to force some tensors on the CPU in certain situations. Co-authored-by: Patrice Vignola <pavignol@microsoft.com>	2022-11-08 19:29:37 -08:00
Vincent Wang	8b0669bf63	QuickGelu Fusion (#12417 ) Some models have QuickGelu(x)=x*sigmoid(1.702x), which has 3 Ops for forward and 5 Ops for backward. The PR is to fuse this to a single Op named QuickGelu and its gradient QuickGeluGrad. For CUDA, tested in V100 using input tensor with shape [64,128,2048] and float16 type: Before, FW takes 335us, BW takes 614us ![image](https://user-images.githubusercontent.com/11661208/182291335-15188709-ffe7-44d1-9d14-0b544cbe5e55.png) After, FW takes 115us, BW takes 139us, which is much faster. ![image](https://user-images.githubusercontent.com/11661208/182291502-f0b5161c-b95c-45fc-90f8-ad0c592d2433.png) For CPU kernel, using same shape and float type: Before, FW takes 10us, BW takes 49us Mul: 3480[µs] Sigmoid: 1996[µs] Mul: 4789[µs] Mul: 4642[µs] Mul: 4195[µs] SigmoidGrad: 18328[µs] Mul: 2988[µs] Sum: 18576[µs] After, FW takes 4us, BW takes 5us, which is also much faster. QuickGelu: 3939[µs] QuickGeluGrad: 5089[µs] Co-authored-by: Vincent Wang <weicwang@microsoft.com>	2022-10-28 18:12:07 +08:00
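The formula QuickGelu(x) = x * sigmoid(1.702x) from the commit above, together with its derivative, can be written out in a few lines. This is a scalar reference sketch, not the fused kernel; the function names are hypothetical and the gradient is derived here from the formula by the product rule rather than copied from the source.

```python
import math

ALPHA = 1.702  # constant from the commit message


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def quick_gelu(x):
    """Forward pass: x * sigmoid(ALPHA * x)."""
    return x * sigmoid(ALPHA * x)


def quick_gelu_grad(x, dy=1.0):
    """Backward pass: dy * d/dx [x * sigmoid(ALPHA * x)]."""
    s = sigmoid(ALPHA * x)
    return dy * (s + ALPHA * x * s * (1.0 - s))


# Compare the analytic gradient with a central finite difference.
x, h = 0.7, 1e-6
numeric = (quick_gelu(x + h) - quick_gelu(x - h)) / (2 * h)
print(quick_gelu(x), quick_gelu_grad(x), numeric)
```

Written as separate Mul and Sigmoid steps, the forward pass matches the three-op pattern the commit says is fused into one.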
Changming Sun	07271b6c8a	Update docs/OperatorKernels.md (#13485 )	2022-10-27 20:11:49 -07:00
Scott McKay	ab71c4bbc0	Document generation CI is broken (#13308 ) ### Description <!-- Describe your changes. --> Fix document generation CI. It's not currently updating the docs as we're skipping the tests, which is the invocation of build.py that would have generated the documentation. Setup specific task to generate documentation for greater clarity. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Operator kernel documentation is not getting updated and is now out of date.	2022-10-28 07:20:48 +10:00
Tianlei Wu	7aafd86229	Update Attention operator to support separated Q/K/V inputs (#13410 ) ### Description Allow separated Q, K and V inputs to support cross attention: * Q: [batch_size, sequence_length, hidden_size] * K: [batch_size, kv_sequence_length, hidden_size] * V: [batch_size, kv_sequence_length, v_hidden_size] * Output: [batch_size, sequence_length, v_hidden_size] To use separated Q/K/V inputs, the input tensor is for query, and two optional inputs are added for key and value. Weights for input projection is not included for now, so the MatMul of input projection shall be done out of Attention operator, but Add bias is included for performance consideration.	2022-10-25 11:51:06 -07:00
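The shape contract listed in the commit above can be expressed as a small checker. The function name `attention_output_shape` is hypothetical; the constraints it enforces are the ones implied by the shapes in the commit message (shared batch size, Q and K sharing hidden_size, K and V sharing kv_sequence_length).

```python
def attention_output_shape(q_shape, k_shape, v_shape):
    """Validate separated Q/K/V shapes and return the output shape.

    Q: (batch_size, sequence_length, hidden_size)
    K: (batch_size, kv_sequence_length, hidden_size)
    V: (batch_size, kv_sequence_length, v_hidden_size)
    """
    batch, seq_len, hidden = q_shape
    k_batch, kv_len, k_hidden = k_shape
    v_batch, v_len, v_hidden = v_shape
    if not (batch == k_batch == v_batch):
        raise ValueError("Q, K and V must share batch_size")
    if k_hidden != hidden:
        raise ValueError("K must share hidden_size with Q")
    if v_len != kv_len:
        raise ValueError("K and V must share kv_sequence_length")
    return (batch, seq_len, v_hidden)


# Cross attention: 4 query positions attend over 10 key/value positions.
print(attention_output_shape((2, 4, 64), (2, 10, 64), (2, 10, 32)))  # (2, 4, 32)
```

The output keeps the query's sequence length and the value's hidden size, which is why the key/value sequence length is free to differ from the query's.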
Ye Wang	928c9889a3	A few fixes for generative model ops (#13363 ) ### Description <!-- Describe your changes. --> Fix a bug in GreedySearch Op when batch > 1 Support custom attention mask in GreedySearch and BeamSearch with GPT2 ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->	2022-10-21 15:00:18 -07:00
Edward Chen	454f77cd94	Update kernel matching logic: decouple from op schemas and remove kernel def hashes (#12791 ) # Motivation Currently, ORT minimal builds use kernel def hashes to map from nodes to kernels to execute when loading the model. As the kernel def hashes must be known ahead of time, this works for statically registered kernels. This works well for the CPU EP. For this approach to work, the kernel def hashes must also be known at ORT format model conversion time, which means the EP with statically registered kernels must also be enabled then. This is not an issue for the always-available CPU EP. However, we do not want to require that any EP which statically registers kernels is always available too. Consequently, we explore another approach to match nodes to kernels that does not rely on kernel def hashes. An added benefit of this is the possibility of moving away from kernel def hashes completely, which would eliminate the maintenance burden of keeping the hashes stable. # Approach In a full build, ORT uses some information from the ONNX op schema to match a node to a kernel. We want to avoid including the ONNX op schema in a minimal build to reduce binary size. Essentially, we take the necessary information from the ONNX op schema and make it available in a minimal build. We decouple the ONNX op schema from the kernel matching logic. The kernel matching logic instead relies on per-op information which can either be obtained from the ONNX op schema or another source. This per-op information must be available in a minimal build when there are no ONNX op schemas. We put it in the ORT format model. Existing uses of kernel def hashes to look up kernels are replaced with the updated kernel matching logic. We no longer store kernel def hashes in the ORT format model’s session state and runtime optimization representations. We no longer keep the logic to generate and ensure stability of kernel def hashes.	2022-09-20 14:24:59 -07:00
Dwayne Robinson	8e4eb24648	Update operator kernel table to include DML operators (#12887 ) * Fix bug in pybind get_all_operator_schema due to premature reference dropping * Add updated operator kernels markdown table * Update build.py to include documentation generation for DML operators too * Update GPU pipeline to include DML in the build so operators can be generated. * Use a separate pipeline stage, feedback from Changming and Scott * Appease annoying Python linter * Add onnxruntime_BUILD_UNIT_TESTS=OFF and remove stale --use_dml in cuda stage	2022-09-09 10:21:25 -07:00
Hariharan Seshadri	ad69aac491	Introduce ordered quantization ops for the CUDA EP [1/n] (#12582 ) Initial core small set for the ordered quantization ops for cuda EP.	2022-09-07 11:58:15 -07:00
Yulong Wang	c144acc534	Replace 'master' branch ref to 'main' in the code (#12547 )	2022-08-22 10:48:12 -07:00
Cheng	64e991a9fc	[Qlinearsoftmax] contrib cpu (#12177 ) * [Qlinearsoftmax] contrib cpu * int8 implementation * contrib operator md * qdq transformer test * new attribute: opset * doc * quantized tool * remove template to reduce binary size * doc of contrib operators * enforce x_shape is valid * fix reduce_size if input-shape is dynamic * add UT * register one op for reducing binary size * kernel hash update * docs/ContribOperators.md	2022-08-10 10:52:02 +08:00
Vincent Wang	cfa09d16d9	[CUDA] Mod Op Kernel (#12499 ) * mod for cuda and rocm * fix bfloat16 ut * change bf16 ut number * fix opset version * fix op kernel doc	2022-08-09 13:05:40 +08:00
Ye Wang	b622e5fa9b	Support vocab_mask/prefix_vocab_mask/no_repeat_number in greedysearch op (#12327 ) * support more inputs for greedy search * fix docs * refactor test * lint * review comments	2022-08-03 10:10:08 -07:00
Ye Wang	89ac61f4d4	support gpt2 model with greedy search (#12068 ) * greedy search gpt2 cpu checkin * add cuda support * add test * provider * update * fix some bugs * refactor impl class * refactor test * remove unused func * refactor parameters class * simplify padding * fix lint warnings * python format * Revert "python format" This reverts commit f25fe1017fa33d960b2418ebbb5dba6a4bd043cf. * python format * fix pipelines * fix pipeline * move bufferallocater to generate_impl_base * review comments(alignment, filename/namespace change) * rebase2 * python reformat * reformat * fix rocm build * review comment * review comments * review comments * fix a bug * rebase test files * python format * format import order * review comments * fix build	2022-07-22 15:45:16 -07:00
Gary Miguel	dc5d6b9515	register signal ops for opset 17 (#11778 ) * Register signal ops for op set 17 Note code is mostly being moved, not added. These ops were previously only registered as Microsoft contrib ops and only built if `BUILD_MS_EXPERIMENTAL_OPS=1`. They've been added to the ai.onnx standard op set in version 17. Main components of this change: * Move the kernels from the contrib_ops directory to the core directory. * Add function bodies for ms experimental ops. This will allow old models that use the contrib ops to continue to function. All the function bodies consist of a single op (the new standard op), so performance overhead should be minimal. Minor clean-up also in this change: * De-duplicate get_scalar_value_from_tensor: put it in a new utils.h. * Fix some bugs that caused compilation errors with the experimental ops. Tested with `build.sh --ms_experimental` * Fix some spelling errors and lint violations. * Replace a couple of switch statements with `MLTypeCallDispatcher`. * Use `InlineVector` instead of `std::vector`. Unblocks https://github.com/microsoft/onnxruntime/issues/11640	2022-06-27 10:26:55 +10:00
Gary Miguel	4bf22e2a40	Update ONNX to 1.12 (#11924 ) Follow-ups that need to happen after this and before the next ORT release: * Support SequenceMap with https://github.com/microsoft/onnxruntime/pull/11731 * Support signal ops with https://github.com/microsoft/onnxruntime/pull/11778 Follow-ups that need to happen after this but don't necessarily need to happen before the release: * Implement LayerNormalization kernel for opset version 17: https://github.com/microsoft/onnxruntime/issues/11916 Fixes #11640	2022-06-21 17:19:52 -07:00
Ye Wang	859ef277a0	apply zcode changes to the beam search op (#11880 ) * apply zcode changes to the beam search op * fix pipeline failure * add doc * workaround for C# * update * update * use name zcode * review comment * review comments * fix cpplint * review comments	2022-06-20 18:39:07 -07:00
Tianlei Wu	6ee2c1b5fc	Remove temperature input from BeamSearch operator (#11896 ) * remove temperature input * update index of remaining inputs	2022-06-20 09:50:45 -07:00
Vincent Wang	02724c54ff	[CUDA] Implement BitmaskDropout, BitmaskBiasDropout and BitmaskDropoutGrad (#11534 ) * Implement BitmaskDropout and associated unit tests. * Implement BitmaskDropoutGrad and associated unit tests. * Implement Dropout -> BitmaskDropout rewrite rule and associated unit tests. * Implement (Dropout,DropoutGrad) -> (BitmaskDropout,BitmaskDropoutGrad) rewrite rule. This commit does not yet include unit tests for this rewrite rule. This commit also introduces improved documentation for all changes which will be grouped into this PR. * bitmask dropout * fix win build * bugfix for rocm * bugfix * fix code format * fix ut * fix build break * fix ut in win * resolve comments * fix ut in trt * resolve comments * fix rocm build error * fix typo Co-authored-by: Aidan Beggs <aidanbeggs@microsoft.com>	2022-05-27 17:24:47 +08:00
Xavier Dupré	c37d2728bf	Implement TreeEnsemble for opset(ai.onnx.ml)==3 (#10821 ) * Implement TreeEnsemble for opset(ai.onnx.ml)==3 * use of InlineVector * refactoring * improve attributes retrieval * avoid creating a temporary buffer * modifies onnx.ml.cpu.json * use unordered_map * update docs/OperatorKernels.md * address PR comments (TH -> ThresholdType, ORT_RETURN...) * add a python unit test to load a TreeEnsembleRegressor following ai.onnx.ml==3 specifications	2022-03-30 12:53:12 +02:00
Vincent Wang	6a6840d5c6	Fuse LayerNormalization for Apex O2 (#10233 )	2022-03-29 21:22:04 +08:00
pengwa	89ef987ab1	Improve NonZero on CUDA/ROCM (#10307 ) * improve NonZero * fix megatron_fp16 optimizer, fix the doc * multi_tensor_applier * resolve comment * fix building warning * fix build error when enabling training and use tensorrt	2022-03-25 07:35:45 +08:00
Hariharan Seshadri	a9d9c6b486	Register CPU, CUDA and ROCM opset-16 kernels for some operators (#10643 )	2022-03-08 09:18:39 -08:00
liqun Fu	da885a72e8	update with onnx 1.11 release (#10441 )	2022-03-07 21:10:55 -08:00
Tianlei Wu	36c3271546	BeamSearch op cuda (#10556 ) Add BeamSearch cuda implementation with support of fp16 GPT-2 subgraph	2022-02-25 13:08:55 -08:00
Scott McKay	df841ee87d	Fix incorrect type constraint registration for operator kernels. (#10489 ) * Fix incorrect type constraint registration for RoiAlign. This led to the input type not actually being checked when matching a kernel as the invalid constraint name is treated as a missing optional input. * fix missing dependency for the unit test exe. Whilst it doesn't link against the CUDA providers lib, without the dependency VS doesn't know it needs to rebuild the library if there are changes. * Add check for invalid type constraints. * Fix invalid registrations for other kernels. * Add hash replacement logic to provide backwards compatibility in ORT format models when the registration is fixed. * Add tests	2022-02-18 16:55:32 +10:00
Viswanath Boga	ad9d2e2e89	Prefix match in first iteration of beam search OP (#10231 ) * Add BeamSearch op schema * Add ONNX conversion for beams search * remove attention_mask and change input order * add option to run baseline * add check data type NULL * applies VerifyNodeAndOpMatch to subgraph * update input_ids shape * Add node name for Cast node * expose API for topk * parse parameters * Add beam search scorer * output results * fix typo * use c++ template and format python * fix build pipeline errors * symbolic shape infer of input onnx * output scores * add kernel def hash * Handle vocab_mask; move CheckSubgraph * undo insert_cast_transformer.cc and fusion_utils.py * fix typo * fix merge * update doc * add repetition penalty * refactoring: add GptSubgraph class * move BeamSearchState from .h to .cc file * adjust logits processor order * add batch generation example * fix repetition penalty for dup words in sequence * Add test * Add no repeat ngram processor * refactoring: move logits processor to classes * fix build warning * show latency * use allocator in beam state * use allocator in sequences * fix build error * move next_positions to beam state * Changes for prefix matching * removing debugs * removing more debugs * clean up * clean up * cpu doc updated * Updated docs * updated prefix_vocab_mask dimension in convert script * changes to support bxs prefix_vocab_mask in beamsearchop kernel * doc update * OperatorKernels.md updated * matching docs from artifacts * minor change in logits processor * Addressing comments * Updated the prefix vocab mask usage properly Co-authored-by: Tianlei Wu <tlwu@microsoft.com>	2022-02-03 00:14:39 +05:30
Yufeng Li	1aa0789691	add qdq support for QGemm (#10414 ) * add qgemm in quantization tool * add qdq support for QGemm * fix build break * fix OperatorKernels.md	2022-02-02 10:35:29 -08:00
Yi-Hong Lyu	e27f2dc932	int8/uint8 support for Argmax for opset 1, 11, 12 (#10296 )	2022-01-18 14:37:34 -08:00
Vincent Wang	44e2db9397	CUDA BFloat16 Refactor (#10085 )	2022-01-14 19:38:56 +08:00
Yi-Hong Lyu	499f1d5fd7	Quantization of Argmax (#10213 ) This patch includes: * int8/uint8 support for Argmax * Quantization tool support for Argmax	2022-01-12 14:12:56 -08:00
Yufeng Li	12ee2e942f	add int8_t for Resize (#10067 ) As we support quantization for format s8s8, we need Resize to support int8_t.	2021-12-17 15:36:09 -08:00
Tianlei Wu	ef36488df0	Add BeamSearch operator for GPT-2 decoding (#9680 ) * Add BeamSearch operator and CPU implementation * Add ONNX conversion script	2021-12-16 16:08:05 -08:00
Yufeng Li	ffdafb2012	add fallback of s8s8 support on x64 (#9995 ) * add fallback of s8s8 support on x64	2021-12-10 11:33:19 -08:00
Yufeng Li	a0afd7303d	add int8_t support for pool operators (#9852 ) * add int8_t support for pool operators	2021-11-29 18:43:43 -08:00
Ye Wang	6856619b18	Decoder Attention CUDA Op (#9792 ) * add kernel interface * register kernel * add self/cross qkv projection without cache * add LaunchTransQkv2 for (S,B,X,N,H) -> (X,B,N,S,H) * refactor ConcatPastToPresent * DecoderQkvToContext interface * q,k,v buffer and cache as output * qk, pv and transctx * fix compiler error on linux machine * key_padding_mask * add test_parity file. However not runnable * add partial unittest * made partial attributes to inputs * --gen_doc * change kernel interface, add more tests * more parity tests * fix test * fix typo * transpose optimizer has bug. remove it temporarily * add input shape checks * add type/shape inference * fix cache shape check * fix rocm build failure * fix rocm build error * review comments * review comments	2021-11-19 19:25:36 -08:00
Vincent Wang	f390347c11	Add CUDA Kernels of RandomNormal[Like], RandomUniform[Like] (#9761 )	2021-11-19 08:18:34 +08:00
satyajandhyala	229c9a4e1c	Added Trilu CUDA kernel. (#9633 ) * Added Trilu CUDA kernel. * Added TriluGrad. * Added a training testcase for Trilu. * Added Trilu gradient checker test.	2021-11-09 11:20:17 -08:00
Hariharan Seshadri	bbeceb7541	Support optional type in ORT (#8339 )	2021-11-04 15:01:42 -07:00
Viswanath Boga	85874bb315	embed layer fusion gpt2 (#9336 ) * Changes to fuse embed layer for gpt2, kernel changes pending * verified add output and regular add match * Test added for additional output embedlayernorm, working on CUDA * Test passing on CPU * updated convert_to_onnx tool to check parity correctly * removed some debugs * couple of TODO left as in optimizer.py * removed changes to optimizer.py * fixing build * fixing build * updated order of initialization * added a test case for float16 * updating the docs * updating tests failing due to embed layer fusion * update unit tests * updating CUDA documentation in operatorkernels.md * addressing comments * OperatorKernels.md updated with CUDA * adding TODO to qembed_layer * minor edit * updated docs * addressing comments * adding position ids to embed layer gpt2 * updating fused gpt2 model * added extra test * remove comments * addressing comments * contrib_defs.cc updated * all tests passing * fixing a typo * minor edit * trigger build * qembedlayernorm checkinputs updated * fixing build error * fixing build error * fixing build error	2021-10-28 11:06:26 -07:00
Bowen Bao	e983f37121	Bifurcation detector for aggressive decoding (#9432 ) ``` Component for aggressive decoding. Find the bifurcation index of predicted tokens, between source tokens, starting from previous suffix match index, and predicted tokens. Concat predicted tokens, starting from bifurcation index, to the back of current tokens. This forms the output tokens. Detect suffix match index in source tokens, between source tokens and output tokens. Detection is based on finding the appearances of last n-gram in output tokens in source tokens. A match is considered found if source tokens contain a single matching n-gram. Return the index of the start of the n-gram in source tokens. No match is found if source tokens contain multiple or zero matching n-grams. Return -1. ```	2021-10-19 19:53:56 -07:00
ashbhandare	35c2102cfa	Fixes for GatherND, Multinomial (#9143 ) * register gathernd kernel, aten multinomial * fix CI, add test * review comments	2021-10-05 14:51:58 -07:00
ytaous	0193490cbf	ReduceMin - add int64 cuda kernel support for opset12/13 (#8966 ) * ReduceMin - int64 support * fix doc Co-authored-by: Ethan Tao <ettao@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>	2021-09-07 17:01:26 -07:00
Hariharan Seshadri	cee79526fd	Add opset 15 kernels for Pow, BatchNorm, and Shape (#8442 )	2021-08-25 12:04:20 -07:00
Hariharan Seshadri	17b0664e34	Optimize sequence type usage on CUDA [2/n] (#8720 )	2021-08-24 10:40:28 -07:00


171 commits
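
The suffix-match rule described in commit e983f37121 (bifurcation detector for aggressive decoding) can be illustrated with a short Python sketch. This is not the ONNX Runtime kernel: the function name `find_suffix_match` and the fixed `ngram_size` parameter are assumptions made for illustration, and the sketch follows only the wording of the commit message (a match counts only when the last n-gram of the output tokens appears exactly once in the source tokens; the start index of that n-gram is returned, otherwise -1).

```python
from typing import Sequence


def find_suffix_match(
    src_tokens: Sequence[int],
    out_tokens: Sequence[int],
    ngram_size: int = 3,  # hypothetical parameter; fixed n-gram length for this sketch
) -> int:
    """Return the start index in src_tokens of the last n-gram of out_tokens.

    Follows the commit message: a match is found only when the n-gram
    occurs exactly once in src_tokens; zero or multiple occurrences give -1.
    """
    if ngram_size <= 0 or len(out_tokens) < ngram_size:
        return -1

    # The last n-gram of the output tokens is the pattern to look for.
    ngram = list(out_tokens[-ngram_size:])

    # Collect every start position in the source where the n-gram appears.
    matches = [
        i
        for i in range(len(src_tokens) - ngram_size + 1)
        if list(src_tokens[i : i + ngram_size]) == ngram
    ]

    # Only a single, unambiguous occurrence counts as a match.
    return matches[0] if len(matches) == 1 else -1
```

For example, with source tokens `[1, 2, 3, 4, 5]`, output tokens `[9, 3, 4]` and `ngram_size=2`, the n-gram `[3, 4]` occurs once in the source, starting at index 2. With source tokens `[1, 2, 1, 2]` the n-gram `[1, 2]` occurs twice, so the result is -1.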