onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-06-21 02:18:09 +00:00

Author	SHA1	Message	Date
Nat Kershaw (MSFT)	9219615471	Fix python AP docs generation (#15760 ) Docs are failing on the operator generation step. Remove this temporarily so that we can publish.	2023-05-01 18:31:59 -07:00
liqun Fu	62fc6ed5a8	[Feature Request] Support Resize opset 19 (#15633 )	2023-05-01 10:49:17 -07:00
Linnea May	2c3697be00	User/linneamay/reduce 18 (#15701 ) ### Description Add registration for DML reduce functions in opset 18. --------- Co-authored-by: Linnea May <linneamay@microsoft.com>	2023-04-27 20:32:11 -07:00
kunal-vaishnavi	39d6d7050d	Change EmbedLayerNormalization mask index output to optional (#15526 ) ### Description This PR changes an EmbedLayerNormalization node's mask index output to be an optional output if a mask input is not provided. ### Motivation and Context The documentation for EmbedLayerNormalization states ``` The last input mask is optional. If mask is provided, mask index (that is position of first 0 in mask, or number of words) will be calculated. ``` However, if the mask input is not provided, the mask index output is still calculated and required.	2023-04-27 16:32:42 -07:00
Justin Chu	76ddc92fbd	Enable RUFF as a formatter (#15699 ) ### Description RUFF can now format since lintrunner-adapters v0.8. Removed the RUFF-FIX linter. ### Motivation and Context Better engineering	2023-04-26 14:04:07 -07:00
sfatimar	ebaafac3f5	Openvino ep ort 5.0 (#15626 ) ### Description The PR adds VPU support to the OpenVINO Execution Provider, along with: bug fixes for GPU and CPU; changes to the OpenVINO backend in the Serialized Model API for faster first-inference latency; deprecation of HDDL-VADM and MYRIAD (code removed); support for OpenVINO 2023.0; dynamic shapes support for iGPU. ### Motivation and Context - VPU is an upcoming hardware that can provide AI Acceleration for Client Systems through OpenVINO --------- Signed-off-by: MaajidKhan <n.maajid.khan@intel.com> Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com> Co-authored-by: MaajidKhan <n.maajid.khan@intel.com> Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>	2023-04-25 20:59:42 -07:00
Baiju Meswani	5885abfb35	Training Documentation (#15612 )	2023-04-25 11:44:12 -07:00
Baiju Meswani	11b0a18de6	Add support for cuda 11.8 and python 3.11 for training (#15548 )	2023-04-20 12:56:45 -07:00
kunal-vaishnavi	901c2bc384	Whisper Model Optimization (#15473 ) ### Description This PR contains fusion-level and kernel-level optimizations for [OpenAI's Whisper](https://github.com/openai/whisper). Some of the added optimizations include: - Pruning of duplicate/unnecessary inputs and outputs - Fusion support for Whisper models with or without these inputs/outputs (e.g. with these inputs/outputs if exporting with an older official Optimum version, without these inputs/outputs if exporting with Optimum from source) - Attention fusions - For Whisper's encoder and decoder - Modified symbolic shape inference for present output when no past input exists (for decoder) - Multi-head attention fusions - For Whisper's decoder and decoder with past - Packed MatMul for the 3 MatMuls excluded in multi-head attention fusion - Attention kernel changes - CPU: - Different Q and KV sequence lengths - Parallel memset for large sequence lengths - Convert broadcast add after MatMul of Q and K (add_qk) to element-wise add - Separate present key-value output into present key and present value (for multi-head attention spec) - CUDA: - Use memory efficient attention compute kernel with present state (for decoder) - Multi-head attention kernel changes - CPU: - Introduction of multi-head attention CPU kernel (previously did not exist) - Use AddBiasReshape instead of AddBiasTranspose when sequence length = 1 (for decoder with past) - Different Q, K, V input shapes - Pass past key and past value directly as key and value - CUDA: - Use memory efficient attention compute kernel with past and/or present state (for decoder with past) ### Usage To use the optimizations, run the ORT transformer optimizer script as follows: ``` $ cd onnxruntime/onnxruntime/python/tools/transformers/ $ python3 optimizer.py --input <filename>.onnx --output <filename>.onnx --model_type bart --num_heads <number of attention heads, depends on the size of the whisper model used> --hidden_size <attention hidden size, 
depends on the size of the whisper model used> --use_external_data_format --use_multi_head_attention ``` Once optimized, here's an example of how to run Whisper with [Hugging Face's Optimum](https://github.com/huggingface/optimum): ``` from transformers.onnx.utils import get_preprocessor from optimum.onnxruntime import ORTModelForSpeechSeq2Seq from optimum.pipelines import pipeline as ort_pipeline import whisper # Installed from OpenAI's repo - setup instructions at https://github.com/openai/whisper/ directory = './whisper_opt' # Where the optimized ONNX models are located model_name = 'openai/whisper-tiny' device = 'cpu' # Get pipeline processor = get_preprocessor(model_name) model = ORTModelForSpeechSeq2Seq.from_pretrained( directory, use_io_binding=(device == 'cuda'), provider='CPUExecutionProvider', ).to(device) pipe = ort_pipeline( "automatic-speech-recognition", model=model, tokenizer=processor.tokenizer, feature_extractor=processor.feature_extractor, device=(-1 if device == 'cpu' else 0), ) # Load audio file and run pipeline audio = whisper.load_audio('tests/jfk.flac') audio = whisper.pad_or_trim(audio) outputs = pipe([audio]) print(outputs) ``` Note: In order to use these changes with Optimum, it is recommended to use Optimum from source to have the following changes: - https://github.com/huggingface/optimum/pull/872 - https://github.com/huggingface/optimum/pull/920 ### Motivation and Context This PR helps the following issues: - https://github.com/microsoft/onnxruntime/issues/15100 - https://github.com/microsoft/onnxruntime/issues/15235 - https://github.com/huggingface/optimum/issues/869 (work in progress) This PR can be used with the other currently merged Whisper PRs: - https://github.com/microsoft/onnxruntime/pull/15247 - https://github.com/microsoft/onnxruntime/pull/15339 - https://github.com/microsoft/onnxruntime/pull/15362 - https://github.com/microsoft/onnxruntime/pull/15365 - https://github.com/microsoft/onnxruntime/pull/15427 This PR uses changes 
from the following merged PRs: - https://github.com/microsoft/onnxruntime/pull/14198 - https://github.com/microsoft/onnxruntime/pull/14146 - https://github.com/microsoft/onnxruntime/pull/14201 - https://github.com/microsoft/onnxruntime/pull/14928 (this introduced the new multi-head attention spec)	2023-04-18 17:13:54 -07:00
liqun Fu	919d8f2660	update with onnx main (#14929 )	2023-04-18 08:42:51 -07:00
Justin Chu	a36caba073	Bump ruff in CI (#15533 ) ### Description Bump ruff version in CI and fixed new lint errors. - This change enables the flake8-implicit-str-concat rules which helps detect unintended string concatenations: https://beta.ruff.rs/docs/rules/#flake8-implicit-str-concat-isc - Update gitignore to include common python files that we want to exclude. ### Motivation and Context Code quality	2023-04-17 10:11:44 -07:00
pengwa	516c8e95fa	Optimize SCE loss compute (#15401 ) ### Optimize SCE loss compute Compute optimization based on label data sparsity: - Insert ShrunkenGather before SCELoss node, to filter out invalid labels for compute. - Support ShrunkenGather upstream. - Added test for the above. - Added flag to enable label sparsity optimization with env var, by default disabled now. Will enable after comprehensive benchmarking later. - Extract common logic into test_optimizer_utils.h/cc from core/optimizer/compute_optimzier_test.cc, then the common functions can be shared by both core/optimizer/compute_optimzier_test.cc and orttraining/core/optimizer/compute_optimzier_test.cc - Extract common logic into shared_utils.h/cc: `GetONNXOpSetVersion` and `Create1DInitializerFromVector`	2023-04-13 13:02:12 +08:00
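The label-sparsity idea in the SCE loss commit above can be sketched in plain Python. This is only an illustration of filtering invalid labels before the loss, not the ORT graph transformation itself: the function names, the `IGNORE_INDEX` value of -100, and the list-based tensors are all assumptions made for the example.

```python
import math

# Assumed ignore_index value; hypothetical constant for this sketch.
IGNORE_INDEX = -100


def cross_entropy(logits, label):
    """Softmax cross-entropy for one row of logits and one class index."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]


def sce_loss_dense(logits_rows, labels):
    """Baseline: normalise every row, then mask out the ignored ones."""
    total, count = 0.0, 0
    for row, label in zip(logits_rows, labels):
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))  # paid for every row
        if label != IGNORE_INDEX:
            total += log_sum_exp - row[label]
            count += 1
    return total / count


def shrunken_gather(logits_rows, labels):
    """Keep only the rows whose label is valid (stand-in for the gather step)."""
    keep = [i for i, label in enumerate(labels) if label != IGNORE_INDEX]
    return [logits_rows[i] for i in keep], [labels[i] for i in keep]


def sce_loss_sparse(logits_rows, labels):
    """Filter first, so the loss only touches rows with valid labels."""
    rows, valid = shrunken_gather(logits_rows, labels)
    return sum(cross_entropy(r, l) for r, l in zip(rows, valid)) / len(valid)


logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3], [1.0, 1.0, 1.0], [-0.5, 2.5, 0.0]]
labels = [0, IGNORE_INDEX, 2, IGNORE_INDEX]
print(sce_loss_dense(logits, labels), sce_loss_sparse(logits, labels))
```

Both functions return the same mean loss; the sparse variant simply does the per-row work on two rows instead of four.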
Patrice Vignola	3be5bfe363	[DML EP] Add MatMul + SoftMax fusion (#15240 )	2023-04-11 08:31:04 -07:00
Patrice Vignola	7c927bb95c	[DML EP] Add BiasSplitGelu (#15197 )	2023-04-11 08:30:37 -07:00
Patrice Vignola	c5b6ee1a99	[DML EP] Add NhwcConv (#15194 )	2023-04-10 23:16:09 -07:00
Patrice Vignola	4a676b011a	[DML EP] Add BiasAdd (#15211 )	2023-04-10 14:46:33 -07:00
stevenlix	6d126f8996	Add FP16 support for Whisper model (#15427 ) Current ORT can only run inference for Whisper FP32 model. This PR adds FP16 support.	2023-04-08 21:36:10 -07:00
Chen Fu	8dce83a818	Fuse 'Add' operator into FP16 Conv (#15213 ) ### Description Adding 'Add' functionality to FP16 Conv operator. It takes a tensor that has the same shape of the output tensor, and add it to the result tensor. ### Motivation and Context Needed to run Resnet 50	2023-04-07 09:51:03 -07:00
Patrice Vignola	9191e04259	[DML EP] Add QuickGelu (#15220 )	2023-04-05 10:49:34 -07:00
Aditya Goel	a4e9a48345	Reduce operators support for int64 type (#15358 )	2023-04-05 09:19:43 -07:00
Aditya Goel	1c1d386561	Adds int32_t and uint32_t clip kernels (#15306 )	2023-04-04 13:44:50 -07:00
petermcaughan	1251964f96	Petermca/beamsearch whisper (#15339 ) ### Description Adjust various code paths to allow Whisper model to function with BeamSearch op. Approach: Add a new kModelType enum value in IGenerationParameters as so: #### Old: 0 = GPT2, 1 = T5 #### New: 0 = GPT2, 1 = T5, 2 = Whisper When the user assigns this attribute value to 2, various shape and type checks are changed to accommodate Whisper inputs. ### Motivation and Context BeamSearch is currently designed to function with BERT-based models with inputs as vocab tokens, and needs changes to function with Whisper inputs (3-D float values processed from audio data). --------- Co-authored-by: Peter McAughan <petermca@microsoft.com>	2023-04-04 09:09:10 -07:00
pengwa	5baf5f506b	log level control + fix typos (#15302 ) ### log level control + fix typos	2023-04-04 20:19:13 +08:00
Ye Wang	fbfe92f66a	DecoderMaskedMultiHeadAttention enhancement (#15292 )	2023-04-02 21:53:03 -07:00
Yufeng Li	c08d6b42e8	Add tool to support packing mode for BERT model (#15283 ) ### Description Add a tool to convert a fused BERT-like model to packing mode.	2023-03-31 08:46:47 -07:00
Xavier Dupré	786f8b98f7	Add a page in the documentation for every operator in onnxruntime (#14340 )	2023-03-30 14:39:16 -07:00
Jian Chen	792d411135	Update python 3.11 and remove 3.7 for Linux (#15214 ) ### Description Update python 3.11 and remove 3.7 ### Motivation and Context Update python 3.11 and remove 3.7 --------- Co-authored-by: Ubuntu <chasun@chasunlinux.lw3b1xzoyrkuzm34swpscft0ff.dx.internal.cloudapp.net>	2023-03-27 14:46:30 -07:00
Patrice Vignola	67a6022c03	[DML EP] Add GroupNorm (#15189 ) Comparison between the different normalization operations: ![](https://user-images.githubusercontent.com/1041752/106491728-73d40680-64b7-11eb-8769-3f758996e959.png)	2023-03-27 12:52:53 -07:00
Justin Chu	d834ec895a	Adopt lintrunner as the linting tool - take 2 (#15085 ) ### Description `lintrunner` is a linter runner successfully used by pytorch, onnx and onnx-script. It provides a uniform experience running linters locally and in CI. It supports all major dev systems: Windows, Linux and macOS. The checks are enforced by the `Python format` workflow. This PR adopts `lintrunner` to onnxruntime and fixed ~2000 flake8 errors in Python code. `lintrunner` now runs all required python lints including `ruff` (replacing `flake8`), `black` and `isort`. Future lints like `clang-format` can be added. Most errors are auto-fixed by `ruff` and the fixes should be considered robust. Lints that are more complicated to fix are marked `# noqa` for now and should be fixed in follow up PRs. ### Notable changes 1. This PR removed some suboptimal patterns: - `not xxx in` -> `xxx not in` membership checks - bare excepts (`except:` -> `except Exception`) - unused imports The follow up PR will remove: - `import *` - mutable values as default in function definitions (`def func(a=[])`) - more unused imports - unused local variables 2. Use `ruff` to replace `flake8`. `ruff` is much (40x) faster than flake8 and is more robust. We are using it successfully in onnx and onnx-script. It also supports auto-fixing many flake8 errors. 3. Removed the legacy flake8 ci flow and updated docs. 4. The added workflow supports SARIF code scanning reports on github, example snapshot: ![image](https://user-images.githubusercontent.com/11205048/212598953-d60ce8a9-f242-4fa8-8674-8696b704604a.png) 5. Removed `onnxruntime-python-checks-ci-pipeline` as redundant ### Motivation and Context Unified linting experience in CI and local. Replacing https://github.com/microsoft/onnxruntime/pull/14306 --------- Signed-off-by: Justin Chu <justinchu@microsoft.com>	2023-03-24 15:29:03 -07:00
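The patterns named in the lintrunner commit above (membership checks, bare excepts, and the mutable-default case listed for the follow-up PR) can be shown side by side. This is a small illustrative sketch; the function names are hypothetical and none of this code is taken from onnxruntime.

```python
def is_missing(key, mapping):
    # Preferred spelling: `key not in mapping` rather than `not key in mapping`.
    return key not in mapping


def parse_int(text, default=0):
    try:
        return int(text)
    except Exception:
        # `except Exception` instead of a bare `except:`, which would also
        # swallow KeyboardInterrupt and SystemExit.
        return default


def add_item_shared(item, bucket=[]):
    # Deliberately shows the flagged pattern: the default list is created once
    # and shared by every call that omits `bucket`.
    bucket.append(item)
    return bucket


def add_item(item, bucket=None):
    # Usual replacement: use None as the default and build a fresh list per call.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


print(add_item_shared(1), add_item_shared(2))  # both names refer to the same list
print(add_item(1), add_item(2))                # two independent lists
```

The shared-default function is kept only to make the failure mode visible; a linter run over this snippet would be expected to flag it.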
Nat Kershaw (MSFT)	28f64066de	Auto deploy API docs (#15088 )	2023-03-23 15:08:49 -07:00
Ye Wang	44ba23e0f5	Rename DecoderMaskedMHA to DecoderMaskedSelfAttn (#15166 ) ### Description As synced offline, rename this op; another op will be created for MHA that supports both self and cross attention. --------- Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2023-03-23 12:31:38 -07:00
Ye Wang	2ee822d483	Extend memory efficient attention coverage in Attention/MHA cuda op (#15064 ) ### Description 1. upgrade cutlass to 3.0, which contains attn_bias support. 2. extend Attention/MHA to use memory efficient attention when rel_pos_bias with [1, num_head, s, s] and 1d mask with [2 batch_size + 1] are present. new mask format introduction: MASK_1D_KEY_SEQ_LEN_START, [3 * batch_size + 2] with [key_len[0], ..., key_len[batch_size - 1], query_start[0], ..., query_start[batch_size - 1], query_end[batch_size - 1], key_start[0], ..., key_start[batch_size - 1], key_end[batch_size - 1]] e.g. a 2D mask of [[1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 0]] converts to the 1D mask [3, 5, 0, 6, 12, 0, 6, 12] ### Motivation and Context It potentially benefits tnlrv6 and t5(encoder) --------- Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net> Co-authored-by: Kunal Vaishnavi <kvaishnavi@microsoft.com> Co-authored-by: Kunal Vaishnavi <kvaishnavi@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>	2023-03-23 11:05:17 -07:00
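The MASK_1D_KEY_SEQ_LEN_START layout described in the commit above can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the function name is hypothetical, the 2D mask is right-padded (valid positions first), and the query sequence length defaults to the key sequence length, which is what the commit's own example implies.

```python
def mask_2d_to_key_seq_len_start(mask_2d, query_len=None):
    """Build [key_len..., query_start..., query_end, key_start..., key_end].

    mask_2d:   list of rows of 0/1, shape [batch_size, key_sequence_length].
    query_len: padded query length per batch entry (assumed equal to the key
               length when omitted).
    """
    batch_size = len(mask_2d)
    key_total = len(mask_2d[0])
    if query_len is None:
        query_len = key_total

    # Number of valid key positions per batch entry.
    key_len = [sum(row) for row in mask_2d]
    # Start offset of each entry in the flattened padded buffers, followed by
    # the end offset of the last entry (batch_size + 1 values each).
    query_offsets = [i * query_len for i in range(batch_size + 1)]
    key_offsets = [i * key_total for i in range(batch_size + 1)]

    return key_len + query_offsets + key_offsets


mask = [[1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 0]]
print(mask_2d_to_key_seq_len_start(mask))  # [3, 5, 0, 6, 12, 0, 6, 12]
```

The result has `3 * batch_size + 2` entries, matching the length given in the commit message.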
Hariharan Seshadri	7033346605	Support mask_filter_value attribute in DecoderMaskedMultiheadAttention (#15158 )	2023-03-23 11:00:09 -07:00
pengwa	1d32285536	Statistics tool for ORTModule convergence parity (#15020 ) ### Statistics tool for ORTModule convergence parity As ORTModule gets more and more validated, it is pretty fast to integrate a PyTorch-based model with ORT. At the same time, we need to make sure that once there is a convergence issue, we don't spend months investigating it. As part of this effort, this PR introduces a tool to dump activation statistics without much involvement from users. The dumped results contain only some statistics plus sampled data, so they are small; compared with dumping all the tensors, this is much faster and more space efficient. To use it, two lines are needed before wrapping ORTModule. The baseline run needs the same two lines. ``` + from onnxruntime.training.utils.hooks import SubscriberManager, StatisticsSubscriber + SubscriberManager.subscribe(model, [StatisticsSubscriber("pt_out", override_output_dir=True)]) ``` Once you run the steps, the following command can be used to merge the results into per-step summaries for the ORT and baseline runs respectively. ```bash python -m onnxruntime.training.utils.hooks.merge_activation_summary --pt_dir pt_out --ort_dir ort_out --output_dir /tmp/output ``` Docs are added as part of this PR: [convergence investigation notes](https://github.com/microsoft/onnxruntime/blob/pengwa/conv_tool/docs/ORTModule_Convergence_Notes.md) Based on the generated merged files, we can compare them with tools. ![image](https://user-images.githubusercontent.com/10530022/224653929-4e4480bd-bb02-4bbe-bd44-2672bdf91a87.png) ### Design and Implementation This PR introduces a common mechanism for registering custom logic as nn.Module post-forward hooks. Statistics for activations (StatisticsSubscriber) is one of the implementations. If there are other needs, we can define another XXSubscriber to do the customized things.	2023-03-23 20:34:24 +08:00
Yufeng Li	c7ced7a5e9	Add PackedAttention for packing mode (#14858 ) ### Description Transformer models can handle a batch of inputs at once. However, sequences in a batch usually have different lengths, so the shorter ones have to be padded to the length of the longest. This is inefficient, especially for large batches with high length variance. This PR introduces a PackedAttention operator which takes in packed sequences (no padding) and also produces output in packing mode. There will be another PR to use PackedAttention to implement the encoder in packing mode.	2023-03-21 12:59:29 -07:00
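The difference between a padded batch and a packed one, as motivated in the PackedAttention commit above, can be sketched as follows. This illustrates the packing idea only: the function names and the cumulative-length bookkeeping are assumptions for the example and do not describe the operator's actual input specification.

```python
def pack_sequences(padded_batch, lengths):
    """Drop padding and concatenate the valid tokens of every sequence.

    Returns the packed token list and cumulative sequence lengths, where
    cumulative[i] is the offset at which sequence i starts in the packed list.
    """
    packed = []
    cumulative = [0]
    for row, n in zip(padded_batch, lengths):
        packed.extend(row[:n])
        cumulative.append(cumulative[-1] + n)
    return packed, cumulative


def unpack_sequences(packed, cumulative, max_len, pad_value=0):
    """Inverse of pack_sequences: restore the padded [batch, max_len] layout."""
    batch = []
    for start, end in zip(cumulative[:-1], cumulative[1:]):
        row = packed[start:end]
        batch.append(row + [pad_value] * (max_len - len(row)))
    return batch


padded = [[5, 6, 7, 0, 0],
          [8, 9, 0, 0, 0],
          [1, 2, 3, 4, 0]]
lengths = [3, 2, 4]

packed, cumulative = pack_sequences(padded, lengths)
print(packed)      # 9 tokens instead of 15 padded slots
print(cumulative)  # offsets of each sequence in the packed buffer
```

In this toy batch, packing reduces the number of token slots from 15 to 9; the saving grows with batch size and with the variance of sequence lengths.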
Hariharan Seshadri	ed7ab1660d	[CUDA] Add option to use DecoderMaskedMultiheadAttention in BeamSearch (#14990 )	2023-03-15 17:16:32 -07:00
Ye Wang	538d64891a	[t5 optimization] kernel changes to t5 (#14928 ) ### Description 1. support optional bias in Attention op (used in T5 encoder) 2. support broadcasting rel_pos_bias in attention_softmax.h 3. add scale in MHA op's attributes 4. support past_key/past_value and present_key/present_value in MHA 5. UT and parity tests are added 6. fix an issue: https://github.com/microsoft/onnxruntime/issues/14920 note: the fusions will be in another PR since mt5 needs to be tested and an issue from github will be investigated. Future work: 1. support shared buffer for past/present 2. enable trt kernels when possible and investigate (trt/cutlass) kernels with rel_pos_bias 3. support KV/QKV packing with past/present --------- Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2023-03-13 14:29:16 -07:00
Hariharan Seshadri	112a4d215a	[CUDA] Support decoding multihead self-attention implementation (#14848 )	2023-03-08 09:17:54 -08:00
pengwa	f6c81d8aca	Introduce padding inspector in ORTModule (#14652 ) ### Introduce padding inspector in ORTModule In some Transformer-based LLM training recipes, high data sparsity is observed due to 1). token padding (to max sequence length), 2). labels contains many ignore_index for calculate loss. This PR introduces a switch to enable data sparsity inspection, which 1). in short term, can inform training users to use techniques like dynamic batching to amortize the issue. 2). in medium and longer term, also helps us (training team) to have better understanding what our training customers' models looks like from perspective of data sparsity (and potentially motivate us to improve with runtime). Here is an example of different data sparsity with same training model arch, same training input, but with different user models. Low Embed Density, High Label Density Case - Sentence Classification ` python -m torch.distributed.launch --nproc_per_node=4 examples/onnxruntime/training/text-classification/run_glue.py --model_name_or_path roberta-large-openai-detector --task_name mnli --do_train --do_eval --max_seq_length 128 --per_device_train_batch_size 32 --learning_rate 2e-5 --num_train_epochs 3 --overwrite_output_dir --output_dir ./outputs/ --per_device_eval_batch_size 32 --seed 1137 --fp16 True --ignore_mismatched_sizes True --optim adamw_ort_fused ` ``` >>>Valid token/label density (e.g. 
valid/total) in passing 10 steps: \| STEP \| INPUT TYPE \| INPUT NAME \| PAD IDX \| DENSITY \| VALID TOKENS \| TOTAL TOKENS \| VALID TOKENS/BATCH \| \| 60 \| EMBED \| input_ids \| 1 \| 35.21 % \| 1442 \| 4096 \| [50, 81, 35, 11, 29, 36, 66, 19, 40, 22, 21, 42, 17, 37, 40, 41, 26, 58, 38, 54, 41, 73, 48, 57, 50, 51, 49, 85, 48, 36, 79, 62] \| \| 61 \| LABEL \| labels \| -100 \| 100.00 % \| 32 \| 32 \| N/A \| \| 62 \| EMBED \| input_ids \| 1 \| 30.00 % \| 1229 \| 4096 \| [36, 73, 13, 47, 27, 33, 53, 25, 51, 28, 36, 42, 42, 32, 39, 52, 27, 13, 31, 66, 42, 45, 52, 45, 58, 42, 37, 66, 12, 18, 29, 17] \| \| 63 \| LABEL \| labels \| -100 \| 100.00 % \| 32 \| 32 \| N/A \| \| 64 \| EMBED \| input_ids \| 1 \| 26.73 % \| 1095 \| 4096 \| [37, 28, 20, 53, 16, 20, 44, 52, 27, 28, 16, 19, 16, 24, 63, 31, 24, 42, 33, 41, 44, 60, 44, 67, 54, 30, 20, 19, 33, 23, 24, 43] \| \| 65 \| LABEL \| labels \| -100 \| 100.00 % \| 32 \| 32 \| N/A \| \| 66 \| EMBED \| input_ids \| 1 \| 30.03 % \| 1230 \| 4096 \| [22, 46, 36, 41, 46, 43, 26, 50, 60, 16, 24, 42, 56, 35, 35, 59, 29, 39, 34, 20, 66, 23, 47, 53, 19, 35, 44, 23, 34, 81, 21, 25] \| \| 67 \| LABEL \| labels \| -100 \| 100.00 % \| 32 \| 32 \| N/A \| \| 68 \| EMBED \| input_ids \| 1 \| 31.62 % \| 1295 \| 4096 \| [75, 36, 48, 20, 38, 21, 49, 54, 38, 41, 26, 28, 80, 45, 48, 16, 22, 41, 34, 28, 37, 16, 74, 63, 62, 34, 22, 45, 23, 27, 37, 67] \| \| 69 \| LABEL \| labels \| -100 \| 100.00 % \| 32 \| 32 \| N/A \| <<< ``` High Embed Density, Low Label Density Case - masked language model ` python -m torch.distributed.launch --nproc_per_node=4 examples/onnxruntime/training/language-modeling/run_mlm.py --model_name_or_path bert-base-uncased --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --num_train_epochs 10 --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --do_train --do_eval --overwrite_output_dir --output_dir ./outputs/ --seed 1137 --fp16 --report_to none --optim adamw_ort_fused ` ``` >>>Valid token/label density 
(e.g. valid/total) in passing 10 steps: \| STEP \| INPUT TYPE \| INPUT NAME \| PAD IDX \| DENSITY \| VALID TOKENS \| TOTAL TOKENS \| VALID TOKENS/BATCH \| \| 710 \| EMBED \| input_ids \| 0 \| 100.00 % \| 4096 \| 4096 \| [512, 512, 512, 512, 512, 512, 512, 512] \| \| 711 \| LABEL \| labels \| -100 \| 13.77 % \| 564 \| 4096 \| N/A \| \| 712 \| EMBED \| input_ids \| 0 \| 100.00 % \| 4096 \| 4096 \| [512, 512, 512, 512, 512, 512, 512, 512] \| \| 713 \| LABEL \| labels \| -100 \| 14.48 % \| 593 \| 4096 \| N/A \| \| 714 \| EMBED \| input_ids \| 0 \| 100.00 % \| 4096 \| 4096 \| [512, 512, 512, 512, 512, 512, 512, 512] \| \| 715 \| LABEL \| labels \| -100 \| 14.18 % \| 581 \| 4096 \| N/A \| \| 716 \| EMBED \| input_ids \| 0 \| 100.00 % \| 4096 \| 4096 \| [512, 512, 512, 512, 512, 512, 512, 512] \| \| 717 \| LABEL \| labels \| -100 \| 14.53 % \| 595 \| 4096 \| N/A \| \| 718 \| EMBED \| input_ids \| 0 \| 100.00 % \| 4096 \| 4096 \| [512, 512, 512, 512, 512, 512, 512, 512] \| \| 719 \| LABEL \| labels \| -100 \| 15.31 % \| 627 \| 4096 \| N/A \| <<< ``` #### Next Step Let's see how we leverage the data sparsity for improvement. Optimizations on the way around compute optimizer wave 2: > Loss compute flops reduction. > Flatten/Unflatten embedding tokens to save compute flops.	2023-03-03 18:36:08 +08:00
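The DENSITY, VALID TOKENS and VALID TOKENS/BATCH columns reported by the padding inspector above follow from simple counting. The sketch below shows that arithmetic on plain Python lists; the function name and return structure are hypothetical, and this is not the ORTModule implementation.

```python
def token_density(batch, pad_idx):
    """Count non-padding entries in a [batch, sequence] nested list.

    pad_idx is the padding token id for inputs (e.g. 0 or 1) or the
    ignore_index for labels (e.g. -100).
    """
    valid_per_row = [sum(1 for token in row if token != pad_idx) for row in batch]
    valid = sum(valid_per_row)
    total = sum(len(row) for row in batch)
    return {
        "density_percent": 100.0 * valid / total,
        "valid": valid,
        "total": total,
        "valid_per_row": valid_per_row,
    }


# Tiny embedding-input example: pad id 1, two sequences padded to length 4.
input_ids = [[7, 8, 1, 1],
             [9, 1, 1, 1]]
print(token_density(input_ids, pad_idx=1))

# Tiny label example: only one position contributes to the loss.
labels = [[-100, 4, -100, -100]]
print(token_density(labels, pad_idx=-100))

# Cross-check against the first EMBED row reported in the log above.
print(round(100.0 * 1442 / 4096, 2))
```

The last line recomputes the 35.21 % figure from the 1442 valid tokens out of 4096 shown for step 60.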
Justin Stoecker	928289c414	STFT for DML EP (#14736 ) ### Description Implements the STFT operator for the DirectML execution provider. This is implemented as a custom op, just like the DFT kernel, because it's implemented as a composite of two operators (DML Mul/Identity + DFT). As such, this inherits the same restrictions as the existing DFT kernel (requires power-of-two window sizes for now). This change also adds a native FP16 shader to DFT so that both DFT/STFT kernels support float16 tensors. There is no typed UAV fallback or emulation path, so the HW _needs_ to support native float16. It also appears the stockham shader was compiled with all optimizations disabled and debug symbols (tsk tsk, Sheil), and this has been fixed. This is passing all existing STFT tests (i.e. all of 1). I'm adding some additional collateral in the Windows AI conformance tests in parallel to check some extra cases. --------- Co-authored-by: Patrice Vignola <vignola.patrice@gmail.com>	2023-02-23 21:12:22 -08:00
James Yuzawa	d925055a3e	Fix broken and outdated links in documentation (#14092 ) ### Description I fixed some broken links in the C API documentation, then did a quick pass over all of the links I could find and fixed those too. ### Motivation and Context I got some 404's when exploring the documentation and wanted to fix them.	2023-02-23 10:48:04 -08:00
Ye Wang	58da3cacdf	support NeoX-style rotary embedding (#14785 ) Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2023-02-22 18:21:34 -08:00
Sheil Kumar	1b7f65437e	Enable Opset11 Sequence Ops on DirectML, and make the CPU implementations agnostic to backend EP (#14442 ) Opset 11 introduced the following sequence related operators: - SequenceAt - SequenceConstruct - SequenceEmpty - SequenceLength - SequenceErase - SequenceInsert - ConcatFromSequence With the exception of ConcatFromSequence, all of the above operators were implemented with CPU kernels that a) required all of the contained tensors to also be on CPU, and b) would clone each tensor into a new sequence as a side effect of each operator. The implementation of sequences is backend agnostic, as the operators don't affect actual tensor layout or manipulate the contents of the tensors. In addition, with the exception of SequenceAt, the other operators need not make copies of the underlying referenced tensors. Consequently, this change does the following: 1) Sequence* operators (except SequenceAt) no longer copy the contents of a sequence of tensors on every kernel execution. 2) SequenceAt uses the DataTransferManager to copy tensors agnostic to backend. 3) The internal container implemented by TensorSeq has changed from onnxruntime::Tensor to OrtValue. This is because onnxruntime::Tensor does not support copy or assignment construction, so it must have a singular owner. However, if the same tensor participates in multiple containers it would have multiple container "owners" and this would not be possible. 4) Other code that accessed values from TensorSeq has associated changes to extract Tensors from OrtValues now. In addition, DirectML execution was very slow when the above Sequence operators were added to a graph, as this caused MemcpyToHost and MemcpyFromHost kernels to be inserted between the graph and the sequence operators. To optimize DirectML, 1) The CPU implementations for the Sequence* ops were registered as DML implementations. 
Since the above changes also includes making the CPU kernel implementations EP agnostic, the CPU kernels can be added as is. 2) The ConcatFromSequence operator needed to be implemented on DirectML. However, there was little DirectML EP operator framework support for operators that accept/output sequences of tensors. This change has modified the internal COM interfaces to include new apis to interrogate for sequence shapes, and extract the needed tensors from TensorSeq. --------- Co-authored-by: Patrice Vignola <vignola.patrice@gmail.com>	2023-02-21 18:08:28 -08:00
Ryan Hill	892f59b31a	Add string support to tile op (#14686 ) ### Description Add std::string tensor type support to Tile operator ### Motivation and Context Multiple users are hitting this missing feature: https://github.com/microsoft/onnxruntime/issues/14511	2023-02-16 14:59:44 -08:00
Tianlei Wu	eb2ac72fa9	Stable Diffusion CUDA Optimizations Part 4 (#14680 ) (1) Support packed QKV format in MultiHeadAttention. This format could avoid add bias transpose when TRT fused kernel is used. (2) Add cache for cumulated sequence length computation. For SD, it only need computed once since sequence length is fixed. (3) Do not allocate qkv workspace to save memory for packed KV or QKV. (4) Add unit tests for packed kv and packed qkv format in MultiHeadAttention (5) Mark some fusion options for SD only Performance tests show slight improvement in T4. Average latency reduced 0.15 seconds (from 5.25s to 5.10s) for 512x512 in 50 steps for SD 1.5 models. Memory usage drops from 5.1GB to 4.8GB.	2023-02-15 14:55:42 -08:00
Tianlei Wu	f638c5a2ae	Stable Diffusion CUDA Optimizations Part 3 (#14646 ) The third part for stable diffusion CUDA optimizations (1) Add BiasAdd operator to replace two Add (bias and residual); Add fusion for BiasAdd (2) Add Attention fusion for VAE decoder. (3) Update float16 conversion to handle Resize and GroupNorm. This could reduce two Cast nodes for each Resize op in fp16 model. (4) Force inputs and outputs to be float16 to avoid data casts in the pipeline. (5) Add options --force_fp32_ops, --inspect etc in optimize script so that user could force some operator to run in float32 to potentially get better image quality (with cost of performance). Performance tests show slight improvement in T4. Average latency reduced 0.1 seconds (from 5.35s to 5.25s) for 512x512 in 50 steps.	2023-02-14 12:46:50 -08:00
Ye Wang	b539c364ee	Some kernel changes for TULR (#14517 ) ### Description 1. fix a bug in relative position bias kernel where seq_len > 32 2. rename extra_add_qk to relative_position_bias 3. support relative_position_bias in multihead attention (B, N, S, S*) 4. gru_gate support by Lei --------- Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net> Co-authored-by: Lei Zhang <zhang.huanning@hotmail.com>	2023-02-07 11:51:06 -08:00
Yufeng Li	8de885fdb1	reduce cuda library binary size (#14555 ) ### Description Reduce the cuda library size by: 1. refactoring beam_search_top_k to reduce template instantiation. It saves ~56MB 2. opt out TopK for type uint*, int8_t and int16_t. It saves ~50MB.	2023-02-07 09:03:14 -08:00
Patrice Vignola	b8fb9320ac	[DML EP] Fix ScatterElements registration (#14560 )	2023-02-06 10:01:02 -08:00
Nat Kershaw (MSFT)	638f21b969	Upgrade doxygen to fix C API docs build issue (#13950 )	2023-02-03 09:43:29 -08:00
Tianlei Wu	a6c5ba0185	Stable Diffusion CUDA Optimizations (#14428 ) ### Description Add stable diffusion CUDA kernel optimizations. The following are included: (1) GroupNorm operator. This kernel is from TensorRT 8.5. (2) BiasSplitGelu operator. This kernel is modified from SplitGelu of TensorRT 8.5. We added bias to the SplitGelu. (3) NhwcConv operator. This adds support of NHWC format (ONNX Conv operator uses NCHW format). (4) Update MultiHeadAttention (packed kv and no bias) for cross attention. This could avoid transpose of kv for TRT fused cross attention kernel. (5) Optimization and benchmark script Not included: (1) Script to convert Conv to NhwcConv in onnx graph. (2) Update symbolic shape inference for NhwcConv. (3) Add SeqLen2Spatial operator (4) Documents Limitations: GroupNorm, BiasSplitGelu and NhwcConv kernels are implemented based on stable diffusion usage. They might not be applicable to any input size or dimensions. For example, BiasSplitGelu requires hidden size to be 2560 \| 5120 \| 10240, and NhwcConv assumes 4D input/weight. There is a minor increase in binary size. For SM=75 only, python package wheel size adds (33757K - 33640K) = 117 KB. It is possible to move NHWC from template parameter to constructor to reduce binary size (with slight cost of performance). Note: for RTX 4090/4080/4070 Ti, need build with CUDA 11.8 and latest cuDNN to get best performance.	2023-02-02 23:43:51 -08:00
Numfor Tiapo	3cc81460e0	Register ScatterElements-16 (#14425 ) This PR registers ScatterElements-16 to the DML EP - CPU fallback is added if the reduction attribute is in use, as this is not yet supported by DML. --------- Co-authored-by: Numfor Mbiziwo-Tiapo <numform@microsoft.com>	2023-02-01 09:46:37 -08:00
Rui Ren	eacd829d23	Bump ORT version number (#14226 ) ### Description Bump ort version after the creation of release candidate of 1.14 Co-authored-by: ruiren <ruiren@microsoft.com>	2023-01-26 12:33:47 -08:00
liqun Fu	2b1a59f01a	cpu support of LpPool(18) (#14205 ) Signed-off-by: Liqun Fu <liqfu@microsoft.com> ### Description To support LpPool (18) ### Motivation and Context for Ort 1.14 release Signed-off-by: Liqun Fu <liqfu@microsoft.com>	2023-01-25 23:14:56 -08:00
Thiago Crepaldi	32c05fcdd1	Add Col2Im CPU op (#12311 ) Description This PR implements N-dimensional Col2Im as a contrib CPU Op as specified by ONNX's https://github.com/onnx/onnx/pull/3948 Motivation and Context - Col2Im enables models such as: - [SS-DCNet](https://github.com/xhp-hust-2018-2011/SS-DCNet) - [DSTT](https://github.com/ruiliu-ai/DSTT) - It also serves to document the ORT's obscure `math::Col2ImNd` utility Signed-off-by: Liqun Fu <liqfu@microsoft.com> Co-authored-by: Liqun Fu <liqfu@microsoft.com>	2023-01-25 12:23:00 -08:00
Edward Chen	3bc092b1ea	Update ORT format v5 change docs to cover limited backwards compatibility in 1.14. (#14413 )	2023-01-25 08:23:12 -08:00
liqun Fu	7b6d880b28	cpu to support bitwise ops (#14197 )	2023-01-23 16:42:18 -08:00
Scott McKay	c252a7f992	Remove exclusions for ONNX model tests that now pass. (#14337 ) ### Description Remove exclusions for ONNX model tests that now pass due to kernels being implemented. Update ONNX update doc to point to correct location for tests. ### Motivation and Context Run as many tests as possible. Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2023-01-24 08:04:27 +10:00
liqun Fu	05915d8393	support Pad(18) (#14219 )	2023-01-23 12:14:35 -08:00
Nat Kershaw (MSFT)	abaed6f474	Add link to Python API examples (#14345 )	2023-01-21 16:23:16 -08:00
Nat Kershaw (MSFT)	e57c312f9d	Pin sphinx to avoid broken link (#14383 )	2023-01-21 09:50:56 -08:00
Ye Wang	de7a868d5f	Update quantization_defs.cc (#14380 )	2023-01-20 15:03:50 -08:00
Ye Wang	668586e8f8	Support muP in Attention (#14348 ) Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2023-01-19 20:36:55 -08:00
liqun Fu	5d6a049141	support ScatterND(18) and ScatterElement(18) (#14224 )	2023-01-19 13:54:20 -08:00
Tianlei Wu	477cad3051	[CUDA] Add trt cross attention kernels (#14328 ) Add TRT cross attention kernels for stable diffusion optimization.	2023-01-17 17:55:45 -08:00
Ye Wang	2db57a53a3	Add mask_filter in Attention related ops' attribute (#14274 ) ### Motivation and Context https://github.com/microsoft/onnxruntime/issues/12843 Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2023-01-17 12:28:11 -08:00
Zhang Lei	15141a40b4	Add present_past_share_buff to QAttention Defs to enable QAttention related tests. (#14297 )	2023-01-14 09:19:06 -08:00
Ye Wang	c9a53c9255	Some changes to Sampling Op (#14218 ) ### Description 1. add an optional input to pass in seed 2. two UTs. one for top_p=0.5, another for top_p=0.01(create greedy search result, in convert_generation.py) 3. fix a bug in cpu kernel Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2023-01-12 14:15:26 -08:00
Numfor Tiapo	dee36f8ade	DML EP Register ScatterND-16 (#14240 ) This PR registers ScatterND-16 to the DML EP - CPU fallback is added if the reduction attribute is in use, as this is not yet supported by DML. Co-authored-by: Numfor Mbiziwo-Tiapo <numform@microsoft.com>	2023-01-12 10:39:25 -08:00
sfatimar	7654cd50e8	Openvino ep 2022.3 v4.3 (#14210 ) ### Description Changes to incorporate OpenVINO EP 2022.3 ### Motivation and Context This change is required to incorporate OpenVINO EP 2022.3 Co-authored-by: mohsinmx <mohsinx.mohammad@intel.com> Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com> Co-authored-by: Aravind <aravindx.gunda@intel.com> Co-authored-by: mayavijx <mayax.vijayan@intel.com> Co-authored-by: flexci <mohsinmx>	2023-01-11 16:31:26 -08:00
Scott McKay	dd2df460b3	Split(18) (#14015 ) ### Description Opset 18 Split changes. Adds ability to specify num_outputs which also allows uneven splitting. https://github.com/onnx/onnx/releases/tag/v1.13.0 ### Motivation and Context Support ONNX opset 18.	2023-01-12 08:14:10 +10:00
Ye Wang	a01bf8dbb1	rename CrossAttention to MultiHeadAttention (#14201 ) ### Description Rename CrossAttention to MultiHeadAttention, since this op can also be used for self-attention. Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2023-01-10 10:18:39 -08:00
Numfor Tiapo	f4ea781b81	DML EP Register Identity-16 (#14053 ) This PR Registers Identity-16 to the DML EP. ONNX Backend tests and optional type tests were skipped pending future additions. Co-authored-by: Numfor Mbiziwo-Tiapo <numform@microsoft.com>	2023-01-10 09:16:09 -08:00
liqun Fu	1be36913cc	to work with onnx 1.13 rc, implement ver 18 reduce and optional ops, … (#13765 )	2023-01-09 10:26:16 -08:00
Ye Wang	5eac2c1f41	relational attention bias cuda op (#14149 ) ### Description This cuda op implements the compute_bias() method in T5 Attention including the permutation. note: 1. bias_table needs to be saved in col-major. be careful when implementing fusion script 2. second input(sequence length) is placed on cpu. (using Shape node's output should be good) 3. the first dimension of output is 1, so extra_add_qk in attention should support broadcasting 4. compute_bias() only used in self-attn in t5 TODO: docs change will be applied later ### Motivation and Context It's part of the process of optimizing t5 attention as well as t5 based generation model Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2023-01-06 17:32:58 -08:00
Tianlei Wu	2cacb24cb0	Add CrossAttention operator (#14146 ) Move separated Q, K and V (without input projection) from Attention to a new operator CrossAttention. The Attention operator is hard to maintain when we need support with and without input projection in one class. Add a new operator according to feedback. Some change might need in the future, but not in this PR: (1) bias could be optional (We will not proceed that route unless experiments show that fusing Add bias with MatMul instead of this op could improve performance). (2) support packed KV. There are two ways to support it: when key and value are same Tensor, they are packed; or we can make value as optional, and use packed mode when value is empty and the key has packed K/V. (3) support cached key and value, and other (like relative position bias), or more attention mask format. They can be added easily without breaking backward compatible. (4) ROCm/CPU implementation of this op.	2023-01-06 14:27:40 -08:00
Hariharan Seshadri	d0c5ffd5f7	Misc transformer fixes - 2 (#14156 ) ### Description 1. The graph pattern search introduced in https://github.com/microsoft/onnxruntime/pull/13914/ needs to be enhanced so that SkipLayerNormalization is supported 2. Fix fp32 parity for GPT-2 while using `SkipLayerNormalization` fusion. The optional output of SLN needs to also include the bias (if present) and the added output should be a sum of `input + skip + (bias)` ### Motivation and Context Fix some breaking tests	2023-01-06 07:27:10 -08:00
Ye Wang	ae148ebc05	T5 skip_layer_norm cuda op (#14093 ) ### Description T5 uses a layer_norm which only scales and doesn't shift, which is also known as Root Mean Square Layer Normalization. ORT already has the simplified_layer_norm, which is the RMS layer_norm. This PR extends this T5 layer_norm with support for skip/bias and the residual output. This new op is named SkipSimplifiedLayerNorm and has a similar interface to SkipLayerNorm but removes beta as an input. Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2023-01-04 13:31:53 -08:00
Ashwini Khade	68b5b2d7d3	Refactor training build options (#13964 ) ### Description 1. Renames all references of on device training to training apis. This is to keep the naming general. Nothing really prevents us from using the same apis on servers\non-edge devices. 2. Update ENABLE_TRAINING option: With this PR when this option is enabled, training apis and torch interop are also enabled. 3. Refactoring for onnxruntime_ENABLE_TRAINING_TORCH_INTEROP option: - Removed user facing option - Setting onnxruntime_ENABLE_TRAINING_TORCH_INTEROP to ON when onnxruntime_ENABLE_TRAINING is ON as we always build with torch interop. Once this PR is merged when --enable_training is selected we will do a "FULL Build" for training (with all the training entry points and features). Training entry points include: 1. ORTModule 2. Training APIs Features include: 1. ATen Fallback 2. All Training OPs includes communication and collectives 3. Strided Tensor Support 4. Python Op (torch interop) 5. ONNXBlock (Front end tools for training artifacts prep when using training apis) ### Motivation and Context Intention is to simplify the options for building training enabled builds. This is part of the larger work item to create dedicated build for learning on the edge scenarios with just training apis enabled.	2023-01-03 13:28:16 -08:00
Ye Wang	68518a1b72	Sampling op (#13426 ) ### Description Sampling op for cpu and cuda support huggingface case and custom case Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>	2022-12-22 17:34:12 -08:00
pengwa	2f5bf75e51	Optimize computation orders (#13672 ) ### Optimize computation orders In `Roberta/Electra`, when `ClassificationHead` is used, there is slicing operation on features on sequence_length dimensions, then loss calculations only depend on this sliced data. This is a slicing at axis 1. Before slicing the shape is [batch, sequence_length, hidden], after slicing, it becomes [batch , hidden_stage] We had opportunities to bring this slicing earlier as much as possible, by passing through simple elementwise ops (like Add/Div), or Layernorm/Softmax(if their reduce axis is after the slicing axis), or even MatMul's the left operand (if only it did not affect the last dims). For operators like Reshape/Transpose, it is special since they have either data specified (after slicing we need update), or they have perm specified, which requires the input rank remain unchanged. So for those kinds of operators, we can remain the original rank, but just leave the sliced dim to be 1, after the compute completed, we do a Squeeze. ``` class RobertaClassificationHead(nn.Module): """Head for sentence-level classification tasks.""" def __init__(self, config): super().__init__() self.dense = nn.Linear(config.hidden_size, config.hidden_size) classifier_dropout = ( config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob ) self.dropout = nn.Dropout(classifier_dropout) self.out_proj = nn.Linear(config.hidden_size, config.num_labels) def forward(self, features, **kwargs): x = features[:, 0, :] # take <s> token (equiv. to [CLS]) x = self.dropout(x) x = self.dense(x) x = torch.tanh(x) x = self.dropout(x) x = self.out_proj(x) return x ``` src\transformers\models\roberta\modeling_roberta.py src\transformers\models\electra\modeling_electra.py #### Benchmark A simple benchmark shows Robeta training latency dropped from 208ms ~ 199ms. 4.5+% reduction. More comprehensive tests are on the way. 
	2022-12-22 15:12:52 +08:00
Hariharan Seshadri	7ed8bd4f95	Support (Bias)SkipLayerNormalization fusion in GPT2 (#13988 )	2022-12-21 23:04:44 -08:00
Edward Chen	df8ff34f25	Update CUDA ArgMin/ArgMax op kernels to have end version 11 since opset 12+ is not supported yet. (#13983 ) ### Description Update CUDA ArgMin/ArgMax op kernels to have end version 11 since opset 12+ is not supported yet. With the way these kernels are currently registered, the documentation shows support for opset 11+. This is not accurate. ### Motivation and Context Fix #13781	2022-12-21 19:01:00 -05:00
Numfor Tiapo	8943d623a4	DML EP Register operators for Opset 16 (#14034 ) This PR Registers the following operators for opset 16 to the DML EP: - LeakyRelu-16 - PRelu-16 - Where-16 - GreaterOrEqual-16 - LessOrEqual-16 Identity-16 was not added in this PR due to pipeline failures Co-authored-by: Numfor Mbiziwo-Tiapo <numform@microsoft.com>	2022-12-21 09:05:12 -08:00
Zhang Lei	fba09faf5b	Implement reuse past and present tensor in Attention Ops. (#13791 ) Implement reuse kv_cache past and present tensor in Attention Ops. Unit test for the above feature. Utilize the reuse kv_cache for past and present tensor in Greedy Search. Correctness test for it. Co-authored-by: Zhang Lei <phill.zhang@gmail.com>	2022-12-18 10:03:53 -08:00
Jakub Bachurski	3b17ab7c65	Add float64 kernels for Floor, Ceil, IsNaN (#13906 ) ### Description This PR adds support for `float64` kernels in the latest versions of operators: Floor, Ceil and IsNaN. ### Motivation and Context The lack of these kernels is non-trivial to work around and easily lead to performance losses when it is attempted. When equivalence with an existing implementation is required, precision is easily lost when casting to `float32` instead. IsNaN is common when cleaning up data in an ML pipeline. Floor and Ceil have uses for discretising values and single-precision floats are insufficient to round well when values get larger than a few million. According to my measurement this only increases the binary size by a few kilobytes (on the Python wheel of RelWithDebInfo). Closes #13673 (Round already has float64 support) Partially solves #8791 (Looks like there's parallel issues/PR open for Split, but it is also hard to work around and hence useful) Signed-off-by: jbachurski <kbachurski@gmail.com>	2022-12-14 14:57:14 -08:00
Hariharan Seshadri	abc5c25a85	Updates to GreedySearch/BeamSearch (#13943 )	2022-12-13 20:25:26 -08:00
Patrice Vignola	8246ff015a	[DML EP] Add EmbedLayerNorm (#13868 ) ### Description Add EmbedLayerNorm to the DML EP	2022-12-13 13:23:53 -08:00
Jian Chen	d7d932c1c2	Cjian/where python operator (#12795 ) Description: This PR will enable the python tool to run QWhere and QDQWhere operation Limitation: s8s8 Where is still not supported.	2022-12-12 13:27:47 -08:00
Edward Chen	8cfbc4fe91	Add support for other data types to Split CPU kernel. (#13900 ) Split copies data - we can add support for all data types without too much binary size impact by using data type size-based implementations. The DispatchStridedCopy() function used here does this.	2022-12-12 09:29:15 -08:00
Nat Kershaw (MSFT)	21dd341e52	Add Google Analytics to python apidocs (#13901 )	2022-12-09 15:44:12 -08:00
Patrice Vignola	96d8d2c278	[DML EP] Add SkipLayerNormalization (#13849 ) ### Description Add SkipLayerNormalization for the DML EP	2022-12-07 01:49:14 -08:00
Hariharan Seshadri	004a1538d3	Extend vocab padding for logits MatMul for fp16 GPT2 GreedySearch (#13842 )	2022-12-06 19:39:20 -08:00
Patrice Vignola	b53bbe7370	[DML EP] Add an implementation for NonZero (#13768 ) ### Description Add the NonZero op for DML ### Motivation and Context NonZero is used in a few transformer models, so having a DML implementation will stop large tensors from being transferred to the CPU and back to the GPU	2022-12-02 18:39:21 -08:00
Patrice Vignola	a0b470bc35	[DML EP] Add mixed datatype support for DML's LayerNorm contrib op (#13734 ) ### Description Add mixed datatype support for DML's LayerNorm contrib op. ### Motivation and Context The fusion logic removes casts around LayerNorm in the graph because the contrib version of the op supports mixed datatypes. Scale, Bias and Output's datatypes must match, but input's datatype can be different.	2022-12-01 14:08:18 -08:00
Patrice Vignola	e9b92fdf33	[DML EP] Add DML implementation for BiasGelu (#13795 ) ### Description Add DML implementation for BiasGelu	2022-12-01 09:23:19 -08:00
Tianlei Wu	8b0e0f4927	Add RemovePadding and RestorePadding for BERT model (#13701 ) Add two operators, RemovePadding and RestorePadding, based on the idea of effective transformer (https://github.com/bytedance/effective_transformer) to improve large batch size inference for BERT model.	2022-11-22 10:00:23 -08:00
Hariharan Seshadri	c7329e004d	Improve fp16 performance of GPT-2's logits MatMul while using BeamSearch (#13686 )	2022-11-18 18:50:19 -08:00
Ye Wang	38a74af45d	Support position_ids broadcasting in EmbedLayerNorm (#13677 ) ### Motivation and Context fix https://github.com/microsoft/onnxruntime/issues/13508	2022-11-17 17:56:27 -08:00
pengwa	d5721b3464	Fix wrong import path in docs (#13680 ) ### Fix wrong import path in docs	2022-11-17 18:15:02 +08:00

584 commits