Commit graph

7946 commits

Author SHA1 Message Date
PeixuanZuo
33367fa2dc
[MIGraphX] update the MIGraphX version used in ORT to rocm-5.4.0 (#14184)
### Description
Update the MIGraphX version used in ORT to rocm-5.4.0

### Motivation and Context
The previous branch, migraphx_for_ort, has stopped being updated and has
fallen too far behind the latest MIGraphX release branch. More discussion here:
https://github.com/microsoft/onnxruntime/issues/14126#issuecomment-1373201049

Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
2023-01-10 13:40:25 +08:00
Yi Zhang
6463f4383b
Make WITHCACHE an option in the macOS workflow (#14188)
### Description
1. Set the WithCache default value to false in the macOS CI workflow as well.
2. Add today's date to the cache key to keep the cache size from growing
indefinitely.

With WithCache, the pipeline duration dropped from over 70 minutes to just
over 10 minutes.
2023-01-10 10:54:19 +08:00
Tianlei Wu
7e751ac6e6
update convert_generation for Attention op change (#14191)
We removed the key and value inputs in
https://github.com/microsoft/onnxruntime/pull/14146, so convert_generation
needs to be updated as well.
2023-01-09 18:04:44 -08:00
Patrice Vignola
c151afec71
[DML EP] Fix unconnected node removal logic (#14193)
### Description
Fix unconnected node removal logic



### Motivation and Context
The edges need to be removed before the nodes themselves, otherwise the
indices will reference the wrong nodes.
2023-01-09 15:40:09 -08:00
Sumit Agarwal
906f578be8
[DML EP] Update DML_FEATURE_LEVEL 5.0 (#14172)
### Description
The DML EP was using a very old feature level (2.0), which could cause
execution failures for models using the latest operators when running
against an old DirectML.dll.



### Motivation and Context
2023-01-09 13:00:56 -08:00
liqun Fu
1be36913cc
to work with onnx 1.13 rc, implement ver 18 reduce and optional ops, … (#13765) 2023-01-09 10:26:16 -08:00
Xavier Dupré
79dc39600f
Replace distutils by setuptools to import build_ext (#14108)
### Description
Uses setuptools instead of distutils.



### Motivation and Context
Fixes #14107.
2023-01-09 11:48:01 +01:00
Patrice Vignola
64541a587d
[DML EP] Remove unconnected nodes from the graph (#14155)
### Description
Remove unconnected nodes from the DML EP graph.



### Motivation and Context
Some operators like `EmbedLayerNorm` have many outputs, and some of those
outputs are non-optional. In practice, however, they act like optional
outputs because they can have a value of 0, which means the rest of the
model doesn't need to depend on them. The problem is that DML will
implicitly remove those outputs from the graph, but the nodes that feed
into them will stay and become unconnected from the rest of the graph,
which is illegal in DML. Removing unconnected nodes as a last pass ensures
those nodes are removed and simplifies the logic of individual operators
by not having to account for these special cases.
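The last-pass sweep described above can be sketched in plain Python (the function and data structures are hypothetical, not the DML EP's actual code): a node is dead when none of its outputs feed a surviving node or a graph output, and removing it can make its producers dead in turn.

```python
def prune_unconnected(nodes, graph_outputs):
    """nodes: dict name -> (input tensors, output tensors).
    Returns the set of node names that remain connected."""
    alive = dict(nodes)
    changed = True
    while changed:
        changed = False
        # A tensor is "consumed" if some surviving node reads it or it is a graph output.
        consumed = {t for (ins, _) in alive.values() for t in ins} | set(graph_outputs)
        for name in list(alive):
            _, outs = alive[name]
            if not any(o in consumed for o in outs):
                del alive[name]   # node feeds nothing downstream
                changed = True    # its producers may now be dangling too
    return set(alive)
```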
2023-01-08 15:20:52 -08:00
Zhang Lei
74fe45bf09
activate past_present_share_buffer for sampling node (#14166) 2023-01-07 19:36:39 -08:00
cloudhan
be879c11ee
Add batched and strided batched gemm as TunableOp (#13841) 2023-01-07 19:11:40 +08:00
Ye Wang
5eac2c1f41
relational attention bias cuda op (#14149)
### Description

This CUDA op implements the compute_bias() method in T5 Attention,
including the permutation.

Notes:
1. bias_table needs to be saved in column-major order; be careful when
implementing the fusion script.
2. The second input (sequence length) is placed on the CPU (using a Shape
node's output should work).
3. The first dimension of the output is 1, so extra_add_qk in Attention
should support broadcasting.
4. compute_bias() is only used in self-attention in T5.

TODO: docs change will be applied later

### Motivation and Context
It's part of the process of optimizing T5 attention as well as T5-based
generation models.
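compute_bias() in T5 derives the bias from relative-position buckets. As a point of reference, here is a stdlib-only sketch of the bucketing function from the T5 formulation (the function name is mine; the CUDA op's internals may differ):

```python
import math

def t5_relative_bucket(relative_position, bidirectional=False,
                       num_buckets=32, max_distance=128):
    """Map a relative position (key_pos - query_pos) to a bias-table bucket,
    following the T5 scheme. Decoder self-attention (the causal case the
    commit mentions) uses bidirectional=False."""
    bucket = 0
    if bidirectional:
        num_buckets //= 2
        if relative_position > 0:
            bucket += num_buckets
        relative_position = abs(relative_position)
    else:
        relative_position = -min(relative_position, 0)
    max_exact = num_buckets // 2
    if relative_position < max_exact:
        return bucket + relative_position   # small offsets: one bucket each
    # larger offsets share logarithmically spaced buckets up to max_distance
    val = max_exact + int(
        math.log(relative_position / max_exact)
        / math.log(max_distance / max_exact) * (num_buckets - max_exact))
    return bucket + min(val, num_buckets - 1)
```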

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-01-06 17:32:58 -08:00
cloudhan
8e2163018d
Ignore more build directories and clangd files (#14154)
Ignore all `build_*` directories in repo root. Ignore `.cache` and
`compile_commands.json` which are related to clangd cache and
configuration.
2023-01-07 06:58:57 +08:00
Tianlei Wu
2cacb24cb0
Add CrossAttention operator (#14146)
Move separated Q, K and V (without input projection) from Attention to a
new operator, CrossAttention.

The Attention operator is hard to maintain when one class must support
both with and without input projection. Add a new operator according to
feedback.

Some changes might be needed in the future, but not in this PR:
(1) bias could be optional (we will not proceed down that route unless
experiments show that fusing the bias Add with MatMul instead of this op
improves performance).
(2) Support packed KV. There are two ways to support it: when key and
value are the same tensor, they are packed; or we can make value
optional, and use packed mode when value is empty and the key contains
packed K/V.
(3) Support cached key and value, and other inputs (like relative
position bias) or more attention mask formats. These can be added easily
without breaking backward compatibility.
(4) ROCm/CPU implementation of this op.
2023-01-06 14:27:40 -08:00
Baiju Meswani
c6ff5bac9d
Update torch in eager mode CI pipeline (#14094) 2023-01-06 11:46:44 -08:00
we1559
c65a03699a
add ThreadingOptions, wraps OrtThreadingOptions (#13711)
…threadpools' options of The Env.

### Description
add a C++ class ThreadingOptions that wraps OrtThreadingOptions, as
described in issue #13710


### Motivation and Context

close #13710

Co-authored-by: zengxiangneng <zengxiangneng@360.cn>
2023-01-06 11:21:10 -08:00
Jian Chen
babc1323e3
Consolidate Identical Children Nodes (#14026)
### Description
In the case where a Q node has multiple identical DQ children, we want to
keep only one DQ. The single remaining DQ channels its output to the
deleted DQ children's consumers.

Example:
Q->N(DQ)  =>  Q->DQ
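The consolidation step can be sketched like this (all names are illustrative, not ORT's internal API): among children with identical op signatures, keep the first and rewire consumers of the duplicates' outputs to the survivor's output.

```python
def consolidate_identical_children(children, edges):
    """children: dict child name -> (op signature, output tensor), all fed by
    the same Q output; edges: dict downstream consumer -> input tensor name.
    Returns (surviving children, rewritten consumer edges)."""
    survivor_out = {}   # op signature -> output tensor of the kept child
    kept = {}
    rewrite = {}        # dropped child's output -> survivor's output
    for name, (signature, out_tensor) in children.items():
        if signature in survivor_out:
            rewrite[out_tensor] = survivor_out[signature]
        else:
            survivor_out[signature] = out_tensor
            kept[name] = (signature, out_tensor)
    # Consumers of a dropped DQ's output now read the survivor's output.
    new_edges = {c: rewrite.get(t, t) for c, t in edges.items()}
    return kept, new_edges
```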


### Motivation and Context
2023-01-06 09:03:10 -08:00
Hariharan Seshadri
d0c5ffd5f7
Misc transformer fixes - 2 (#14156)
### Description
1. The graph pattern search introduced in
https://github.com/microsoft/onnxruntime/pull/13914/ needs to be
enhanced so that SkipLayerNormalization is supported.

2. Fix fp32 parity for GPT-2 when using the `SkipLayerNormalization`
fusion. The optional output of SLN also needs to include the bias (if
present), and the added output should be the sum of `input + skip + (bias)`.

### Motivation and Context
Fix some breaking tests
2023-01-06 07:27:10 -08:00
PeixuanZuo
3702806653
[ROCm] add softmax, topk, layernorm to microbench (#13997)
### Description

Add softmax, layernorm, topk benchmark to microbench.

Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
2023-01-06 18:06:24 +08:00
PeixuanZuo
b222a8e01b
[Fix] build error with MIGraphX tag rocm-5.4.0 (#14141)
### Description

Fix the error https://github.com/microsoft/onnxruntime/issues/14126


### Motivation and Context

Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
2023-01-06 15:51:25 +08:00
zhijiang
0ed7277bbe
fix training compilation option (#14151)
Fix the pipeline failure caused by a compilation option error.
2023-01-06 14:25:03 +08:00
Yi Zhang
2ce7b1c1dc
Enable cache for msbuild (#14085)
### Description
Enable ccache for Windows CPU compilation.
The Windows compilation step in CI can be reduced to just over one minute.

![image](https://user-images.githubusercontent.com/16190118/210294061-86742cf4-65c7-4cc2-9725-e102c3c64abd.png)
2023-01-06 11:19:57 +08:00
Abhishek Udupa
d460c01b8c
Fix skew between GPU/CPU timestamps in ORT profiler (#14004)
### Description
This PR fixes the skew between GPU/CPU timestamps with a more reliable
algorithm.

### Motivation and Context
An earlier implementation attempted to guess the right correction to
apply, but this led to misleading profile outputs. This PR fixes this
problem by utilizing a more reliable technique to normalize GPU
timestamps. Attached are sample profile outputs and visualization
screen-grabs from a run of a transformer-based model before and after
the fix.
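The PR's exact algorithm isn't reproduced here; one reliable way to normalize device timestamps onto the host timeline is to sample (device, host) time pairs at sync points and fit a linear mapping rather than guessing a fixed offset. A hypothetical stdlib-only sketch:

```python
def fit_clock_map(pairs):
    """pairs: [(device_ts, host_ts), ...] sampled at synchronization points.
    Least-squares fit host ~= a * device + b, so GPU timestamps can be
    projected onto the CPU timeline."""
    n = len(pairs)
    sx = sum(d for d, _ in pairs)
    sy = sum(h for _, h in pairs)
    sxx = sum(d * d for d, _ in pairs)
    sxy = sum(d * h for d, h in pairs)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)   # slope (clock drift)
    b = (sy - a * sx) / n                           # intercept (offset)
    return lambda device_ts: a * device_ts + b
```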

Before Fix:

![profile_visualization_cuda_without_fix](https://user-images.githubusercontent.com/17418420/208197234-7390d8e3-4354-4e67-93cf-958c319146ee.png)

After Fix:

![profile_visualization_cuda_with_fix](https://user-images.githubusercontent.com/17418420/208197230-3e108b82-8dfa-476b-9277-7895639a3785.png)

Profiler outputs that are rendered in the visualizations above:

[sample_outputs.zip](https://github.com/microsoft/onnxruntime/files/10249689/sample_outputs.zip)

Co-authored-by: Abhishek Udupa <abhishek.udupa@microsoft.com>
2023-01-05 11:07:26 -08:00
Tang, Cheng
90cff21fa7
Avoid the lock for device stream impact the cpu build (#14131)
### Description
Introduce a runtime flag in SessionState indicating whether any EP in the
current session uses the stream feature; if not, avoid taking the lock.
This avoids impacting the CPU build.

### Motivation and Context
Currently we take a lock in SessionState when retrieving the device
stream collection. This is mainly for reusing device streams for EPs such
as the GPU EPs, so it shouldn't impact builds that don't use the stream
feature, like the CPU build. Instead of playing with build flags, this PR
introduces a runtime flag in SessionState to indicate whether the current
session has any EP that uses the stream feature; if not, we don't need to
take the lock.
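The runtime-flag idea can be sketched as follows (a minimal Python illustration, not ORT's C++ code): sessions without a stream-using EP get a no-op guard instead of the lock.

```python
import threading
from contextlib import nullcontext

class SessionState:
    """Sketch: take the device-stream lock only when at least one EP in the
    session uses the stream feature; CPU-only sessions pay nothing."""
    def __init__(self, has_device_stream_ep):
        self.has_device_stream_ep = has_device_stream_ep
        self._stream_lock = threading.Lock()

    def stream_guard(self):
        # No stream-using EP -> no-op context manager, no lock contention.
        return self._stream_lock if self.has_device_stream_ep else nullcontext()
```

Usage: `with state.stream_guard(): ...` around device-stream-collection retrieval.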

Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2023-01-05 09:01:33 -08:00
PeixuanZuo
4eac0db3af
[ROCm] Add GemmFastGelu CK implementation (#13759)
### Description

Add GemmFastGelu CK implementation.

TODO
1. The performance of CK GemmFastGelu in ORT is not as good as using CK
directly; we still need to investigate the reason and improve the CK
integration in ORT.

```
GemmFastGeluUnfused float16 NN m=49152 n=3072 k=768 2298.8064 us 100.89 tflops
withbias DeviceGemmMultipleD_Xdl_CShuffle<256, 256, 128, 32, 8, 8, Default> LoopScheduler: Default, PipelineVersion: v1 float16 NN m=49152 n=3072 k=768 2401.9799 us 96.56 tflops
```

### Motivation and Context

Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
2023-01-05 17:53:30 +08:00
Adrian Lizarraga
2b45410e52
Fix Prefast warning in CUDA contrib op (#14074)
### Description
Fixes Prefast C26814

```shell
onnxruntime::contrib::cuda::QAttention<onnxruntime::MLFloat16,signed char>::ComputeInternal
onnxruntime/contrib_ops/cuda/quantization/attention_quantization.cc
The const variable 'element_size' can be computed at compile-time. Consider using constexpr (con.5).
```
2023-01-04 19:32:06 -08:00
Adrian Lizarraga
68794d0ac1
Improve custom op library handle cleanup (#14099)
### Description
- Adds a new C API `OrtApi::RegisterCustomOpsLibrary_V2` that manages
the lifetime of dynamic library handles (i.e., calls `dlclose` or
`FreeLibrary`).
- Deprecates C API `OrtApi::RegisterCustomOpsLibrary`.
- Adds C++ API wrapper for convenient registering of custom op
libraries.
- `PySessionOptions` is now an alias of `OrtSessionOptions`

### Motivation and Context
The current API for registering custom op libraries loads dynamic
libraries but requires users to handle the release of the corresponding
library handles. Additionally, the user has to make sure to release the
library handle _after_ the session has been destroyed (or the program
segfaults).

The new API automatically cleans up the library and allows the user to
write more straightforward code.
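The lifetime ordering the new API enforces can be illustrated with a toy session object that owns its library handles and releases them only during its own teardown (purely illustrative; `loader`/`closer` stand in for dlopen/dlclose or LoadLibrary/FreeLibrary):

```python
class Session:
    """Toy model of a session that owns its custom-op library handles and
    releases them only after the session tears down, so kernels never
    outlive the library they came from."""
    def __init__(self, loader, closer):
        self._loader = loader
        self._closer = closer
        self._handles = []

    def register_custom_ops_library(self, path):
        # The session, not the caller, now owns this handle.
        self._handles.append(self._loader(path))

    def close(self):
        # Libraries are released last, in reverse registration order.
        for handle in reversed(self._handles):
            self._closer(handle)
        self._handles.clear()
```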
2023-01-04 17:56:29 -08:00
cloudhan
dc997af695
Use RegisterOp to add Op instead of directly manipulate base class field (#14123)
Add API `RegisterOp` to TunableOp.
2023-01-05 09:02:46 +08:00
Nat Kershaw (MSFT)
b313055ad6
Updated issue router to migrated project (#14114) 2023-01-04 14:47:43 -08:00
Ye Wang
ae148ebc05
T5 skip_layer_norm cuda op (#14093)
### Description

T5 uses a layer_norm that only scales and doesn't shift, also known as
Root Mean Square Layer Normalization (RMSNorm).
ORT already has simplified_layer_norm, which is the RMS layer_norm. This
PR extends this T5 layer_norm with support for skip/bias and the residual
output.
The new op is named SkipSimplifiedLayerNorm and has an interface similar
to SkipLayerNorm's, but removes beta as an input.
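For reference, the math of the new op is small enough to sketch in plain Python (the function name and epsilon value are illustrative, not taken from the kernel):

```python
import math

def skip_simplified_layer_norm(x, skip, gamma, bias=None, eps=1e-6):
    """RMS (simplified) layer norm with skip/bias: normalize s = x+skip+bias
    by its root mean square, scale by gamma, no beta/shift. Also returns s,
    the residual output the commit describes."""
    if bias is None:
        bias = [0.0] * len(x)
    s = [xi + ki + bi for xi, ki, bi in zip(x, skip, bias)]
    rms = math.sqrt(sum(v * v for v in s) / len(s) + eps)
    y = [v / rms * g for v, g in zip(s, gamma)]
    return y, s
```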


### Motivation and Context

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-01-04 13:31:53 -08:00
Patrice Vignola
b6ea60436d
[DML EP] Decouple the bucketized allocator from the individual block allocation logic (#14056)
### Description
Decouple the DML bucketized allocator from the individual block
allocation logic



### Motivation and Context
This is the first step into using tiled/placed resources instead of
committed resources. Given the potential impact of changing the
allocation logic and the large number of edge cases, I decided to take a
step-by-step approach. It will also reduce the size of the PRs to a
reasonable length, while making sure each PR has a single
responsibility.

Decoupling the logic this way will make it easier in the future to plug
in different kinds of "suballocators" if we want to experiment with the
allocation logic. Currently, the only suballocator is a committed
resource, but placed resources are the next step and will come in a
future PR.
2023-01-04 13:13:54 -08:00
Nat Kershaw (MSFT)
f344d4b3d1
Label issues with mobile when android or ios are present (#14033) 2023-01-04 13:03:25 -08:00
Ye Wang
821baa5b83
Support generation script with custom eos/pad token id (#14113)
### Description

When a custom decoder ONNX model is passed in, the user can specify the
eos/pad token ids instead of populating them from the torch config.


### Motivation and Context
2023-01-04 10:51:53 -08:00
Ashwini Khade
e5e3570ac5
fix cg issue (#14112)
### Description
Update torch version to 1.13.1 to fix CG issue:
https://dev.azure.com/aiinfra/ONNX%20Runtime/_workitems/edit/10666/
2023-01-04 09:07:13 -08:00
JiCheng
d279ce2f67
bug fix, NNAPI filter out un-supported graph (#14040)
### Description
"Constant Folding" needs to be enhanced to support "function" in the
ONNX spec. If those nodes are inlined into a sub-graph captured by an EP,
especially an EP that doesn't support them, an error occurs.

There are many test failures in ONNX 1.13 against NNAPI; these are listed
below:
```
 prelu_broadcast_expanded
 selu_example_expanded_ver18
 layer_normalization_2d_axis0
 shrink_hard_expanded_ver18
 elu_expanded_ver18
 softsign_example_expanded_ver18
 leakyrelu_example_expanded
 hardsigmoid_example_expanded_ver18
 thresholdedrelu_default_expanded_ver18
 split_variable_parts_2d_opset18
efault_expanded
 prelu_example_expanded
 thresholdedrelu_example_expanded_ver18
 selu_default_expanded_ver18
 elu_example_expanded_ver18
 hardsigmoid_default_expanded_ver18
 softsign_expanded_ver18
 hardsigmoid_expanded_ver18
 leakyrelu_expanded
 scatter_with_axis
 selu_expanded_ver18
 shrink_soft_expanded_ver18
 relu_expanded_ver18
 thresholdedrelu_expanded_ver18
 elu_default_expanded_ver18
```


Solution: prevent NNAPI from capturing these for now; we can revert this
once better constant folding is implemented.


### Motivation and Context
2023-01-04 20:20:00 +08:00
Vincent Wang
15c1157ef2
New Pattern Support for LayerNormFusion (#14118)
The latest torch exporter changed the LayerNorm exporting code to add two
more Cast nodes (to make the compute logically correct), but our current
LayerNormFusion doesn't support the new pattern. This PR adds support for
it.
2023-01-04 17:51:14 +08:00
Yi Zhang
f864b54393
Use today's cache only (#14120)
### Description
Add date value of today into the cache key.

### Motivation and Context
Microsoft-hosted agents have only 10 GB for builds.
To limit cache size, the pipeline only uses caches generated today.
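The idea can be sketched in a few lines (the key format is illustrative, not the pipeline's actual key): scoping the key to today's date means yesterday's cache is simply never hit, so total cache size stays bounded.

```python
from datetime import date, datetime, timezone

def cache_key(prefix="ccache", today=None):
    # Embed today's date so stale caches age out instead of growing forever.
    today = today or datetime.now(timezone.utc).date()
    return f"{prefix}-{today.isoformat()}"
```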
2023-01-04 17:48:52 +08:00
dependabot[bot]
bdeba4e31c
Bump json5 from 1.0.1 to 1.0.2 in /js (#14109) 2023-01-04 08:54:59 +00:00
Baiju Meswani
0ff61f7b97
Update torch to 1.13.1 in CI and packaging pipelines for ort training (#14055) 2023-01-03 20:03:33 -08:00
cao lei
b29a1c7348
Address follow-up comments on multistream pr #13495 (#13992)
### Description
This PR is to address follow-up comments for the multi-stream pr
https://github.com/microsoft/onnxruntime/pull/13495

Changes include:

- Make StreamAwareArena transparent to minimal build
- Make DeviceStreamCollection transparent to minimal build
- Replace ORT_MUST_USE_RESULT with [[nodiscard]]
- Remove unnecessary shared_ptr


### Motivation and Context
This PR is to address follow-up comments for the multi-stream pr
https://github.com/microsoft/onnxruntime/pull/13495

Co-authored-by: Lei Cao <leca@microsoft.com>
2023-01-03 16:33:36 -08:00
Nat Kershaw (MSFT)
a78ab4fbef
Add labeled to docs workflow trigger (#13979)
Capture the case where an issue is manually labeled
2023-01-03 15:19:22 -08:00
Ashwini Khade
68b5b2d7d3
Refactor training build options (#13964)
### Description
1. Renames all references of on-device training to training apis. This
is to keep the naming general; nothing really prevents us from using the
same apis on servers/non-edge devices.
2. Update ENABLE_TRAINING option: With this PR when this option is
enabled, training apis and torch interop is also enabled.
3. Refactoring for onnxruntime_ENABLE_TRAINING_TORCH_INTEROP option: 
   -  Removed user facing option
- Setting onnxruntime_ENABLE_TRAINING_TORCH_INTEROP to ON when
onnxruntime_ENABLE_TRAINING is ON as we always build with torch interop.

Once this PR is merged, selecting --enable_training will produce a
"FULL build" for training (with all the training entry points and
features).
Training entry points include:
1. ORTModule
2. Training APIs

Features include:
1. ATen Fallback
2. All Training OPs includes communication and collectives
3. Strided Tensor Support
4. Python Op (torch interop)
5. ONNXBlock (front-end tools for training artifact prep when using
training apis)

### Motivation and Context
The intention is to simplify the options for building training-enabled
builds. This is part of the larger work item to create a dedicated build
for learning-on-the-edge scenarios with just the training apis enabled.
2023-01-03 13:28:16 -08:00
Hariharan Seshadri
d43e0ec9ba
Misc transformer fixes (#14103)
### Description
1. SkipLayerNormalization has a new output
(https://github.com/microsoft/onnxruntime/pull/13988) and the symbolic
shape inference script needs corresponding updates

2. The greedy sampling op
(https://github.com/microsoft/onnxruntime/pull/13426) shouldn't re-use
the logits buffer as its corresponding kernel doesn't seem to support it
yet.

### Motivation and Context
Fix some transformer issues
2023-01-03 13:05:55 -08:00
Patrice Vignola
e7f9d40dde
[DML EP] Force instance norm inputs to be 4D to better target metacommands (#14020)
### Description
Force instance norm inputs to be 4D to better target metacommands

### Motivation and Context
This may improve performance on some hardware by allowing the driver to
return valid layouts to DML when querying for metacommand support.
2023-01-03 12:47:10 -08:00
Patrice Vignola
589612106a
[DML EP] Force layer norm inputs to be 4D to better target metacommands (#14022)
### Description
Force layer norm inputs to be 4D to better target metacommands

### Motivation and Context
This may improve performance on some hardware by allowing the driver to
return valid layouts to DML when querying for metacommand support.
2023-01-03 12:46:33 -08:00
RandySheriffH
587e891cae
CloudEP (#13855)
Implement CloudEP for hybrid inferencing.
The PR introduces zero new APIs; customers can configure session and run
options to do inferencing with an Azure [triton
endpoint.](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-deploy-with-triton?tabs=azure-cli%2Cendpoint)
A sample configuration in Python:

```
sess_opt.add_session_config_entry('cloud.endpoint_type', 'triton')
sess_opt.add_session_config_entry('cloud.uri', 'https://cloud.com')
sess_opt.add_session_config_entry('cloud.model_name', 'detection2')
sess_opt.add_session_config_entry('cloud.model_version', '7')  # optional, default 1
sess_opt.add_session_config_entry('cloud.verbose', '1')  # optional, default '0', meaning no verbose
...
run_opt.add_run_config_entry('use_cloud', '1')  # 0 for local inferencing, 1 for cloud endpoint
run_opt.add_run_config_entry('cloud.auth_key', '...')
...
sess.run(None, {'input': input_}, run_opt)
```

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-01-03 10:03:15 -08:00
Yi Zhang
52e3fe961d
add dnnl dependency in unittest.cmake (#14104)
### Description
It's from PR #14085.
When multiple msbuilds run in parallel, it throws the exception:
```
22-12-30T16:35:34.2423207Z ##[error]C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\MSBuild\Microsoft\VC\v160\Microsoft.CppCommon.targets(155,5): Error MSB3073: The command "setlocal
"C:\Program Files\CMake\bin\cmake.exe" -E copy D:/a/_work/1/b/RelWithDebInfo/dnnl/install/bin/dnnl.dll D:/a/_work/1/b/RelWithDebInfo/RelWithDebInfo
if %errorlevel% neq 0 goto :cmEnd
:cmEnd
endlocal & call :cmErrorLevel %errorlevel% & goto :cmDone
:cmErrorLevel
exit /b %1
:cmDone
if %errorlevel% neq 0 goto :VCEnd
:VCEnd" exited with code 1.
```

https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=847423&view=logs&j=249e9d58-0012-5814-27cf-6a201adbd9cf&t=182b9780-832e-5dcb-3957-d6aa3ece582f
This change makes sure that the onnxruntime_test_all project depends on
the dnnl project.
2023-01-03 11:24:06 +08:00
cloudhan
613920d6c5
Remove all usages of hipLaunchKernelGGL (#14089)
All those macro syntaxes are a mistake, per
https://github.com/ROCm-Developer-Tools/HIP-CPU/issues/8#issuecomment-756188453;
they should have been corrected in the documentation but were not. We
moved away from hipThreadIdx_* in some previous commits; now we move away
from hipLaunchKernelGGL.
2023-01-02 12:55:44 +08:00
Tianlei Wu
6a9dc6c993
[CUDA] Update fused MHA to support flash attention and causal mask (#13953)
### Description
Update fused attention kernels to support flash attention and causal
mask (GPT-2 initial decoder run).

Note: Causal kernels are from FasterTransformer 5.2. Flash attention
kernels that are not causal are from TensorRT 8.5.1.

#### Performance Test of bert-base model

Test like the following:
```
 python -m onnxruntime.transformers.benchmark -m bert-base-cased -b 1 4 8 16 32 64 -s 512 -t 1000 -o by_script -g -p fp16 -i 3 --use_mask_index
```

Original Flash Attention is from
https://github.com/HazyResearch/flash-attention. RemovePadding and
RestorePadding are added before/after the original flash attention, but
not in this PR, so the result is not an apples-to-apples comparison. It
is included for reference only.

Average latency (ms) of float16 bert-base-cased model:

* A100

Kernel | b1_s512 | b4_s512 | b8_s512 | b16_s512 | b32_s512 | b64_s512 | b128_s512
-- | -- | -- | -- | -- | -- | -- | --
Unfused | 1.83 | 5.00 | 9.31 | 17.76 | 34.47 | 67.43 | 133.38
TRT Fused | 2.05 | 3.58 | 5.70 | 10.96 | 21.22 | 41.23 | 80.56
Flash Attention (from FT) | 1.43 | 3.20 | 5.71 | 10.95 | 22.19 | 42.96 | 84.54
Flash Attention (from TRT) | 1.44 | 3.28 | 5.70 | 10.86 | 21.00 | 40.56 | 79.53
Original Flash Attention | 1.81 | 4.04 | 6.82 | 13.06 | 24.62 | 46.58 | 91.10

* T4

Kernel | b1_s512 | b4_s512 | b8_s512 | b16_s512 | b32_s512 | b64_s512
-- | -- | -- | -- | -- | -- | --
Unfused | 8.17 | 29.86 | 59.56 | 115.77 | 236.66 | 461.43
Flash Attention (from FT) | 5.65 | 21.12 | 44.94 | 86.83 | 174.16 | 351.38
Flash Attention (from TRT) | 5.73 | 21.49 | 45.49 | 89.15 | 174.37 | 352.08
Original Flash Attention | 6.22 | 22.16 | 43.39 | 83.8 | 168.77 | 337.04

* V100

Kernel | b1_s512 | b4_s512 | b8_s512 | b16_s512 | b32_s512 | b64_s512
-- | -- | -- | -- | -- | -- | --
Unfused | 3.77 | 10.48 | 19.53 | 37.63 | 73.68 | 145.58
Flash Attention (from FT) | 3.21 | 8.25 | 14.95 | 28.83 | 56.28 | 111.15

#### Performance Test of GPT-2 model
Test like the following:
`
python benchmark_gpt2.py -m distilgpt2 -o --stage 1 --use_gpu -p fp16 -b
1 4 8 16 32 64 128 -s 0 --sequence_lengths 8 16 32 64 128 256 512
`
* A100

Note that flash attention is used as fused attention when
sequence_length > 128.

batch_size | sequence_length | with Fused Attention | without Fused Attention | A100 Gain
1 | 8 | 0.93 | 1 | 7.0%
4 | 8 | 0.82 | 0.88 | 6.8%
8 | 8 | 0.84 | 0.88 | 4.5%
16 | 8 | 0.92 | 0.97 | 5.2%
32 | 8 | 1.15 | 1.17 | 1.7%
64 | 8 | 1.68 | 1.72 | 2.3%
128 | 8 | 2.76 | 2.78 | 0.7%
1 | 16 | 0.95 | 0.95 | 0.0%
4 | 16 | 0.83 | 0.88 | 5.7%
8 | 16 | 0.91 | 0.97 | 6.2%
16 | 16 | 1.12 | 1.17 | 4.3%
32 | 16 | 1.67 | 1.72 | 2.9%
64 | 16 | 2.73 | 2.76 | 1.1%
128 | 16 | 4.96 | 4.95 | -0.2%
1 | 32 | 0.94 | 0.88 | -6.8%
4 | 32 | 0.91 | 0.97 | 6.2%
8 | 32 | 1.12 | 1.17 | 4.3%
16 | 32 | 1.65 | 1.71 | 3.5%
32 | 32 | 2.69 | 2.76 | 2.5%
64 | 32 | 4.86 | 4.94 | 1.6%
128 | 32 | 9.35 | 9.38 | 0.3%
1 | 64 | 0.84 | 0.88 | 4.5%
4 | 64 | 1.1 | 1.17 | 6.0%
8 | 64 | 1.64 | 1.73 | 5.2%
16 | 64 | 2.66 | 2.77 | 4.0%
32 | 64 | 4.82 | 4.97 | 3.0%
64 | 64 | 9.23 | 9.4 | 1.8%
128 | 64 | 18.54 | 19.12 | 3.0%
1 | 128 | 0.91 | 0.98 | 7.1%
4 | 128 | 1.68 | 1.74 | 3.4%
8 | 128 | 2.71 | 2.83 | 4.2%
16 | 128 | 4.85 | 5.09 | 4.7%
32 | 128 | 9.32 | 9.69 | 3.8%
64 | 128 | 18.54 | 19.44 | 4.6%
128 | 128 | 36.86 | 38.47 | 4.2%
1 | 256 | 1.15 | 1.23 | 6.5%
4 | 256 | 2.71 | 2.95 | 8.1%
8 | 256 | 4.87 | 5.3 | 8.1%
16 | 256 | 9.32 | 10.23 | 8.9%
32 | 256 | 18.6 | 20.53 | 9.4%
64 | 256 | 36.93 | 40.41 | 8.6%
128 | 256 | 72.84 | 80.14 | 9.1%
1 | 512 | 1.68 | 1.96 | 14.3%
4 | 512 | 4.9 | 6.02 | 18.6%
8 | 512 | 9.4 | 11.59 | 18.9%
16 | 512 | 18.71 | 23.05 | 18.8%
32 | 512 | 37.13 | 45.46 | 18.3%
64 | 512 | 74.04 | 89.88 | 17.6%
128 | 512 | NA | NA | NA

* T4:

batch_size | sequence_length | with Fused Attention | with Unfused Attention | T4 Gain
1 | 8 | 1.97 | 2.11 | 6.6%
4 | 8 | 2.2 | 2.25 | 2.2%
8 | 8 | 2.77 | 3.1 | 10.6%
16 | 8 | 4.17 | 4.2 | 0.7%
32 | 8 | 6.86 | 6.82 | -0.6%
64 | 8 | 14.88 | 14.92 | 0.3%
128 | 8 | 31.4 | 31.29 | -0.4%
1 | 16 | 1.61 | 1.71 | 5.8%
4 | 16 | 2.13 | 2.31 | 7.8%
8 | 16 | 3.38 | 3.67 | 7.9%
16 | 16 | 6.16 | 6.54 | 5.8%
32 | 16 | 14.16 | 14.76 | 4.1%
64 | 16 | 30.36 | 30.57 | 0.7%
128 | 16 | 63.14 | 63.57 | 0.7%
1 | 32 | 1.53 | 1.69 | 9.5%
4 | 32 | 3.34 | 3.66 | 8.7%
8 | 32 | 6.25 | 6.64 | 5.9%
16 | 32 | 14.12 | 14.9 | 5.2%
32 | 32 | 28.96 | 29.82 | 2.9%
64 | 32 | 61.07 | 61.77 | 1.1%
128 | 32 | 116.38 | 117.98 | 1.4%
1 | 64 | 2.01 | 2.21 | 9.0%
4 | 64 | 6.18 | 6.67 | 7.3%
8 | 64 | 13.72 | 14.49 | 5.3%
16 | 64 | 28.71 | 29.83 | 3.8%
32 | 64 | 58.65 | 60.68 | 3.3%
64 | 64 | 113.09 | 113.17 | 0.1%
128 | 64 | 205.21 | 209.4 | 2.0%
1 | 128 | 3.37 | 3.76 | 10.4%
4 | 128 | 13.54 | 14.85 | 8.8%
8 | 128 | 28.32 | 30.22 | 6.3%
16 | 128 | 58.16 | 62.09 | 6.3%
32 | 128 | 109.17 | 113.99 | 4.2%
64 | 128 | 198.9 | 207.1 | 4.0%
128 | 128 | 413.25 | 421.82 | 2.0%
1 | 256 | 6.33 | 7.05 | 10.2%
4 | 256 | 28.09 | 31.49 | 10.8%
8 | 256 | 57.47 | 62.76 | 8.4%
16 | 256 | 106.77 | 117.95 | 9.5%
32 | 256 | 197.02 | 208.58 | 5.5%
64 | 256 | 406.81 | 431.36 | 5.7%
128 | 256 | NA | NA | NA
1 | 512 | 13.84 | 16.32 | 15.2%
4 | 512 | NA | NA | NA
8 | 512 | NA | NA | NA
16 | 512 | NA | NA | NA
32 | 512 | NA | NA | NA
64 | 512 | NA | NA | NA
128 | 512 | NA | NA | NA

* V100:

batch_size | sequence_length | with Fused Attention | with Unfused Attention | V100 Gain
1 | 8 | 1.31 | 1.6 | 18.1%
4 | 8 | 1.17 | 1.26 | 7.1%
8 | 8 | 1.43 | 1.79 | 20.1%
16 | 8 | 2.14 | 1.96 | -9.2%
32 | 8 | 2.91 | 3.08 | 5.5%
64 | 8 | 5.32 | 5.27 | -0.9%
128 | 8 | 9.34 | 8.97 | -4.1%
1 | 16 | 1.41 | 1.58 | 10.8%
4 | 16 | 1.38 | 1.49 | 7.4%
8 | 16 | 1.81 | 2.2 | 17.7%
16 | 16 | 2.8 | 2.83 | 1.1%
32 | 16 | 4.94 | 4.99 | 1.0%
64 | 16 | 8.88 | 8.84 | -0.5%
128 | 16 | 17.35 | 17.2 | -0.9%
1 | 32 | 1.38 | 1.77 | 22.0%
4 | 32 | 1.77 | 1.93 | 8.3%
8 | 32 | 2.71 | 2.86 | 5.2%
16 | 32 | 5.03 | 4.92 | -2.2%
32 | 32 | 8.8 | 8.79 | -0.1%
64 | 32 | 17.29 | 17.23 | -0.3%
128 | 32 | 33.27 | 33.1 | -0.5%
1 | 64 | 1.67 | 1.87 | 10.7%
4 | 64 | 2.69 | 2.76 | 2.5%
8 | 64 | 4.87 | 4.94 | 1.4%
16 | 64 | 8.73 | 8.81 | 0.9%
32 | 64 | 16.92 | 17.24 | 1.9%
64 | 64 | 33 | 33.38 | 1.1%
128 | 64 | 65.33 | 65.86 | 0.8%
1 | 128 | 2.03 | 2.22 | 8.6%
4 | 128 | 4.9 | 5.04 | 2.8%
8 | 128 | 8.76 | 8.81 | 0.6%
16 | 128 | 17.06 | 17.29 | 1.3%
32 | 128 | 33.25 | 33.56 | 0.9%
64 | 128 | 65.54 | 66.5 | 1.4%
128 | 128 | 130.44 | 131.44 | 0.8%
1 | 256 | 2.78 | 2.86 | 2.8%
4 | 256 | 8.75 | 9.04 | 3.2%
8 | 256 | 17 | 17.68 | 3.8%
16 | 256 | 33.19 | 34.32 | 3.3%
32 | 256 | 65.43 | 67.86 | 3.6%
64 | 256 | 129.92 | 134.68 | 3.5%
128 | 256 | NA | NA | NA
1 | 512 | 4.95 | 5.32 | 7.0%
4 | 512 | NA | NA | NA
8 | 512 | NA | NA | NA
16 | 512 | NA | NA | NA
32 | 512 | NA | NA | NA
64 | 512 | NA | NA | NA
128 | 512 | NA | NA | NA
2022-12-31 10:33:54 -08:00
Dmitri Smirnov
5d729839b5
Support loading widechar paths on windows (#14066)
### Description
Make GetRuntimePath() and LoadDynamicLibrary() operate on
platform-specific paths

### Motivation and Context
This addresses https://github.com/microsoft/onnxruntime/issues/14063
2022-12-30 16:30:11 -08:00
Baiju Meswani
b85878953f
Fix nightly ort training ci pipeline (#14007) 2022-12-30 12:28:57 -08:00