Commit graph

7946 commits

Author SHA1 Message Date
PeixuanZuo
33367fa2dc
[MIGraphX] update the MIGraphX version used in ORT to rocm-5.4.0 (#14184)
### Description
Update the MIGraphX version used in ORT to rocm-5.4.0

### Motivation and Context
The previous branch, migraphx_for_ort, has stopped being updated and has
fallen too far behind the latest MIGraphX release branch. More discussion here:
https://github.com/microsoft/onnxruntime/issues/14126#issuecomment-1373201049

Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
2023-01-10 13:40:25 +08:00
Yi Zhang
6463f4383b
Make WITHCACHE an option in the macOS workflow (#14188)
### Description
1. Set the WithCache default value to false in the macOS CI workflow as well.
2. Add today's date to the cache key to keep the cache size from growing
indefinitely.

With WithCache, the pipeline duration dropped from over 70 minutes to just
over 10 minutes.
2023-01-10 10:54:19 +08:00
Tianlei Wu
7e751ac6e6
update convert_generation for Attention op change (#14191)
We removed the key and value inputs in
https://github.com/microsoft/onnxruntime/pull/14146, so convert_generation
needs to be updated as well.
2023-01-09 18:04:44 -08:00
Patrice Vignola
c151afec71
[DML EP] Fix unconnected node removal logic (#14193)
### Description
Fix unconnected node removal logic



### Motivation and Context
The edges need to be removed before the nodes themselves, otherwise the
indices will reference the wrong nodes.
2023-01-09 15:40:09 -08:00
Sumit Agarwal
906f578be8
[DML EP] Update DML_FEATURE_LEVEL 5.0 (#14172)
### Description
The DML EP was using a very old feature level (2.0), which could cause
execution failures for models using the latest operators when running
against an old DirectML.dll.



### Motivation and Context
2023-01-09 13:00:56 -08:00
liqun Fu
1be36913cc
to work with onnx 1.13 rc, implement ver 18 reduce and optional ops, … (#13765) 2023-01-09 10:26:16 -08:00
Xavier Dupré
79dc39600f
Replace distutils by setuptools to import build_ext (#14108)
### Description
Uses setuptools instead of distutils.



### Motivation and Context
Fixes #14107.
2023-01-09 11:48:01 +01:00
Patrice Vignola
64541a587d
[DML EP] Remove unconnected nodes from the graph (#14155)
### Description
Remove unconnected nodes from the DML EP graph.



### Motivation and Context
Some operators like `EmbedLayerNorm` have many outputs, and some of those
outputs are non-optional. In practice, however, they act like optional
outputs because they can have a value of 0, which means the rest of the
model doesn't need to depend on them. The problem is that DML will
implicitly remove those outputs from the graph, but the nodes that feed
into them will stay and become unconnected from the rest of the graph,
which is illegal in DML. Removing unconnected nodes as a last pass ensures
those nodes are removed and simplifies the logic of individual operators
by not having to account for these special cases.
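The last-pass sweep described above can be sketched in plain Python (the function and data structures are hypothetical, not the DML EP's actual code): a node is dead when none of its outputs feed a surviving node or a graph output, and removing it can make its producers dead in turn.

```python
def prune_unconnected(nodes, graph_outputs):
    """nodes: dict name -> (input tensors, output tensors).
    Returns the set of node names that remain connected."""
    alive = dict(nodes)
    changed = True
    while changed:
        changed = False
        # A tensor is "consumed" if some surviving node reads it or it is a graph output.
        consumed = {t for (ins, _) in alive.values() for t in ins} | set(graph_outputs)
        for name in list(alive):
            _, outs = alive[name]
            if not any(o in consumed for o in outs):
                del alive[name]   # node feeds nothing downstream
                changed = True    # its producers may now be dangling too
    return set(alive)
```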
2023-01-08 15:20:52 -08:00
Zhang Lei
74fe45bf09
activate past_present_share_buffer for sampling node (#14166) 2023-01-07 19:36:39 -08:00
cloudhan
be879c11ee
Add batched and strided batched gemm as TunableOp (#13841) 2023-01-07 19:11:40 +08:00
Ye Wang
5eac2c1f41
relational attention bias cuda op (#14149)
### Description

This CUDA op implements the compute_bias() method in T5 Attention,
including the permutation.

Notes:
1. bias_table needs to be saved in column-major order; be careful when
implementing the fusion script.
2. The second input (sequence length) is placed on the CPU (using a Shape
node's output should work).
3. The first dimension of the output is 1, so extra_add_qk in Attention
should support broadcasting.
4. compute_bias() is only used in self-attention in T5.

TODO: docs change will be applied later

### Motivation and Context
It's part of the process of optimizing T5 attention as well as T5-based
generation models.
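compute_bias() in T5 derives the bias from relative-position buckets. As a point of reference, here is a stdlib-only sketch of the bucketing function from the T5 formulation (the function name is mine; the CUDA op's internals may differ):

```python
import math

def t5_relative_bucket(relative_position, bidirectional=False,
                       num_buckets=32, max_distance=128):
    """Map a relative position (key_pos - query_pos) to a bias-table bucket,
    following the T5 scheme. Decoder self-attention (the causal case the
    commit mentions) uses bidirectional=False."""
    bucket = 0
    if bidirectional:
        num_buckets //= 2
        if relative_position > 0:
            bucket += num_buckets
        relative_position = abs(relative_position)
    else:
        relative_position = -min(relative_position, 0)
    max_exact = num_buckets // 2
    if relative_position < max_exact:
        return bucket + relative_position   # small offsets: one bucket each
    # larger offsets share logarithmically spaced buckets up to max_distance
    val = max_exact + int(
        math.log(relative_position / max_exact)
        / math.log(max_distance / max_exact) * (num_buckets - max_exact))
    return bucket + min(val, num_buckets - 1)
```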

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-01-06 17:32:58 -08:00
cloudhan
8e2163018d
Ignore more build directories and clangd files (#14154)
Ignore all `build_*` directories in repo root. Ignore `.cache` and
`compile_commands.json` which are related to clangd cache and
configuration.
2023-01-07 06:58:57 +08:00
Tianlei Wu
2cacb24cb0
Add CrossAttention operator (#14146)
Move separated Q, K and V (without input projection) from Attention to a
new operator, CrossAttention.

The Attention operator is hard to maintain when one class must support
both with and without input projection. Add a new operator according to
feedback.

Some changes might be needed in the future, but not in this PR:
(1) bias could be optional (we will not proceed down that route unless
experiments show that fusing the bias Add with MatMul instead of this op
improves performance).
(2) Support packed KV. There are two ways to support it: when key and
value are the same tensor, they are packed; or we can make value
optional, and use packed mode when value is empty and the key contains
packed K/V.
(3) Support cached key and value, and other inputs (like relative
position bias) or more attention mask formats. These can be added easily
without breaking backward compatibility.
(4) ROCm/CPU implementation of this op.
2023-01-06 14:27:40 -08:00
Baiju Meswani
c6ff5bac9d
Update torch in eager mode CI pipeline (#14094) 2023-01-06 11:46:44 -08:00
we1559
c65a03699a
add ThreadingOptions, wraps OrtThreadingOptions (#13711)
…threadpools' options of The Env.

### Description
add a C++ class ThreadingOptions that wraps OrtThreadingOptions, as
described in issue #13710


### Motivation and Context

close #13710

Co-authored-by: zengxiangneng <zengxiangneng@360.cn>
2023-01-06 11:21:10 -08:00
Jian Chen
babc1323e3
Consolidate Identical Children Nodes (#14026)
### Description
In the case where a Q node has multiple identical DQ children, we want to
keep only one DQ. The single remaining DQ channels its output to the
deleted DQ children's consumers.

Example:
Q->N(DQ)  =>  Q->DQ
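The consolidation step can be sketched like this (all names are illustrative, not ORT's internal API): among children with identical op signatures, keep the first and rewire consumers of the duplicates' outputs to the survivor's output.

```python
def consolidate_identical_children(children, edges):
    """children: dict child name -> (op signature, output tensor), all fed by
    the same Q output; edges: dict downstream consumer -> input tensor name.
    Returns (surviving children, rewritten consumer edges)."""
    survivor_out = {}   # op signature -> output tensor of the kept child
    kept = {}
    rewrite = {}        # dropped child's output -> survivor's output
    for name, (signature, out_tensor) in children.items():
        if signature in survivor_out:
            rewrite[out_tensor] = survivor_out[signature]
        else:
            survivor_out[signature] = out_tensor
            kept[name] = (signature, out_tensor)
    # Consumers of a dropped DQ's output now read the survivor's output.
    new_edges = {c: rewrite.get(t, t) for c, t in edges.items()}
    return kept, new_edges
```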


### Motivation and Context
2023-01-06 09:03:10 -08:00
Hariharan Seshadri
d0c5ffd5f7
Misc transformer fixes - 2 (#14156)
### Description
1. The graph pattern search introduced in
https://github.com/microsoft/onnxruntime/pull/13914/ needs to be
enhanced so that SkipLayerNormalization is supported.

2. Fix fp32 parity for GPT-2 when using the `SkipLayerNormalization`
fusion. The optional output of SLN also needs to include the bias (if
present), and the added output should be the sum of `input + skip + (bias)`.

### Motivation and Context
Fix some breaking tests
2023-01-06 07:27:10 -08:00
PeixuanZuo
3702806653
[ROCm] add softmax, topk, layernorm to microbench (#13997)
### Description

Add softmax, layernorm, topk benchmark to microbench.

Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
2023-01-06 18:06:24 +08:00
PeixuanZuo
b222a8e01b
[Fix] build error with MIGraphX tag rocm-5.4.0 (#14141)
### Description

Fix the error https://github.com/microsoft/onnxruntime/issues/14126


### Motivation and Context

Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
2023-01-06 15:51:25 +08:00
zhijiang
0ed7277bbe
fix training compilation option (#14151)
Fix the pipeline failure caused by a compilation option error.
2023-01-06 14:25:03 +08:00
Yi Zhang
2ce7b1c1dc
Enable cache for msbuild (#14085)
### Description
Enable ccache for Windows CPU compilation.
The Windows compilation step in CI can be reduced to just over one minute.

![image](https://user-images.githubusercontent.com/16190118/210294061-86742cf4-65c7-4cc2-9725-e102c3c64abd.png)
2023-01-06 11:19:57 +08:00
Abhishek Udupa
d460c01b8c
Fix skew between GPU/CPU timestamps in ORT profiler (#14004)
### Description
This PR fixes the skew between GPU/CPU timestamps with a more reliable
algorithm.

### Motivation and Context
An earlier implementation attempted to guess the right correction to
apply, but this led to misleading profile outputs. This PR fixes this
problem by utilizing a more reliable technique to normalize GPU
timestamps. Attached are sample profile outputs and visualization
screen-grabs from a run of a transformer-based model before and after
the fix.
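The PR's exact algorithm isn't reproduced here; one reliable way to normalize device timestamps onto the host timeline is to sample (device, host) time pairs at sync points and fit a linear mapping rather than guessing a fixed offset. A hypothetical stdlib-only sketch:

```python
def fit_clock_map(pairs):
    """pairs: [(device_ts, host_ts), ...] sampled at synchronization points.
    Least-squares fit host ~= a * device + b, so GPU timestamps can be
    projected onto the CPU timeline."""
    n = len(pairs)
    sx = sum(d for d, _ in pairs)
    sy = sum(h for _, h in pairs)
    sxx = sum(d * d for d, _ in pairs)
    sxy = sum(d * h for d, h in pairs)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)   # slope (clock drift)
    b = (sy - a * sx) / n                           # intercept (offset)
    return lambda device_ts: a * device_ts + b
```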

Before Fix:

![profile_visualization_cuda_without_fix](https://user-images.githubusercontent.com/17418420/208197234-7390d8e3-4354-4e67-93cf-958c319146ee.png)

After Fix:

![profile_visualization_cuda_with_fix](https://user-images.githubusercontent.com/17418420/208197230-3e108b82-8dfa-476b-9277-7895639a3785.png)

Profiler outputs that are rendered in the visualizations above:

[sample_outputs.zip](https://github.com/microsoft/onnxruntime/files/10249689/sample_outputs.zip)

Co-authored-by: Abhishek Udupa <abhishek.udupa@microsoft.com>
2023-01-05 11:07:26 -08:00
Tang, Cheng
90cff21fa7
Avoid the lock for device stream impact the cpu build (#14131)
### Description
Introduce a runtime flag in SessionState indicating whether any EP in the
current session uses the stream feature; if not, avoid taking the lock.
This avoids impacting the CPU build.

### Motivation and Context
Currently we take a lock in SessionState when retrieving the device
stream collection. This is mainly for reusing device streams for EPs such
as the GPU EPs, so it shouldn't impact builds that don't use the stream
feature, like the CPU build. Instead of playing with build flags, this PR
introduces a runtime flag in SessionState to indicate whether the current
session has any EP that uses the stream feature; if not, we don't need to
take the lock.
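The runtime-flag idea can be sketched as follows (a minimal Python illustration, not ORT's C++ code): sessions without a stream-using EP get a no-op guard instead of the lock.

```python
import threading
from contextlib import nullcontext

class SessionState:
    """Sketch: take the device-stream lock only when at least one EP in the
    session uses the stream feature; CPU-only sessions pay nothing."""
    def __init__(self, has_device_stream_ep):
        self.has_device_stream_ep = has_device_stream_ep
        self._stream_lock = threading.Lock()

    def stream_guard(self):
        # No stream-using EP -> no-op context manager, no lock contention.
        return self._stream_lock if self.has_device_stream_ep else nullcontext()
```

Usage: `with state.stream_guard(): ...` around device-stream-collection retrieval.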

Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2023-01-05 09:01:33 -08:00
PeixuanZuo
4eac0db3af
[ROCm] Add GemmFastGelu CK implementation (#13759)
### Description

Add GemmFastGelu CK implementation.

TODO
1. The performance of CK GemmFastGelu in ORT is not as good as using CK
directly; we still need to investigate the reason and improve the CK
integration in ORT.

```
GemmFastGeluUnfused float16 NN m=49152 n=3072 k=768 2298.8064 us 100.89 tflops
withbias DeviceGemmMultipleD_Xdl_CShuffle<256, 256, 128, 32, 8, 8, Default> LoopScheduler: Default, PipelineVersion: v1 float16 NN m=49152 n=3072 k=768 2401.9799 us 96.56 tflops
```

### Motivation and Context

Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
2023-01-05 17:53:30 +08:00
Adrian Lizarraga
2b45410e52
Fix Prefast warning in CUDA contrib op (#14074)
### Description
Fixes Prefast C26814

```shell
onnxruntime::contrib::cuda::QAttention<onnxruntime::MLFloat16,signed char>::ComputeInternal
onnxruntime/contrib_ops/cuda/quantization/attention_quantization.cc
The const variable 'element_size' can be computed at compile-time. Consider using constexpr (con.5).
```
2023-01-04 19:32:06 -08:00
Adrian Lizarraga
68794d0ac1
Improve custom op library handle cleanup (#14099)
### Description
- Adds a new C API `OrtApi::RegisterCustomOpsLibrary_V2` that manages
the lifetime of dynamic library handles (i.e., calls `dlclose` or
`FreeLibrary`).
- Deprecates C API `OrtApi::RegisterCustomOpsLibrary`.
- Adds C++ API wrapper for convenient registering of custom op
libraries.
- `PySessionOptions` is now an alias of `OrtSessionOptions`

### Motivation and Context
The current API for registering custom op libraries loads dynamic
libraries but requires users to handle the release of the corresponding
library handles. Additionally, the user has to make sure to release the
library handle _after_ the session has been destroyed (or the program
segfaults).

The new API automatically cleans up the library and allows the user to
write more straightforward code.
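The lifetime ordering the new API enforces can be illustrated with a toy session object that owns its library handles and releases them only during its own teardown (purely illustrative; `loader`/`closer` stand in for dlopen/dlclose or LoadLibrary/FreeLibrary):

```python
class Session:
    """Toy model of a session that owns its custom-op library handles and
    releases them only after the session tears down, so kernels never
    outlive the library they came from."""
    def __init__(self, loader, closer):
        self._loader = loader
        self._closer = closer
        self._handles = []

    def register_custom_ops_library(self, path):
        # The session, not the caller, now owns this handle.
        self._handles.append(self._loader(path))

    def close(self):
        # Libraries are released last, in reverse registration order.
        for handle in reversed(self._handles):
            self._closer(handle)
        self._handles.clear()
```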
2023-01-04 17:56:29 -08:00
cloudhan
dc997af695
Use RegisterOp to add Op instead of directly manipulate base class field (#14123)
Add API `RegisterOp` to TunableOp.
2023-01-05 09:02:46 +08:00
Nat Kershaw (MSFT)
b313055ad6
Updated issue router to migrated project (#14114) 2023-01-04 14:47:43 -08:00
Ye Wang
ae148ebc05
T5 skip_layer_norm cuda op (#14093)
### Description

T5 uses a layer_norm that only scales and doesn't shift, also known as
Root Mean Square Layer Normalization (RMSNorm).
ORT already has simplified_layer_norm, which is the RMS layer_norm. This
PR extends this T5 layer_norm with support for skip/bias and the residual
output.
The new op is named SkipSimplifiedLayerNorm and has an interface similar
to SkipLayerNorm's, but removes beta as an input.
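For reference, the math of the new op is small enough to sketch in plain Python (the function name and epsilon value are illustrative, not taken from the kernel):

```python
import math

def skip_simplified_layer_norm(x, skip, gamma, bias=None, eps=1e-6):
    """RMS (simplified) layer norm with skip/bias: normalize s = x+skip+bias
    by its root mean square, scale by gamma, no beta/shift. Also returns s,
    the residual output the commit describes."""
    if bias is None:
        bias = [0.0] * len(x)
    s = [xi + ki + bi for xi, ki, bi in zip(x, skip, bias)]
    rms = math.sqrt(sum(v * v for v in s) / len(s) + eps)
    y = [v / rms * g for v, g in zip(s, gamma)]
    return y, s
```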


### Motivation and Context

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
2023-01-04 13:31:53 -08:00
Patrice Vignola
b6ea60436d
[DML EP] Decouple the bucketized allocator from the individual block allocation logic (#14056)
### Description
Decouple the DML bucketized allocator from the individual block
allocation logic



### Motivation and Context
This is the first step into using tiled/placed resources instead of
committed resources. Given the potential impact of changing the
allocation logic and the large number of edge cases, I decided to take a
step-by-step approach. It will also reduce the size of the PRs to a
reasonable length, while making sure each PR has a single
responsibility.

Decoupling the logic this way will make it easier in the future to plug
in different kinds of "suballocators" if we want to experiment with the
allocation logic. Currently, the only suballocator is a committed
resource, but placed resources are the next step and will come in a
future PR.
2023-01-04 13:13:54 -08:00
Nat Kershaw (MSFT)
f344d4b3d1
Label issues with mobile when android or ios are present (#14033) 2023-01-04 13:03:25 -08:00
Ye Wang
821baa5b83
Support generation script with custom eos/pad token id (#14113)
### Description

When a custom decoder ONNX model is passed in, the user can specify the
eos/pad token ids instead of populating them from the torch config.


### Motivation and Context
2023-01-04 10:51:53 -08:00
Ashwini Khade
e5e3570ac5
fix cg issue (#14112)
### Description
Update torch version to 1.13.1 to fix CG issue:
https://dev.azure.com/aiinfra/ONNX%20Runtime/_workitems/edit/10666/
2023-01-04 09:07:13 -08:00
JiCheng
d279ce2f67
bug fix, NNAPI filter out un-supported graph (#14040)
### Description
"Constant Folding" needs to be enhanced to support "function" in the
ONNX spec. If those nodes are inlined into a sub-graph captured by an EP,
especially an EP that doesn't support them, an error occurs.

There are many test failures in ONNX 1.13 against NNAPI; these are listed
below:
```
 prelu_broadcast_expanded
 selu_example_expanded_ver18
 layer_normalization_2d_axis0
 shrink_hard_expanded_ver18
 elu_expanded_ver18
 softsign_example_expanded_ver18
 leakyrelu_example_expanded
 hardsigmoid_example_expanded_ver18
 thresholdedrelu_default_expanded_ver18
 split_variable_parts_2d_opset18
efault_expanded
 prelu_example_expanded
 thresholdedrelu_example_expanded_ver18
 selu_default_expanded_ver18
 elu_example_expanded_ver18
 hardsigmoid_default_expanded_ver18
 softsign_expanded_ver18
 hardsigmoid_expanded_ver18
 leakyrelu_expanded
 scatter_with_axis
 selu_expanded_ver18
 shrink_soft_expanded_ver18
 relu_expanded_ver18
 thresholdedrelu_expanded_ver18
 elu_default_expanded_ver18
```


Solution: prevent NNAPI from capturing these for now; we can revert this
once better constant folding is implemented.


### Motivation and Context
2023-01-04 20:20:00 +08:00
Vincent Wang
15c1157ef2
New Pattern Support for LayerNormFusion (#14118)
The latest torch exporter changed the LayerNorm exporting code to add two
more Cast nodes (to make the compute logically correct), but our current
LayerNormFusion doesn't support the new pattern. This PR adds support for
it.
2023-01-04 17:51:14 +08:00
Yi Zhang
f864b54393
Use today's cache only (#14120)
### Description
Add date value of today into the cache key.

### Motivation and Context
Microsoft-hosted agents have only 10 GB for builds.
To limit cache size, the pipeline only uses caches generated today.
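The idea can be sketched in a few lines (the key format is illustrative, not the pipeline's actual key): scoping the key to today's date means yesterday's cache is simply never hit, so total cache size stays bounded.

```python
from datetime import date, datetime, timezone

def cache_key(prefix="ccache", today=None):
    # Embed today's date so stale caches age out instead of growing forever.
    today = today or datetime.now(timezone.utc).date()
    return f"{prefix}-{today.isoformat()}"
```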
2023-01-04 17:48:52 +08:00
dependabot[bot]
bdeba4e31c
Bump json5 from 1.0.1 to 1.0.2 in /js (#14109) 2023-01-04 08:54:59 +00:00
Baiju Meswani
0ff61f7b97
Update torch to 1.13.1 in CI and packaging pipelines for ort training (#14055) 2023-01-03 20:03:33 -08:00
cao lei
b29a1c7348
Address follow-up comments on multistream pr #13495 (#13992)
### Description
This PR is to address follow-up comments for the multi-stream pr
https://github.com/microsoft/onnxruntime/pull/13495

Changes include:

- Make StreamAwareArena transparent to minimal build
- Make DeviceStreamCollection transparent to minimal build
- Replace ORT_MUST_USE_RESULT with [[nodiscard]]
- Remove unnecessary shared_ptr


### Motivation and Context
This PR is to address follow-up comments for the multi-stream pr
https://github.com/microsoft/onnxruntime/pull/13495

Co-authored-by: Lei Cao <leca@microsoft.com>
2023-01-03 16:33:36 -08:00
Nat Kershaw (MSFT)
a78ab4fbef
Add labeled to docs workflow trigger (#13979)
Capture the case where an issue is manually labeled
2023-01-03 15:19:22 -08:00
Ashwini Khade
68b5b2d7d3
Refactor training build options (#13964)
### Description
1. Renames all references of on-device training to training apis. This
is to keep the naming general; nothing really prevents us from using the
same apis on servers/non-edge devices.
2. Update ENABLE_TRAINING option: With this PR when this option is
enabled, training apis and torch interop is also enabled.
3. Refactoring for onnxruntime_ENABLE_TRAINING_TORCH_INTEROP option: 
   -  Removed user facing option
- Setting onnxruntime_ENABLE_TRAINING_TORCH_INTEROP to ON when
onnxruntime_ENABLE_TRAINING is ON as we always build with torch interop.

Once this PR is merged, selecting --enable_training will produce a
"FULL build" for training (with all the training entry points and
features).
Training entry points include:
1. ORTModule
2. Training APIs

Features include:
1. ATen Fallback
2. All Training OPs includes communication and collectives
3. Strided Tensor Support
4. Python Op (torch interop)
5. ONNXBlock (front-end tools for training artifact prep when using
training apis)

### Motivation and Context
The intention is to simplify the options for building training-enabled
builds. This is part of the larger work item to create a dedicated build
for learning-on-the-edge scenarios with just the training apis enabled.
2023-01-03 13:28:16 -08:00
Hariharan Seshadri
d43e0ec9ba
Misc transformer fixes (#14103)
### Description
1. SkipLayerNormalization has a new output
(https://github.com/microsoft/onnxruntime/pull/13988) and the symbolic
shape inference script needs corresponding updates

2. The greedy sampling op
(https://github.com/microsoft/onnxruntime/pull/13426) shouldn't re-use
the logits buffer as its corresponding kernel doesn't seem to support it
yet.

### Motivation and Context
Fix some transformer issues
2023-01-03 13:05:55 -08:00
Patrice Vignola
e7f9d40dde
[DML EP] Force instance norm inputs to be 4D to better target metacommands (#14020)
### Description
Force instance norm inputs to be 4D to better target metacommands

### Motivation and Context
This may improve performance on some hardware by allowing the driver to
return valid layouts to DML when querying for metacommand support.
2023-01-03 12:47:10 -08:00
Patrice Vignola
589612106a
[DML EP] Force layer norm inputs to be 4D to better target metacommands (#14022)
### Description
Force layer norm inputs to be 4D to better target metacommands

### Motivation and Context
This may improve performance on some hardware by allowing the driver to
return valid layouts to DML when querying for metacommand support.
2023-01-03 12:46:33 -08:00
RandySheriffH
587e891cae
CloudEP (#13855)
Implement CloudEP for hybrid inferencing.
The PR introduces zero new APIs; customers can configure session and run
options to do inferencing with an Azure [triton
endpoint.](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-deploy-with-triton?tabs=azure-cli%2Cendpoint)
A sample configuration in Python:

```
sess_opt.add_session_config_entry('cloud.endpoint_type', 'triton')
sess_opt.add_session_config_entry('cloud.uri', 'https://cloud.com')
sess_opt.add_session_config_entry('cloud.model_name', 'detection2')
sess_opt.add_session_config_entry('cloud.model_version', '7')  # optional, default 1
sess_opt.add_session_config_entry('cloud.verbose', '1')  # optional, default '0', meaning no verbose
...
run_opt.add_run_config_entry('use_cloud', '1')  # 0 for local inferencing, 1 for cloud endpoint
run_opt.add_run_config_entry('cloud.auth_key', '...')
...
sess.run(None, {'input': input_}, run_opt)
```

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
2023-01-03 10:03:15 -08:00
Yi Zhang
52e3fe961d
add dnnl dependency in unittest.cmake (#14104)
### Description
It's from PR #14085.
When multiple msbuilds run in parallel, it throws the exception:
```
22-12-30T16:35:34.2423207Z ##[error]C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\MSBuild\Microsoft\VC\v160\Microsoft.CppCommon.targets(155,5): Error MSB3073: The command "setlocal
"C:\Program Files\CMake\bin\cmake.exe" -E copy D:/a/_work/1/b/RelWithDebInfo/dnnl/install/bin/dnnl.dll D:/a/_work/1/b/RelWithDebInfo/RelWithDebInfo
if %errorlevel% neq 0 goto :cmEnd
:cmEnd
endlocal & call :cmErrorLevel %errorlevel% & goto :cmDone
:cmErrorLevel
exit /b %1
:cmDone
if %errorlevel% neq 0 goto :VCEnd
:VCEnd" exited with code 1.
```

https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=847423&view=logs&j=249e9d58-0012-5814-27cf-6a201adbd9cf&t=182b9780-832e-5dcb-3957-d6aa3ece582f
This change makes sure that the onnxruntime_test_all project depends on
the dnnl project.
2023-01-03 11:24:06 +08:00
cloudhan
613920d6c5
Remove all usages of hipLaunchKernelGGL (#14089)
All those macro syntaxes are a mistake, per
https://github.com/ROCm-Developer-Tools/HIP-CPU/issues/8#issuecomment-756188453;
they should have been corrected in the documentation but were not. We
moved away from hipThreadIdx_* in some previous commits; now we move away
from hipLaunchKernelGGL.
2023-01-02 12:55:44 +08:00
Tianlei Wu
6a9dc6c993
[CUDA] Update fused MHA to support flash attention and causal mask (#13953)
### Description
Update fused attention kernels to support flash attention and causal
mask (GPT-2 initial decoder run).

Note: Causal kernels are from FasterTransformer 5.2. Flash attention
kernels that are not causal are from TensorRT 8.5.1.

#### Performance Test of bert-base model

Test like the following:
```
 python -m onnxruntime.transformers.benchmark -m bert-base-cased -b 1 4 8 16 32 64 -s 512 -t 1000 -o by_script -g -p fp16 -i 3 --use_mask_index
```

Original Flash Attention is from
https://github.com/HazyResearch/flash-attention. RemovePadding and
RestorePadding are added before/after the original flash attention, but
not in this PR, so the result is not an apples-to-apples comparison. It
is included for reference only.

Average latency (ms) of float16 bert-base-cased model:

* A100

Kernel | b1_s512 | b4_s512 | b8_s512 | b16_s512 | b32_s512 | b64_s512 | b128_s512
-- | -- | -- | -- | -- | -- | -- | --
Unfused | 1.83 | 5.00 | 9.31 | 17.76 | 34.47 | 67.43 | 133.38
TRT Fused | 2.05 | 3.58 | 5.70 | 10.96 | 21.22 | 41.23 | 80.56
Flash Attention (from FT) | 1.43 | 3.20 | 5.71 | 10.95 | 22.19 | 42.96 | 84.54
Flash Attention (from TRT) | 1.44 | 3.28 | 5.70 | 10.86 | 21.00 | 40.56 | 79.53
Original Flash Attention | 1.81 | 4.04 | 6.82 | 13.06 | 24.62 | 46.58 | 91.10

* T4

Kernel | b1_s512 | b4_s512 | b8_s512 | b16_s512 | b32_s512 | b64_s512
-- | -- | -- | -- | -- | -- | --
Unfused | 8.17 | 29.86 | 59.56 | 115.77 | 236.66 | 461.43
Flash Attention (from FT) | 5.65 | 21.12 | 44.94 | 86.83 | 174.16 | 351.38
Flash Attention (from TRT) | 5.73 | 21.49 | 45.49 | 89.15 | 174.37 | 352.08
Original Flash Attention | 6.22 | 22.16 | 43.39 | 83.8 | 168.77 | 337.04

* V100

Kernel | b1_s512 | b4_s512 | b8_s512 | b16_s512 | b32_s512 | b64_s512
-- | -- | -- | -- | -- | -- | --
Unfused | 3.77 | 10.48 | 19.53 | 37.63 | 73.68 | 145.58
Flash Attention (from FT) | 3.21 | 8.25 | 14.95 | 28.83 | 56.28 | 111.15

#### Performance Test of GPT-2 model
Test like the following:
`
python benchmark_gpt2.py -m distilgpt2 -o --stage 1 --use_gpu -p fp16 -b
1 4 8 16 32 64 128 -s 0 --sequence_lengths 8 16 32 64 128 256 512
`
* A100

Note that flash attention is used as fused attention when
sequence_length > 128.

batch_size | sequence_length | with Fused Attention | without Fused Attention | A100 Gain
1 | 8 | 0.93 | 1 | 7.0%
4 | 8 | 0.82 | 0.88 | 6.8%
8 | 8 | 0.84 | 0.88 | 4.5%
16 | 8 | 0.92 | 0.97 | 5.2%
32 | 8 | 1.15 | 1.17 | 1.7%
64 | 8 | 1.68 | 1.72 | 2.3%
128 | 8 | 2.76 | 2.78 | 0.7%
1 | 16 | 0.95 | 0.95 | 0.0%
4 | 16 | 0.83 | 0.88 | 5.7%
8 | 16 | 0.91 | 0.97 | 6.2%
16 | 16 | 1.12 | 1.17 | 4.3%
32 | 16 | 1.67 | 1.72 | 2.9%
64 | 16 | 2.73 | 2.76 | 1.1%
128 | 16 | 4.96 | 4.95 | -0.2%
1 | 32 | 0.94 | 0.88 | -6.8%
4 | 32 | 0.91 | 0.97 | 6.2%
8 | 32 | 1.12 | 1.17 | 4.3%
16 | 32 | 1.65 | 1.71 | 3.5%
32 | 32 | 2.69 | 2.76 | 2.5%
64 | 32 | 4.86 | 4.94 | 1.6%
128 | 32 | 9.35 | 9.38 | 0.3%
1 | 64 | 0.84 | 0.88 | 4.5%
4 | 64 | 1.1 | 1.17 | 6.0%
8 | 64 | 1.64 | 1.73 | 5.2%
16 | 64 | 2.66 | 2.77 | 4.0%
32 | 64 | 4.82 | 4.97 | 3.0%
64 | 64 | 9.23 | 9.4 | 1.8%
128 | 64 | 18.54 | 19.12 | 3.0%
1 | 128 | 0.91 | 0.98 | 7.1%
4 | 128 | 1.68 | 1.74 | 3.4%
8 | 128 | 2.71 | 2.83 | 4.2%
16 | 128 | 4.85 | 5.09 | 4.7%
32 | 128 | 9.32 | 9.69 | 3.8%
64 | 128 | 18.54 | 19.44 | 4.6%
128 | 128 | 36.86 | 38.47 | 4.2%
1 | 256 | 1.15 | 1.23 | 6.5%
4 | 256 | 2.71 | 2.95 | 8.1%
8 | 256 | 4.87 | 5.3 | 8.1%
16 | 256 | 9.32 | 10.23 | 8.9%
32 | 256 | 18.6 | 20.53 | 9.4%
64 | 256 | 36.93 | 40.41 | 8.6%
128 | 256 | 72.84 | 80.14 | 9.1%
1 | 512 | 1.68 | 1.96 | 14.3%
4 | 512 | 4.9 | 6.02 | 18.6%
8 | 512 | 9.4 | 11.59 | 18.9%
16 | 512 | 18.71 | 23.05 | 18.8%
32 | 512 | 37.13 | 45.46 | 18.3%
64 | 512 | 74.04 | 89.88 | 17.6%
128 | 512 | NA | NA | NA

* T4:

batch_size | sequence_length | with Fused Attention | with Unfused Attention | T4 Gain
1 | 8 | 1.97 | 2.11 | 6.6%
4 | 8 | 2.2 | 2.25 | 2.2%
8 | 8 | 2.77 | 3.1 | 10.6%
16 | 8 | 4.17 | 4.2 | 0.7%
32 | 8 | 6.86 | 6.82 | -0.6%
64 | 8 | 14.88 | 14.92 | 0.3%
128 | 8 | 31.4 | 31.29 | -0.4%
1 | 16 | 1.61 | 1.71 | 5.8%
4 | 16 | 2.13 | 2.31 | 7.8%
8 | 16 | 3.38 | 3.67 | 7.9%
16 | 16 | 6.16 | 6.54 | 5.8%
32 | 16 | 14.16 | 14.76 | 4.1%
64 | 16 | 30.36 | 30.57 | 0.7%
128 | 16 | 63.14 | 63.57 | 0.7%
1 | 32 | 1.53 | 1.69 | 9.5%
4 | 32 | 3.34 | 3.66 | 8.7%
8 | 32 | 6.25 | 6.64 | 5.9%
16 | 32 | 14.12 | 14.9 | 5.2%
32 | 32 | 28.96 | 29.82 | 2.9%
64 | 32 | 61.07 | 61.77 | 1.1%
128 | 32 | 116.38 | 117.98 | 1.4%
1 | 64 | 2.01 | 2.21 | 9.0%
4 | 64 | 6.18 | 6.67 | 7.3%
8 | 64 | 13.72 | 14.49 | 5.3%
16 | 64 | 28.71 | 29.83 | 3.8%
32 | 64 | 58.65 | 60.68 | 3.3%
64 | 64 | 113.09 | 113.17 | 0.1%
128 | 64 | 205.21 | 209.4 | 2.0%
1 | 128 | 3.37 | 3.76 | 10.4%
4 | 128 | 13.54 | 14.85 | 8.8%
8 | 128 | 28.32 | 30.22 | 6.3%
16 | 128 | 58.16 | 62.09 | 6.3%
32 | 128 | 109.17 | 113.99 | 4.2%
64 | 128 | 198.9 | 207.1 | 4.0%
128 | 128 | 413.25 | 421.82 | 2.0%
1 | 256 | 6.33 | 7.05 | 10.2%
4 | 256 | 28.09 | 31.49 | 10.8%
8 | 256 | 57.47 | 62.76 | 8.4%
16 | 256 | 106.77 | 117.95 | 9.5%
32 | 256 | 197.02 | 208.58 | 5.5%
64 | 256 | 406.81 | 431.36 | 5.7%
128 | 256 | NA | NA | NA
1 | 512 | 13.84 | 16.32 | 15.2%
4 | 512 | NA | NA | NA
8 | 512 | NA | NA | NA
16 | 512 | NA | NA | NA
32 | 512 | NA | NA | NA
64 | 512 | NA | NA | NA
128 | 512 | NA | NA | NA

* V100:

batch_size | sequence_length | with Fused Attention | with Unfused Attention | V100 Gain
1 | 8 | 1.31 | 1.6 | 18.1%
4 | 8 | 1.17 | 1.26 | 7.1%
8 | 8 | 1.43 | 1.79 | 20.1%
16 | 8 | 2.14 | 1.96 | -9.2%
32 | 8 | 2.91 | 3.08 | 5.5%
64 | 8 | 5.32 | 5.27 | -0.9%
128 | 8 | 9.34 | 8.97 | -4.1%
1 | 16 | 1.41 | 1.58 | 10.8%
4 | 16 | 1.38 | 1.49 | 7.4%
8 | 16 | 1.81 | 2.2 | 17.7%
16 | 16 | 2.8 | 2.83 | 1.1%
32 | 16 | 4.94 | 4.99 | 1.0%
64 | 16 | 8.88 | 8.84 | -0.5%
128 | 16 | 17.35 | 17.2 | -0.9%
1 | 32 | 1.38 | 1.77 | 22.0%
4 | 32 | 1.77 | 1.93 | 8.3%
8 | 32 | 2.71 | 2.86 | 5.2%
16 | 32 | 5.03 | 4.92 | -2.2%
32 | 32 | 8.8 | 8.79 | -0.1%
64 | 32 | 17.29 | 17.23 | -0.3%
128 | 32 | 33.27 | 33.1 | -0.5%
1 | 64 | 1.67 | 1.87 | 10.7%
4 | 64 | 2.69 | 2.76 | 2.5%
8 | 64 | 4.87 | 4.94 | 1.4%
16 | 64 | 8.73 | 8.81 | 0.9%
32 | 64 | 16.92 | 17.24 | 1.9%
64 | 64 | 33 | 33.38 | 1.1%
128 | 64 | 65.33 | 65.86 | 0.8%
1 | 128 | 2.03 | 2.22 | 8.6%
4 | 128 | 4.9 | 5.04 | 2.8%
8 | 128 | 8.76 | 8.81 | 0.6%
16 | 128 | 17.06 | 17.29 | 1.3%
32 | 128 | 33.25 | 33.56 | 0.9%
64 | 128 | 65.54 | 66.5 | 1.4%
128 | 128 | 130.44 | 131.44 | 0.8%
1 | 256 | 2.78 | 2.86 | 2.8%
4 | 256 | 8.75 | 9.04 | 3.2%
8 | 256 | 17 | 17.68 | 3.8%
16 | 256 | 33.19 | 34.32 | 3.3%
32 | 256 | 65.43 | 67.86 | 3.6%
64 | 256 | 129.92 | 134.68 | 3.5%
128 | 256 | NA | NA | NA
1 | 512 | 4.95 | 5.32 | 7.0%
4 | 512 | NA | NA | NA
8 | 512 | NA | NA | NA
16 | 512 | NA | NA | NA
32 | 512 | NA | NA | NA
64 | 512 | NA | NA | NA
128 | 512 | NA | NA | NA
2022-12-31 10:33:54 -08:00
Dmitri Smirnov
5d729839b5
Support loading widechar paths on windows (#14066)
### Description
Make GetRuntimePath() and LoadDynamicLibrary() operate on
platform-specific paths

### Motivation and Context
This addresses https://github.com/microsoft/onnxruntime/issues/14063
2022-12-30 16:30:11 -08:00
Baiju Meswani
b85878953f
Fix nightly ort training ci pipeline (#14007) 2022-12-30 12:28:57 -08:00