### Description
This PR contains fusion-level and kernel-level optimizations for
[OpenAI's Whisper](https://github.com/openai/whisper).
Some of the added optimizations include:
- Pruning of duplicate/unnecessary inputs and outputs
- Fusion support for Whisper models with or without these inputs/outputs
  (e.g. with these inputs/outputs if exporting with an older official
  Optimum version, without these inputs/outputs if exporting with Optimum
  from source)
- Attention fusions
  - For Whisper's encoder and decoder
  - Modified symbolic shape inference for present output when no past
    input exists (for decoder)
- Multi-head attention fusions
  - For Whisper's decoder and decoder with past
  - Packed MatMul for the 3 MatMuls excluded in multi-head attention
    fusion
- Attention kernel changes
  - CPU:
    - Different Q and KV sequence lengths
    - Parallel memset for large sequence lengths
    - Convert broadcast add after MatMul of Q and K (add_qk) to element-wise
      add
    - Separate present key-value output into present key and present value
      (for multi-head attention spec)
  - CUDA:
    - Use memory efficient attention compute kernel with present state (for
      decoder)
- Multi-head attention kernel changes
  - CPU:
    - Introduction of multi-head attention CPU kernel (previously did not
      exist)
    - Use AddBiasReshape instead of AddBiasTranspose when sequence length =
      1 (for decoder with past)
    - Different Q, K, V input shapes
    - Pass past key and past value directly as key and value
  - CUDA:
    - Use memory efficient attention compute kernel with past and/or present
      state (for decoder with past)
### Usage
To use the optimizations, run the ORT transformer optimizer script as
follows:
```
$ cd onnxruntime/onnxruntime/python/tools/transformers/
$ python3 optimizer.py --input <filename>.onnx --output <filename>.onnx --model_type bart --num_heads <number of attention heads, depends on the size of the whisper model used> --hidden_size <attention hidden size, depends on the size of the whisper model used> --use_external_data_format --use_multi_head_attention
```
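As a rough guide for filling in `--num_heads` and `--hidden_size`, the sketch below maps Whisper sizes to the attention dimensions published with the OpenAI checkpoints and assembles the optimizer command. The `WHISPER_DIMS` table and the `optimizer_args` helper are illustrative names, not part of ORT; verify the values against the `config.json` of the model you exported.

```python
# Hypothetical helper: (num_heads, hidden_size) per Whisper size, taken from
# the published model dimensions. Check your exported model's config.json.
WHISPER_DIMS = {
    "tiny": (6, 384),
    "base": (8, 512),
    "small": (12, 768),
    "medium": (16, 1024),
    "large": (20, 1280),
}


def optimizer_args(size, input_path, output_path):
    """Build the optimizer.py argument list shown above for a Whisper size."""
    num_heads, hidden_size = WHISPER_DIMS[size]
    return [
        "python3", "optimizer.py",
        "--input", input_path,
        "--output", output_path,
        "--model_type", "bart",
        "--num_heads", str(num_heads),
        "--hidden_size", str(hidden_size),
        "--use_external_data_format",
        "--use_multi_head_attention",
    ]


if __name__ == "__main__":
    print(" ".join(optimizer_args("tiny", "decoder.onnx", "decoder_opt.onnx")))
```

The returned list can be passed to `subprocess.run` from the `transformers` tools directory.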
Once optimized, here's an example of how to run Whisper with [Hugging
Face's Optimum](https://github.com/huggingface/optimum):
```python
from transformers.onnx.utils import get_preprocessor
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from optimum.pipelines import pipeline as ort_pipeline

import whisper  # Installed from OpenAI's repo - setup instructions at https://github.com/openai/whisper/

directory = './whisper_opt'  # Where the optimized ONNX models are located
model_name = 'openai/whisper-tiny'
device = 'cpu'

# Get pipeline
processor = get_preprocessor(model_name)
model = ORTModelForSpeechSeq2Seq.from_pretrained(
    directory,
    use_io_binding=(device == 'cuda'),
    provider='CPUExecutionProvider',
).to(device)
pipe = ort_pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=(-1 if device == 'cpu' else 0),
)

# Load audio file and run pipeline
audio = whisper.load_audio('tests/jfk.flac')
audio = whisper.pad_or_trim(audio)
outputs = pipe([audio])
print(outputs)
```
Note: In order to use these changes with Optimum, it is recommended to
use Optimum from source to have the following changes:
- https://github.com/huggingface/optimum/pull/872
- https://github.com/huggingface/optimum/pull/920
### Motivation and Context
This PR helps with the following issues:
- https://github.com/microsoft/onnxruntime/issues/15100
- https://github.com/microsoft/onnxruntime/issues/15235
- https://github.com/huggingface/optimum/issues/869 (work in progress)
This PR can be used with the other currently merged Whisper PRs:
- https://github.com/microsoft/onnxruntime/pull/15247
- https://github.com/microsoft/onnxruntime/pull/15339
- https://github.com/microsoft/onnxruntime/pull/15362
- https://github.com/microsoft/onnxruntime/pull/15365
- https://github.com/microsoft/onnxruntime/pull/15427
This PR uses changes from the following merged PRs:
- https://github.com/microsoft/onnxruntime/pull/14198
- https://github.com/microsoft/onnxruntime/pull/14146
- https://github.com/microsoft/onnxruntime/pull/14201
- https://github.com/microsoft/onnxruntime/pull/14928 (this introduced
the new multi-head attention spec)
### Description
Improvement with Tulrv6 on A100

### Motivation and Context
---------
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
### Description
Run clang-format in CI. Formatted all C/C++ and Objective-C/C++ files.
Excluded
```
'onnxruntime/core/mlas/**',
'onnxruntime/contrib_ops/cuda/bert/tensorrt_fused_multihead_attention/**',
```
because they contain assembly or are data-heavy.
### Motivation and Context
Coding style consistency
Add Bluestein Z-Chirp CPU EP implementation for the DFT operator
While the current DFT operator has an FFT implementation for signal
lengths of size 2^N, it only has a naive implementation for other
lengths, included for completeness' sake. The non-power-of-2 case is
very slow.
The appropriate algorithm to use here is the Bluestein Z-Chirp
algorithm, which evaluates a single DFT with 3 FFT calculations (2
forward and 1 inverse) and a chirp signal. Luckily, the chirp signal
and one of these FFT operations can be precomputed (B).
The resulting computation performs multiple DFTs on longer signals, but
in the end is faster because the individual sub-DFT computations can
leverage the faster FFT implementation under the hood.
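The idea can be sketched in a few lines of NumPy. This is a minimal illustration of the textbook Bluestein algorithm, not the CPU EP kernel added by this PR; the function name `bluestein_dft` and the use of `np.fft` for the power-of-2 sub-FFTs are assumptions made for the example.

```python
import numpy as np


def bluestein_dft(x):
    """DFT of arbitrary length n via three power-of-2 FFTs (Bluestein)."""
    x = np.asarray(x, dtype=np.complex128)
    n = x.shape[0]
    if n == 0:
        return x.copy()
    k = np.arange(n)
    # Chirp w[k] = exp(-i*pi*k^2/n); k^2 is reduced mod 2n to keep the angle small.
    w = np.exp(-1j * np.pi * ((k * k) % (2 * n)) / n)
    # Smallest power of two m >= 2n - 1 so the circular convolution does not wrap.
    m = 1 << (2 * n - 2).bit_length()
    a = np.zeros(m, dtype=np.complex128)
    a[:n] = x * w
    # Conjugate chirp laid out symmetrically: b[j] = b[m - j] = conj(w[j]).
    b = np.zeros(m, dtype=np.complex128)
    b[:n] = np.conj(w)
    b[m - n + 1:] = np.conj(w[1:])[::-1]
    # fft(b) depends only on n, so it can be precomputed and reused.
    conv = np.fft.ifft(np.fft.fft(a) * np.fft.fft(b))
    return w * conv[:n]


if __name__ == "__main__":
    signal = np.random.default_rng(0).normal(size=7)
    print(np.allclose(bluestein_dft(signal), np.fft.fft(signal)))
```

The two forward FFTs and one inverse FFT all run on length `m`, which is a power of two, so a radix-2 implementation can serve any input length.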
---------
Co-authored-by: stevenlix <38092805+stevenlix@users.noreply.github.com>
### Minor fix for differently scoped cpu_ep
`cpu_ep` is declared under `#ifndef DISABLE_CONTRIB_OPS`, but one of its
usages is not under the same condition.
```
#ifndef DISABLE_CONTRIB_OPS
const InlinedHashSet<std::string_view> cpu_ep = {onnxruntime::kCpuExecutionProvider};
#endif
```
### Motivation and Context
Postmortem: https://github.com/microsoft/onnxruntime/pull/15461 passed
all CIs except the Linux/Windows TVM CIs. I did not check the detailed
error message at the time because those CIs had already been failing for
at least a few days. On closer inspection, after PR 15461 the TVM CI
error message changed from the one seen before the constant sharing
change:
```
https://github.com/microsoft/onnxruntime/actions/runs/4700368634/jobs/8334955814
ERROR: testBooleanInputs (__main__.TestInferenceSession)
----------------------------------------------------------------------
Traceback (most recent call last):
File "onnxruntime_test_python.py", line 617, in testBooleanInputs
sess = onnxrt.InferenceSession(get_name("logicaland.onnx"), providers=available_providers)
File "D:\a\onnxruntime\onnxruntime\build\Release\Release\onnxruntime\capi\onnxruntime_inference_collection.py", line 383, in __init__
self._create_inference_session(providers, provider_options, disabled_optimizers)
File "D:\a\onnxruntime\onnxruntime\build\Release\Release\onnxruntime\capi\onnxruntime_inference_collection.py", line 435, in _create_inference_session
sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: D:\a\onnxruntime\onnxruntime\onnxruntime\core\providers\tvm\tvm_api.cc:49 onnxruntime::tvm::TVMCompile compile != nullptr was false. Unable to retrieve 'tvm_onnx_import_and_compile'.
```
to
```
D:\a\onnxruntime\onnxruntime\onnxruntime\core\optimizer\graph_transformer_utils.cc(213,67): error C2065: 'cpu_ep': undeclared identifier [D:\a\onnxruntime\onnxruntime\build\Release\onnxruntime_optimizer.vcxproj]
D:\a\onnxruntime\onnxruntime\onnxruntime\core\optimizer\graph_transformer_utils.cc(213,19): error C2672:
```
This PR fixes the build issue. The error messages of the Windows/Linux
TVM CIs are back to the original ones.
Now that SkipLayernorm uses fp32 for internal calculation and a
numerically stable algorithm, enable it for fp16 here.
Make the op_block_list a command line argument to help future tools.
Other minor changes.
### Description
Bump the ruff version in CI and fix new lint errors.
- This change enables the flake8-implicit-str-concat rules, which help
detect unintended string concatenations:
https://beta.ruff.rs/docs/rules/#flake8-implicit-str-concat-isc
- Update gitignore to include common Python files that we want to
exclude.
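For context, here is a small illustration of the kind of mistake implicit string concatenation can hide. The list contents are made up for the example; the point is only that Python joins adjacent string literals without any warning.

```python
# A missing comma makes Python silently join the two adjacent literals.
providers_buggy = ["CPUExecutionProvider" "CUDAExecutionProvider"]

# With the comma restored, the list has the two intended entries.
providers_fixed = ["CPUExecutionProvider", "CUDAExecutionProvider"]

if __name__ == "__main__":
    print(len(providers_buggy), len(providers_fixed))
```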
### Motivation and Context
Code quality
### Description
Create a stream in DeviceStreamCollection for the memory pattern case to
fix thread-safety issue 15154.
### Motivation and Context
This fixes bug 15154:
https://github.com/microsoft/onnxruntime/issues/15154
### Description
This adds a few TRT options, some of which are only available on TRT
8.6:
- heuristics
- sparsity
- optimization level (8.6 only)
- auxiliary stream (8.6 only)
- tactic source selection
I am not sure yet which tests I should add for these options. As they
are mostly simple TRT flags, I am not sure to what level I should test
them. For heuristics, something similar to
44dda08b51/onnxruntime/test/providers/tensorrt/tensorrt_basic_test.cc (L510-L538)
should be possible, but for all the others we would essentially only be
testing whether there is a crash when the option is set.
Also, if I forgot an option that would be good to have, feel free to
speak up!
### Share more constant initializers.
The `ConstantSharing` transformer originally only handled single-value
initializers (scalar or 1D).
This PR shares more cases so that the common subexpression elimination
transformer can remove more duplicated nodes.
Originally, we used a single
`vector<std::variant<float,half,int32,int64>>` to store different scalar
values. In this PR, we create an unordered map whose key is
data_type + rank + element count and whose value is a vector of
`InitializerValue`.
For a specific initializer that fulfils the condition, we find the
corresponding vector of `InitializerValue` by its
<data_type + rank + element count>, then search that vector to check
whether the constant tensor already exists. A value id is then returned,
which is combined with <data_type + rank + element count> to form the
pattern key that decides which tensor to reuse (legacy code).
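The lookup described above can be sketched in Python as follows. This is an illustration of the bucketing idea only, not the C++ transformer; the `ConstantPool` class and its `pattern_key` method are hypothetical names, and NumPy arrays stand in for `InitializerValue`.

```python
import numpy as np


class ConstantPool:
    """Hypothetical sketch: bucket constants by (dtype, rank, element count)."""

    def __init__(self):
        # (dtype, rank, element_count) -> list of distinct tensors seen so far
        self._buckets = {}

    def pattern_key(self, tensor):
        arr = np.asarray(tensor)
        bucket_key = (arr.dtype.str, arr.ndim, arr.size)
        bucket = self._buckets.setdefault(bucket_key, [])
        # Linear search within the bucket for an identical tensor.
        for value_id, stored in enumerate(bucket):
            if np.array_equal(stored, arr):
                return bucket_key + (value_id,)
        bucket.append(arr.copy())
        return bucket_key + (len(bucket) - 1,)


if __name__ == "__main__":
    pool = ConstantPool()
    shape1 = np.array([128, 64], dtype=np.int64)
    shape2 = np.array([128, 64], dtype=np.int64)
    # Equal constants map to the same key, so one initializer can be reused.
    print(pool.pattern_key(shape1) == pool.pattern_key(shape2))
```

Bucketing first keeps the per-initializer search limited to tensors that could possibly match.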
### Motivation and Context
One example we see here is:
```mermaid
stateDiagram
[*] --> LayerNorm(b,s,64)
LayerNorm(b,s,64) --> Reshape1
Shape1_Const[b*s,64] --> Reshape1
LayerNorm(b,s,64) --> Reshape2
Shape2_Const[b*s,64] --> Reshape2
Reshape1 --> AttentionSubGraph
Reshape2 --> Add
AttentionSubGraph--> Add
Add --> [*]
```
Ideally, CommonSubexpressionElimination would remove one of `Reshape1`
and `Reshape2`, but because `Shape1_Const` and `Shape2_Const` are
different NodeArg*, it did not remove the duplication.
This is just one example: removing the duplication brings more
opportunities to apply graph transformations.
### Description
Add hipBLASLt to GEMM Tunable op, which supports GEMM and
StridedBatchedGEMM.
To enable the hipBLASLt implementation, add an extra flag to the build
command: `--cmake_extra_defines onnxruntime_USE_HIPBLASLT=ON`.
SkipLayerNorm fusion fuses LayerNorm and one or more Add kernels now.
While the LayerNormalization kernel allows different input and output
types by definition, SkipLayerNormalization must have the same input and
output type.
This graph is valid as the output of Add node is float16 and two inputs
from initializers are float.

But when Add and LayerNormalization are fused, it fails because the two
inputs of the Add node are of float16 type and SkipLayerNormalization
must have the same input types. To avoid this failure, this PR adds Cast
nodes before the inputs of SkipLayerNormalization when the input and
output types are different and the output type is float. The above graph
is fused as follows,

For performance, it would be better for SkipLayerNormalization to
support different input and output types, but this PR is meant to
unblock Turing NLR v5 base mode in Babel. When we have more cases, we
can support it.
Add support to use sequence as input ids in decoder inputs to Beam
Search CUDA Op
### Description
Currently the Beam Search Op is only supported for the CPU EP; this PR
adds support for the CUDA EP.
### Motivation and Context
- For Turing models, inference was throwing a segmentation fault because
a copy was failing in CUDA memory; beam search support was also not
present in CUDA.
### Description
1. Disable XNNPack EP's tests in Windows CI pipeline
The EP code has a known problem (memory alignment), but the problem does
not impact the use cases we ship the code to. Currently we only use the
XNNPack EP in mobile apps and web usages, and we already have pipelines
that cover them. We need to prioritize fixing the bugs found in those
pipelines, and there are no resources to put into this Windows one. We
can re-enable the tests once we reach an agreement on how to fix the
memory alignment bug.
2. Delete anybuild.yml which was for an already deleted pipeline.
3. Move Windows CPU pipelines to AMD CPU machine pools which are
cheaper.
4. Disable some qdq/int8 model tests that will fail if the CPU doesn't
have Intel AVX512 8-bit instructions.
### Description
Temporarily disable BatchNormalizationGrad test due to random failure.
Example:
```
2023-04-12T06:33:24.1593811Z 1: [ RUN ] GradientCheckerTest.BatchNormalizationGrad
2023-04-12T06:33:27.5603881Z 1: D:\a\_work\1\s\orttraining\orttraining\test\gradient\gradient_ops_test.cc(1468): error: Value of: IsErrorWithinTolerance(max_error, error_tolerance)
2023-04-12T06:33:27.5604509Z 1: Actual: false
2023-04-12T06:33:27.5604719Z 1: Expected: true
2023-04-12T06:33:27.5604997Z 1: max_error: 1.776702880859375; tolerance: 0.019999999552965164; ORT test random seed: 2552121240;
2023-04-12T06:33:27.5605266Z 1: Google Test trace:
2023-04-12T06:33:27.5605531Z 1: D:\a\_work\1\s\onnxruntime\test\common\tensor_op_test_utils.cc(14): ORT test random seed: 8910
2023-04-12T06:33:27.5605843Z 1: D:\a\_work\1\s\onnxruntime\test\common\tensor_op_test_utils.cc(14): ORT test random seed: 5678
2023-04-12T06:33:27.5606478Z 1: D:\a\_work\1\s\onnxruntime\test\common\tensor_op_test_utils.cc(14): ORT test random seed: 1234
2023-04-12T06:33:27.8285560Z 1: D:\a\_work\1\s\orttraining\orttraining\test\gradient\gradient_ops_test.cc(1493): error: Value of: IsErrorWithinTolerance(max_error, error_tolerance)
2023-04-12T06:33:27.8286181Z 1: Actual: false
2023-04-12T06:33:27.8286404Z 1: Expected: true
2023-04-12T06:33:27.8286669Z 1: max_error: 1.776702880859375; tolerance: 0.019999999552965164; ORT test random seed: 2552121240;
2023-04-12T06:33:27.8286942Z 1: Google Test trace:
2023-04-12T06:33:27.8287208Z 1: D:\a\_work\1\s\onnxruntime\test\common\tensor_op_test_utils.cc(14): ORT test random seed: 8910
2023-04-12T06:33:27.8287532Z 1: D:\a\_work\1\s\onnxruntime\test\common\tensor_op_test_utils.cc(14): ORT test random seed: 5678
2023-04-12T06:33:27.8287849Z 1: D:\a\_work\1\s\onnxruntime\test\common\tensor_op_test_utils.cc(14): ORT test random seed: 1234
2023-04-12T06:33:51.6368960Z 1: [ FAILED ] GradientCheckerTest.BatchNormalizationGrad (27475 ms)
```
### Description
The following three lines are needed before including some cutlass
header files, because cutlass uses the "and"/"or" keywords. Generally
this should not be a problem even without this header, but nvcc is not
strictly compliant with the C++ standard.
```c++
#ifdef __cplusplus
#include <ciso646>
#endif
```
We didn't hit this problem before because the above code existed in
absl, and we always include absl headers first. However, ABSL recently
deleted it!
https://github.com/abseil/abseil-cpp/pull/1246
The cutlass dependency was introduced in #14343, after we had abseil.
### Optimize SCE loss compute
Compute optimization based on label data sparsity:
- Insert ShrunkenGather before SCELoss node, to filter out invalid
labels for compute.
- Support ShrunkenGather upstream.
- Added test for the above.
- Added a flag to enable label sparsity optimization with an env var;
it is disabled by default for now and will be enabled after
comprehensive benchmarking.
- Extract common logic into test_optimizer_utils.h/cc from
core/optimizer/compute_optimzier_test.cc, then the common functions can
be shared by both core/optimizer/compute_optimzier_test.cc and
orttraining/core/optimizer/compute_optimzier_test.cc
- Extract common logic into shared_utils.h/cc: `GetONNXOpSetVersion` and
`Create1DInitializerFromVector`
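To show why filtering invalid labels first is safe, here is a small NumPy sketch. It is not the ORT graph transformation: plain integer indexing stands in for ShrunkenGather, and the `ignore_index=-100` convention and all function names are assumptions made for the example.

```python
import numpy as np


def per_row_nll(logits, labels):
    """Negative log-likelihood of the labelled class for each row."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels]


def dense_loss(logits, labels, ignore_index=-100):
    """Compute the loss on every row, then mask out the ignored ones."""
    valid = labels != ignore_index
    nll = per_row_nll(logits, np.where(valid, labels, 0))
    return (nll * valid).sum() / valid.sum()


def sparse_loss(logits, labels, ignore_index=-100):
    """Gather only the valid rows first, then compute the loss on those."""
    keep = np.flatnonzero(labels != ignore_index)
    return per_row_nll(logits[keep], labels[keep]).mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(8, 5))
    labels = np.array([1, -100, 3, -100, -100, 0, -100, 4])
    print(np.isclose(dense_loss(logits, labels), sparse_loss(logits, labels)))
```

Both paths give the same mean loss, but the second one runs the softmax over only the rows with valid labels, which is where the savings come from when most labels are padding.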
### Motivation and Context
### Description
The code handling variadic parameters when creating a schema for a
function has a minor bug.
The checking logic was nested inside a conditional, instead of being
outside.
Fix the logic, and add a test case. This bug manifests itself when the
first parameter in the
variadic list is not an input/output of the enclosing function.
### Motivation and Context
Fixes https://github.com/microsoft/onnxruntime/issues/15404
---------
Signed-off-by: Ganesan Ramalingam <grama@microsoft.com>
### Description
* Integrate TRT 8.6EA on relevant Linux/Windows/pkg pipelines
* Update onnx-tensorrt to 8.6
* Add new dockerfiles for TRT 8.6 and clean old ones
* Update
[CGManifest](https://github.com/microsoft/onnxruntime/tree/main/cgmanifests)
files and ort build deps version
* yml/script update
* Enable built-in TRT parser option on TRT related pipelines by default
* Exclude test TopKOperator.Top3ExplicitAxisInfinity from TRT EP tests
(8.6-EA has an issue with the TopK operator)
This change moves the DML CI pipeline to the A10 machines and fixes or
disables tests that were failing from this change.
- Max error rate threshold was increased for Image Tests
- Some failing batch tests were disabled
---------
Co-authored-by: Changming Sun <chasun@microsoft.com>
### Description
Recently, Visual Studio and Python started to provide native Windows
ARM64 packages. This PR provides better support for building on
Windows ARM64. You can build the same way as for x64, for example:
```
python tools\ci_build\build.py --config Debug --update --skip_submodule_sync --build_dir b --cmake_generator "Visual Studio 17 2022"
```
You do not need to append the "--arm64" build arg, and you do not need
to cross-compile protoc for a different arch, as you are not
cross-compiling.
**Caveat:** it does not work with the latest cmake release (3.26.x). It
only works with cmake 3.25.x and below. A bug has been filed with CMake:
https://gitlab.kitware.com/cmake/cmake/-/issues/24797
### Motivation and Context
Provide better support for building on Windows ARM64.
### Description
As title
### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/15110
---------
Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
### Description
Add a script to validate generated NPM packages and publish it to
artifacts, so that the release pipeline can use it.
Once this PR is merged, I will update the NPM package release pipeline.
### Description
Implement Optional Type metadata support in the library.
Implement optional support in C# API along with metadata.
Implement Sequence, Map, Optional test data support
and test execution.
Prune tests and provide more details for failing tests in C# code.
Note, this PR does not enable running onnx test models in C++.
### Motivation and Context
Opset18 optional type support.
Add support for kMSInternalNHWCDomain and kPytorchAtenDomain op domains to op reduction script.
Make it an error if the op reduction script encounters unknown op domains.