* Fixed default dilatation value for poolgrad ops
### Description
Changed default dilatation value to 0 in poolgrad ops
### Motivation and Context
Fixes error on unit tests when --enable_training --use_dnnl flags are
active and
### Description
Disable a test with random failure in Windows GPU CI Pipeline like the
following:
```
11: [ OK ] BatchTest/BatchTest.BatchSupport/163 (0 ms)
11: [ RUN ] BatchTest/BatchTest.BatchSupport/164
11: D:\a\_work\1\s\winml\test\image\imagetests.cpp(186): error: Expected: m_model_binding.Bind(output_data_binding_name, output_video_frames) doesn't throw an exception.
11: Actual: it throws.
11: D:\a\_work\1\s\winml\test\image\imagetests.cpp(211): error: Expected: m_result = m_session.Evaluate(m_model_binding, L"") doesn't throw an exception.
11: Actual: it throws.
11: total errors is 0/2073600, errors rate is 0total errors is 0/2073600, errors rate is 0total errors is 0/2073600, errors rate is 0[ FAILED ] BatchTest/BatchTest.BatchSupport/164, where GetParam() = ((L"fns-candy_Bgr8_Batch3.onnx", 0, { L"1080.jpg", L"fish_720_Gray.png", L"fish_720.png" }, 3, false), 0, 1, 1, 1, 4-byte object <02-00 00-00>) (3203 ms)
```
Since https://github.com/microsoft/onnxruntime/pull/15468 merged to
main, about 10~15% build job failed in the test.
### Description
<!-- Describe your changes. -->
1. moved onnxruntime/contrib_ops/cuda/decoder to
onnxruntime/contrib_ops/cuda/bert
2. create utils.cuh under /bert for shared implementations in
decoder_masked_multihead_attention_impl_utils.h and
rotary_embedding_util.h
3. refactored relative_attn_bias_impl.cu by reusing the template
specializations in utils.cuh
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
fine tune cuda layernorm block size considering number of rows to
process together with column number, and hardware resources (number of
SMs, etc)
Co-authored-by: Lei Zhang <phill.zhang@gmail.com>
### Description
<!-- Describe your changes. -->
Integrate react native e2e test framework with detox.
https://wix.github.io/Detox/
Good build in CI:
https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=946695&view=results
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Write cross-platform end-to-end tests in JavaScript.
Resolve flaky e2e tests in react native ci pipelines.
---------
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
### Description
- Updates the QNN Windows ARM64 pipeline to use a new image with Visual
Studio 2022 (updated from VS 2019)
- Creates a new gtest fixture class that skips tests for the QNN CPU
backend if we detect that the QNN CPU backend is not
available/functional. The current windows arm64 vm does not support any
QNN backend.
### Motivation and Context
Visual Studio 2022 adds support for native arm64 compilation. This
pipeline will help catch any build regressions on Windows ARM64 w/ VS
2022.
### Description
This PR fixes an issue with calling the ORT transformer optimizer script
on the custom export of Whisper with beam search. It also includes the
[fix](https://github.com/microsoft/onnxruntime/pull/15616) for the GPU
out-of-memory issue.
### Motivation and Context
With this PR fix, the optimizer runs as described in the [Whisper model
optimization PR](https://github.com/microsoft/onnxruntime/pull/15473).
### Description
Reduce a number of auxillary objects created to reduce GC pressure.
Eliminate GCHandle type of memory pinning in most of the places.
Improve string marshalling by allocating unmanaged memory that does not
require pinning. Change native methods from `IntPtr` to `byte[]`
(marshalling pinning is more efficient).
Allocate input/output UTF-8 names in unmanaged heap for the lifetime of
InferenceSession. So we do not keep converting them and pinning on every
Run.
Introduce a new native API that allows to allocate and convert/copy
strings directly into a native tensor.
The PR delivers around 50% latency improvements and less GC pauses.
Inspired by: https://github.com/microsoft/onnxruntime/pull/15520
### Motivation and Context
Client experience GC pressure and performance degradation when dealing
with string tensors.
Co-Authored-By: @tannergooding
### Description
<!-- Describe your changes. -->
Add dynamic quantization support for whisper model.
There are 3 options to try out:
- quantize_embedding_layer: enable to quantize embedding layer of
decoder model or not
- quantize_per_channel: enable to quantize per channel for Gemm or
MatMul
- quantize_reduce_range: use 7bit to quantize MatMul or Gemm. Use when
hitting accuracy issue on x64 cpus without VNNI.
### Description
- Fix lintrunner configurations to always use `python` instead of
`python3`.
- Set up dependabot
- Moved dependencies to requirements-lintrunner to allow dependabot to
update it similar to https://github.com/onnx/onnx/pull/5124
### Description
Adds schema for NHWC Resize that uses the default ONNX type/shape
inferencing.
### Motivation and Context
The QNN EP requires the Resize operator to be NHWC. Currently, the
Resize operator fails type and shape inference because the current
schema changes the input to NCHW, but the `scales` and `sizes` inputs
remain in NHWC.
This PR adds a schema for NHWC Resize that allows it to use the default
ONNX type/shape inference while still remaining in the internal NHWC
domain.
### Description
<!-- Describe your changes. -->
Add Swift Package Manager (SPM) support for ORT based on #14621
- uses the existing objective-c bindings
- some re-organization of the directory structure was required but the
contents of the files are unchanged, apart from adjustments due to file
movements
Add tool for updating ORT native pod used in the SPM package
Update CIs to use ORT native pod from build, and build/test using SPM
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
iOS developers are using SPM as much as cocoapods, so adding SPM means
both are catered for.
### Description
Setting proper default value for attributes of pool operators
### Motivation and Context
Fixed AB#14719
Global pooling and pooling operators usually share the same underlying
implementation. When we detect the operator is global, code for setting
up the attributes is skipped. This may cause un-deterministic behavior.
A recent commit added an assembler check if the ASM dialect was ATT
This unfortunately broke the AMX build for systems that don't have the
ASM-ATT dialect.
This change assumes if the CMAKE_ASM-ATT_COMPILER_ID is not found and
the CMAKE_ASM_COMPILER_ID is "GNU" based on all the other already passed
checks AMX is supported by the compiler and assembler.
### Description
### Motivation and Context
On my build system the recent change to add the ASM-ATT version check
disabled AMX code from the build.
---------
Signed-off-by: George Nash <george.nash@intel.com>
### Description
https://github.com/microsoft/onnxruntime/pull/15538
Above pull request breaks Windows build on cmake 3.25 or earlier. This
should fix it.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This PR makes ORT support TRT plugin using custom ops. ORT TRT can
automatically register all TRT plugins from TRT plugins registry as
custom ops. There is no code change needed for ORT when new TRT plugins
are introduced.
Previous way for ORT to support TRT plugins was using contrib ops, but
there are some concerns about it:
- Contrib ops are shipped as part of the ORT binary by default. TRT
related plugins should not be in the default ORT.
- Contrib ops are designed for internal ops and developed for cpu and
cuda EPs.
Therefore, using custom ops is a good approach to support TRT plugins.
Followings are the major modifications:
1. Add new `GetCustomOpDomainList` provider api which allows provider to
create its own custom op domain list and ORT can register this domain
list. Provider has the responsibility to free all the custom op domain
instances it created.
2. Move OrtCustomOpDomain struct definition to
framework_provider_common.h since this struct is being used by framework
and EPs now.
3. There are several TRT plugins registered as onnx schema op through
contrib op with onnx domain. In order not to break the old models using
those TRT plugins which were registered with ONNX domain and maintain
backward compatible, we need to keep the old/legacy TRT plugins with
onnx domain. Moving forward, all newly added TRT plugins should be
registered with `trt.plugins` domain.
4. TRT plugin doesn't have an api to get number of inputs/outputs of the
registered plugins, so ORT TRT uses variadic inputs/outputs to bypass
the onnx node validation.
5. Add new trt provider option, `trt_extra_plugin_lib_paths`, user can
specify any extra plugin lib, for example,
`fastertransformer/build/lib/libvit_plugin.so` or
`fastertransformer/build/lib/libvit_plugin.so;fastertransformer/build/lib/libvit_plugin_v2.so`
### Description
This PR contains fusion-level and kernel-level optimizations for
[OpenAI's Whisper](https://github.com/openai/whisper).
Some of the added optimizations include:
- Pruning of duplicate/unnecessary inputs and outputs
- Fusion support for Whisper models with or without these inputs/outputs
(e.g. with these inputs/outputs if exporting with an older official
Optimum version, without these inputs/outputs if exporting with Optimum
from source)
- Attention fusions
- For Whisper's encoder and decoder
- Modified symbolic shape inference for present output when no past
input exists (for decoder)
- Multi-head attention fusions
- For Whisper's decoder and decoder with past
- Packed MatMul for the 3 MatMuls excluded in multi-head attention
fusion
- Attention kernel changes
- CPU:
- Different Q and KV sequence lengths
- Parallel memset for large sequence lengths
- Convert broadcast add after MatMul of Q and K (add_qk) to element-wise
add
- Separate present key-value output into present key and present value
(for multi-head attention spec)
- CUDA:
- Use memory efficient attention compute kernel with present state (for
decoder)
- Multi-head attention kernel changes
- CPU:
- Introduction of multi-head attention CPU kernel (previously did not
exist)
- Use AddBiasReshape instead of AddBiasTranspose when sequence length =
1 (for decoder with past)
- Different Q, K, V input shapes
- Pass past key and past value directly as key and value
- CUDA:
- Use memory efficient attention compute kernel with past and/or present
state (for decoder with past)
### Usage
To use the optimizations, run the ORT transformer optimizer script as
follows:
```
$ cd onnxruntime/onnxruntime/python/tools/transformers/
$ python3 optimizer.py --input <filename>.onnx --output <filename>.onnx --model_type bart --num_heads <number of attention heads, depends on the size of the whisper model used> --hidden_size <attention hidden size, depends on the size of the whisper model used> --use_external_data_format --use_multi_head_attention
```
Once optimized, here's an example of how to run Whisper with [Hugging
Face's Optimum](https://github.com/huggingface/optimum):
```
from transformers.onnx.utils import get_preprocessor
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from optimum.pipelines import pipeline as ort_pipeline
import whisper # Installed from OpenAI's repo - setup instructions at https://github.com/openai/whisper/
directory = './whisper_opt' # Where the optimized ONNX models are located
model_name = 'openai/whisper-tiny'
device = 'cpu'
# Get pipeline
processor = get_preprocessor(model_name)
model = ORTModelForSpeechSeq2Seq.from_pretrained(
directory,
use_io_binding=(device == 'cuda'),
provider='CPUExecutionProvider',
).to(device)
pipe = ort_pipeline(
"automatic-speech-recognition",
model=model,
tokenizer=processor.tokenizer,
feature_extractor=processor.feature_extractor,
device=(-1 if device == 'cpu' else 0),
)
# Load audio file and run pipeline
audio = whisper.load_audio('tests/jfk.flac')
audio = whisper.pad_or_trim(audio)
outputs = pipe([audio])
print(outputs)
```
Note: In order to use these changes with Optimum, it is recommended to
use Optimum from source to have the following changes:
- https://github.com/huggingface/optimum/pull/872
- https://github.com/huggingface/optimum/pull/920
### Motivation and Context
This PR helps the following issues:
- https://github.com/microsoft/onnxruntime/issues/15100
- https://github.com/microsoft/onnxruntime/issues/15235
- https://github.com/huggingface/optimum/issues/869 (work in progress)
This PR can be used with the other currently merged Whisper PRs:
- https://github.com/microsoft/onnxruntime/pull/15247
- https://github.com/microsoft/onnxruntime/pull/15339
- https://github.com/microsoft/onnxruntime/pull/15362
- https://github.com/microsoft/onnxruntime/pull/15365
- https://github.com/microsoft/onnxruntime/pull/15427
This PR uses changes from the following merged PRs:
- https://github.com/microsoft/onnxruntime/pull/14198
- https://github.com/microsoft/onnxruntime/pull/14146
- https://github.com/microsoft/onnxruntime/pull/14201
- https://github.com/microsoft/onnxruntime/pull/14928 (this introduced
the new multi-head attention spec)
### Description
<!-- Describe your changes. -->
Improvement with Tulrv6 on A100

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
### Description
Run clang-format in CI. Formatted all c/c++, objective-c/c++ files.
Excluded
```
'onnxruntime/core/mlas/**',
'onnxruntime/contrib_ops/cuda/bert/tensorrt_fused_multihead_attention/**',
```
because they contain assembly or is data heavy
### Motivation and Context
Coding style consistency
Add Bluestein Z-Chirp CPU EP implementation for the DFT operator
While the current DFT operator has an FFT implementation for signal
lengths of size 2^N, it currently only has a naive implementation for
completeness sake. The non-power of 2 case is very slow.
The appropriate algorithm to use here is the Bluestein Z-Chirp
algorithm, which evalutates a single DFT with 3 FFT calculations (2
forwards and 1 inverse) and a chirp signal. Luckily, the chirp signal
and one of these FFT operations can be precomputed (B).
The resulting computation performs multiple DFTs on longer signals, but
in the end is faster because the individual sub-DFT computations can
leverage the faster FFT implementation under the hood.
---------
Co-authored-by: stevenlix <38092805+stevenlix@users.noreply.github.com>
### Minor fix for differently scoped cpu_ep
cpu_ep is under `#ifndef DISABLE_CONTRIB_OPS`, but one of its usage is
not under the same condition.
```
#ifndef DISABLE_CONTRIB_OPS
const InlinedHashSet<std::string_view> cpu_ep = {onnxruntime::kCpuExecutionProvider};
#endif
```
### Motivation and Context
Postmoterm: https://github.com/microsoft/onnxruntime/pull/15461 passed
all CIs except Linux/Windows TVM CIs. I did not check the detailed error
message then because they are failed for some reason for a few days at
least. While checking the details, after PR 15461, the error messge
changes from
Before constant sharing change: TVM CI error message:
```
https://github.com/microsoft/onnxruntime/actions/runs/4700368634/jobs/8334955814
ERROR: testBooleanInputs (__main__.TestInferenceSession)
----------------------------------------------------------------------
Traceback (most recent call last):
File "onnxruntime_test_python.py", line 617, in testBooleanInputs
sess = onnxrt.InferenceSession(get_name("logicaland.onnx"), providers=available_providers)
File "D:\a\onnxruntime\onnxruntime\build\Release\Release\onnxruntime\capi\onnxruntime_inference_collection.py", line 383, in __init__
self._create_inference_session(providers, provider_options, disabled_optimizers)
File "D:\a\onnxruntime\onnxruntime\build\Release\Release\onnxruntime\capi\onnxruntime_inference_collection.py", line 435, in _create_inference_session
sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: D:\a\onnxruntime\onnxruntime\onnxruntime\core\providers\tvm\tvm_api.cc:49 onnxruntime::tvm::TVMCompile compile != nullptr was false. Unable to retrieve 'tvm_onnx_import_and_compile'.
```
to
```
D:\a\onnxruntime\onnxruntime\onnxruntime\core\optimizer\graph_transformer_utils.cc(213,67): error C2065: 'cpu_ep': undeclared identifier [D:\a\onnxruntime\onnxruntime\build\Release\onnxruntime_optimizer.vcxproj]
D:\a\onnxruntime\onnxruntime\onnxruntime\core\optimizer\graph_transformer_utils.cc(213,19): error C2672:
```
This PR fixes the build the issue, The error message of Windows/Linux
TVM CIs are back to the original ones.
After SkipLayernorm using fp32 for internal calculation and using
numeric stable algorithm, enable it for fp16 here.
Make the op_block_list a command line argument to help future tools.
Other minor changes.
### Description
Bump ruff version in CI and fixed new lint errors.
- This change enables the flake8-implicit-str-concat rules which helps
detect unintended string concatenations:
https://beta.ruff.rs/docs/rules/#flake8-implicit-str-concat-isc
- Update gitignore to include common python files that we want to
exclude.
### Motivation and Context
Code quality