Support cross qk in beam search for whisper model and related features
Make whisper exporting tools support cross qk and some related features,
* extra_decoding_ids
* no_speech_prob
Implement DTW kernel, unfold tensor kernel with unit test Several fix
related with multiple session running parallel, like:
* guard multihead_attention, fused_fp16_runner_
* some memory allocation with stream awareness
* add use_ep_level_unified_stream option
### Description
Improve the QNN context binary cache feature to reduce the memory
overhead and initialization time overhead.
Instead of dumping a Qnn context binary file with metadata as header, we
dump a Onnx format file with metadata inside Onnx node.
### Motivation and Context
reduce the memory overhead and initialization time overhead
### Description
this is for ORT 1.17.0 - make ORT to use ONNX release 1.15.0 branch. Eventually will update to the release tag once ONNX 1.15.0 is released
### Motivation and Context
Prepare for ORT 1.17.0 release. People can start work on new and updated ONNX ops in ORT.
---------
Signed-off-by: Liqun Fu <liqfu@microsoft.com>
### Description
Updated a couple of old links in the technical documentation that where
pointing to files present prior to the migration to
https://onnxruntime.ai/docs.
### Introduce ZeROOffloadSubscriber for ORTModule
As part of the work: integrate ORTModule with DeepSpeed stage3, this PR
mainly focus on moving original PyTorch-based (leveraging hooks) param
partition/offload implementation to ORTModule compatible implementation.
Changes include:
1. Refactor `SubscriberBase`/`SubcriberManager` to support
pre-forward/post_forward hooks.
2. Implement new `ZeROOffloadSubscriber` by re-using DeepSpeed hook
function as much as possible. Since all hook functions are defined in
`DeepSpeedZeRoOffload._register_hooks_recursively` and
`DeepSpeedZeRoOffload.setup_zero_stage3_hooks`, and the good thing is,
the closure is not complex, all hooks are referencing the owning
`DeepSpeedZeRoOffload` instance, so we can create new hook function with
`FunctionType` by binding the owning `DeepSpeedZeRoOffload` instance,
then call the new created function in subscriber's
`pre_forward_module_apply_impl` and `post_forward_module_apply_impl`
interfaces.
3. Monkey patch `DeepSpeedZeRoOffload.setup_zero_stage3_hooks` to
register the `ZeROOffloadSubscriber` for the model, then we don't need
change any code on the DeepSpeed repo (at least so far).
4. Fix the ATen embedding custom symbolic exporter function by
tolerating weights size be (0) (changed by DeepSpeed zero stage 3).
UT will be added once stage3 is fully supported.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
This PR fixes broken hyperlinks in the documentation that should lead
users to Jupyter notebooks. Currently, the hyperlinks are not working as
intended. The PR resolves this issue by updating the hyperlinks to
correctly direct users to the Jupyter notebooks.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve? -->
It fixes broken hyperlinks leading to the Jupyter notebooks.
### Description
Remove the onnxruntime-extensions submodule since it now was used via
cmake FetchContent
### Motivation and Context
The submodule relies on an outdated version of the extensions, and the
build instructions should be updated to eliminate any confusion.
OpenVINO EP ORT 5.1 Branch
Changes for the new API to take in OpenVINO Provider Options
and compatibility with OV 2023.1
### Motivation and Context
The change is required for the new API to take in OpenVINO Provider
Options
and make it seamless.
---------
Signed-off-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: saurabhintel0 <saurabh1.kale@intel.com>
Co-authored-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
### Use full qualified name for PythonOp export
Originally, when there are duplicate named torch.autograd.Function in
different module, for example:
`a.b.c.Gelu` v.s. `d.e.func.<locals>.Gelu`
We by default will throw exception to let user be aware we cannot
distinguish the two Gelu because during model export, we did not module
path. The workaround is we introduced
`ORTMODULE_SKIPPED_AUTOGRAD_FUNCTIONS` to ignore those duplicated named
Gelu that is not used by model run. This has limitations obviously for
example if two Gelus are both used in training.
This PR finds a way to construct a full qualified name.
`def _export_pt_1_10(g, n, *args, **kwargs):`
1. in exporter function, kwargs contains `name` and `module`, in the
above example:
`a.b.c.Gelu` --> name: `Gelu`, module: `a.b.c`
`d.e.func.<locals>.Gelu` --> name: `Gelu`, module: `d.e`
Using name and module is not enough to get a full qualified name, for
the second case, where `d.e` is the module path, then there is a
function called `func`, in this function, there is a local
auto.grad.Function named `Gelu`. (Many of our UT looks like this). We
can only get `d.e.Gelu`, but this is not the correct full qual name.
The reason for this: `kwargs[name]` or `n.name` only return the class's
name, not the class's full qual name. (be noted kwargs[module]` is
correct).
2. `n` is torch.Node, we can access `pyobj` to get the
torch.autograd.Function's apply method instance, then use `._self` to
get the torch.autograd.Function class. Then we can get the `module` and
`class`'s ful qual name, added together, we get the full qual name.
With the above change, we don't need use `kwargs[name]` and
`kwargs[module]` , and don't need check naming conflicting or
`ORTMODULE_SKIPPED_AUTOGRAD_FUNCTIONS` env var any more.
### Description
Enhanced SkipLayerNorm by implementing broadcasting for both CPU and
CUDA
### Motivation and Context
The input and skip tensors no longer have to be the same size which
means that it can accept data where the skip shape can be the same size
as the input shape, have a shape of {1, sequence_length, hidden_size},
or {sequence_length, hidden_size}.
---------
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Being able to leverage I/O binding for DML and registering `If` for the
DML EP allows us to avoid copying the past/present key/values back and
forth between the CPU and the GPU after every token.
This gives us a 25% performance increase for Dolly V2 with 128 tokens on
an RTX 4090.
### Description
Fixes the issue with IRFFT output dimension calculation as described in
#13236
### Motivation and Context
Please refer to #13236 for detailed description.
Specifically, [this code](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/contrib_ops/cuda/math/fft_ops.cc#L103) computes the output dimension as:
```
out_dim = in_dim * 2 - 1
```
while it should be this instead:
```
out_dim = 2 * (in_dim - 1)
```
(assuming the original signal has even number of samples, of course).
For example, if the original signal has 4 samples, then the round trip should look something like:
```
4 -> (one-sided RFFT) -> 3 (complex) -> (one-sided IRFFT) -> 4
```
with the current code the output will be a signal with 5 points.
---------
Co-authored-by: Alexey Kamenev <akamenev@nvidia.com>
Co-authored-by: Nick Geneva <nicholasgeneva@gmail.com>
### Description
Remove VS 2019 code.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
This PR adds support to cache the exported training/evaluation ONNX
model in `ORTModule`. On future runs, instead of exporting the model
again, we can pick up the model from a location on disc and run
`ORTModule` training/evaluation.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
ORT Training DRI Contribution
---------
Co-authored-by: root <root@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Prathik Rao <prathikrao@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
Co-authored-by: pengwa <pengwa@microsoft.com>
This will remove transposes that are non needed in the DML kernel. To
keep backward compatiblity, the default behavior is to set NHWC when no
attribute is set.
### Description
Disable two PERF* rules in ruff to allow better readability. Rational
commented inline. This change also removes the unused noqa directives
because of the rule change.
### Motivation and Context
Readability
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at
bottom):
* __->__ #16789
Bump ruff to 0.0.278 and fix new lint errors. I added noqa to all
existing RUF012 errors which requires mutable class variables to be
annotated with `ClassVar`, as well as all PERF issues.
Signed-off-by: Justin Chu <justinchu@microsoft.com>
### Description
This PR is includes changes in the documentation of _readmeOV.rst_ file
and also the changes in the dockerfile which enables to build ORT with
latest OpenVINO 2023.0.0
### Motivation and Context
Modified the dockerfile to incorporate the latest version of OpenVINO
(2023.0.0) for building Onnxruntime.
The changes in the PR aim to improve the overall user experience by
providing accurate and up-to-date documentation while leveraging latest
OpenVINO 2023.0.0
### Description
<!-- Describe your changes. -->
This PR adds support for rotary embeddings in decoder masked
self-attention
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
### Description
The [ONNX
standard](https://github.com/onnx/onnx/blob/main/docs/Operators.md#type-constraints-181)
permits the `Unique` operator to have `double` input tensor element
type, however this was not supported in onnxruntime. This PR enables
this kernel.
### Motivation and Context
The lack of support for `float64` forces users currently to cast to
`float32` instead. This loss of precision can be severely problematic in
feature engineering pipelines downstream of the `Unique` operator. It
would be good to prevent this by updating ORT to reflect the standard
and support `double` input tensors.
---------
Signed-off-by: Aditya Goel <agoel4512@gmail.com>
### Description
This PR includes documentation updates, providing step-by-step
instructions on how to implement the ModuleWithLoss wrapper in a
different codebase.
The documentation outlines the necessary code changes and offers
customization options based on specific requirements.
---------
Co-authored-by: Adam Louly <adamlouly@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
### Manage ORTModule options
Move all env vars that used for feature ON/OFF into runtime options for
consistent managements.
Be noted: the features' switch are assigned in 2 phases: default values,
overwritten by env vars (if specified by users). So env vars take the
highest priority when all 2 phases both given value explicitly for one
feature.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Optimize compute graph by eliminating padding in embedding.
### Motivation and Context
The computation for padding in nodes after embedding is unnecessary and
waste computation resources.
This pr just add an Optimizer of PaddingElimination to check and
eliminate the padding after embedding automatically by modifying the
graph.
### Implementation:
1. Find and check embedding node in graph.
2. Iterate the subgraph afterward the embedding node and record all the
input nodes and output nodes to this subgraph.
3. Insert 'Reshape + ShrunkenGather' to flatten each input node shape
from [batch_size, seqlen, ...] to [valid_token_without_padding, ...],
and insert 'GatherGrad + Reshape' to unflatten each output node shape
from [valid_token_without_padding, ...] to [batch_size, seqlen, ...]
---------
Co-authored-by: mindest <linminuser@gmail.com>
### Description
This PR enables execution of subgraphs in OVEP and currently, when OVEP
developers install the onnxruntime-openvino package on windows from
pypi, they would have to additionally download OpenVINO windows binaries
and run the setupvars.bat script which sets the environment PATH to
locate the OV dll's. Also this PR fixes issues of OVEP windows io buffer
sample.
### Motivation and Context
Fix: We want to make the user experience easy for OVEP Python developers
on windows platform.
This fix, introduces a function add_openvino_libs_to_path at the
location tools/python/util/add_openvino_win_libs.py.
The above function, can be called by OVEP python users in the
application code and that takes care of setting
the OpenVINO dll's to the path from the OpenVINO pypi packge (openvino)
which was installed.
This change also makes sure that add_openvino_libs_to_path() function is
added to onnxruntime python package
only when it is build for OpenVINO Execution Provider for ONNXRuntime
and not for default ORT python package builds.
New user experience for Python OVEP developers on windows platform:
step 1: pip install onnxruntime-openvino
step 2: pip install openvino
step 3: <Add these 2 lines in the application code>
import onnxruntime.tools.add_openvino_win_libs as utils
utils.add_openvino_libs_to_path()
---------
Signed-off-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
This fixes the type lists used to register DML kernels for Microsoft
domain QuantizeLinear and DequantizeLinear. These previously did not
include FP16 and incorrectly used the same type list for both operators.
The new type lists are the same as opset 19 ONNX which aren't
implemented yet in the DML EP.
### Enhance StatisticsSubscriber
There are few improvements for `StatisticsSubscriber`:
- Reduce peak memory impact for tensors (having many many many elements,
consuming too much GPU memory, causing original recipe run failed with
OOM), by split the statistics into two phases (split into buckets, and
merge result across buckets).
- Allow dump intermediate tensors. Originally only nn.Module forward()'s
return value are dumped, there are requirements we want to inspect some
specific intermediate tensor in the forward() function, now we support
it.
- Add documents for collecting dumps on multiple ranks
Docs link on this branch for better view:
https://github.com/microsoft/onnxruntime/blob/pengwa/conv_tool_v2/docs/ORTModule_Convergence_Notes.md
---------
Co-authored-by: mindest <30493312+mindest@users.noreply.github.com>
### Description
The PR implements FloatE4M3FN, FloatE5M2, FloatE4MEFNUZ, FloatE5M2FNUZ
as described in PR https://github.com/onnx/onnx/pull/4805. It uses CUDA
API to cast float/half to float8 if CUDA>=11.8, a custom implementation
if CUDA<11.8.
* It implements, Cast, QuantizeLinear, DequantizeLinear for all types on
CPU, only for types FloatE4M3FN, FloatE5M2 on CUDA.
* It extends the supported types for control flow operator, Shape,
Reshape, Identity, If, Loop, Scan, Reshape
* It implements Equal(19).
* Cast, QuantizeLinear, DequantizeLinear operators now support a
parameter `saturate` only valid for float 8 types. It is true by
default. In that case, any value out of range is converted into the
maximum float 8 value. If false, it is infinite.
* QuantizeLinear, DequantizeLinear now supports multiple scales on CUDA
(and ROCm by extension), scale = 1D tensor with one scale per channel
### Motivation and Context
Supports latest onnx version.
Fixes
[AB#15395](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/15395)
---------
Co-authored-by: Xavier Dupre <xadupre@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>
### Description
<!-- Describe your changes. -->
Pad18 adds the `axes` input, which is used to indicate what axes the
padding values should be applied to. Add logic to manipulate paddings
into DML padding operator inputs.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Linnea May <linneamay@microsoft.com>
### Enable conditional optimization on inputs
Label sparsity based optimization can be enabled depending on the input
inspection result.
So this PR introduce a conditional optimization path for ORTModule,
where we automatically detect data sparsity from label or embedding, and
enable the graph optimization accordingly without any user interaction.
This feature had a new requirement of delaying passing pre_grad graph
transformation config to OrtModuleGraphBuilder, from `Initialize` phase
to its `Build` phase. Because once after `_initialize_graph_builder` we
can detect the input sparsity, and make a decision to enable the
label/embed sparisty based graph optimizations.
Add UT cases for label/embed input runtime inspector.
* graph tools update
* cuda kernel update
* operator spec update and implementation update
* greed search bug fix on wrong assumption for cross/self attention
input length
* avoid use of "" name in value info when loading graph which
historically in many model
### Description
<!-- Describe your changes. -->
Add dml registration for bitwise and, or, xor and not added in opset 18.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Linnea May <linneamay@microsoft.com>
### Description
In some scenarios, the triton written kernels are more performant than
CK or other handwritten kernels, so we implement a framework that
onnxruntime can use these triton written kernels.
This PR is to integrate triton into ort, so that ort can use kernels
that written and compiled by triton.
The main change focus on two part:
1. a build part to compile triton written kernel and combine these
kernels into libonnxruntime_providers_rocm.so
2. a loader and launcher in c++, for loading and launch triton written
kernels.
#### Build
To compile triton written kernel, add a script
`tools/ci_build/compile_triton.py`. This script will dynamic load all
kernel files, compile them, and generate `triton_kernel_infos.a` and
`triton_kernel_infos.h`.
`triton_kernel_infos.a` contains all compiled kernel instructions, this
file will be combined into libonnxruntime_providers_rocm.so, using
--whole-archive flag.
`triton_kernel_infos.h` defines a const array that contains all the
metadata for each compiled kernel. These metadata will be used for load
and launch. So this header file is included by 'triton_kernel.cu' which
defines load and launch functions.
Add a build flag in build.py and CMakeList.txt, when building rocm
provider, it will call triton_kernel build command, and generate all
necessary files.
#### C++ Load and Launch
On c++ part, we implement load and launch functions in triton_kernel.cu
and triton_kernel.h.
These two files located in `providers/cuda`, and when compiling rocm,
they will be hipified. so this part supports both cuda and rocm. But
currently we only call triton kernel in rocm.
We also implement a softmax triton op for example. Because there will
generate many kernels for different input shape of softmax, we use
TunableOp to select the best one.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Register Split18 for DirectML
Split13 was previously implemented. Split18 adds a new attribute called
"num_outputs" that must be used mutually exclusively with the "split"
input.
The "num_outputs" attribute wil split the tensor evenly (and handles odd
uneven splits). To implement, the DML split tensor just needs to be
overridden in the presence of the num_output attribute.
---------
Co-authored-by: Dwayne Robinson <dwayner@microsoft.com>
### Description
This PR enables Whisper's multitask format and allows a user to use
Whisper for multiple tasks (e.g. transcription, translation) and for
multilingual purposes (e.g. English, Spanish). This PR also removes
`attention_mask` as a required input for Whisper with beam search.
### Usage
Here is an example of how you can use Whisper for English transcription.
```
import numpy as np
import onnxruntime as ort
from datasets import load_dataset
from transformers import AutoConfig, AutoProcessor
model = "openai/whisper-tiny"
config = AutoConfig.from_pretrained(model)
processor = AutoProcessor.from_pretrained(model)
forced_decoder_ids = processor.get_decoder_prompt_ids(language="english", task="transcribe")
# forced_decoder_ids is of the format [(1, 50259), (2, 50359), (3, 50363)] and needs to be
# of the format [50258, 50259, 50359, 50363] where 50258 is the start token id
forced_decoder_ids = [config.decoder_start_token_id] + list(map(lambda token: token[1], forced_decoder_ids))
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
input_features = processor(ds[0]["audio"]["array"], return_tensors="np").input_features
inputs = {
"input_features": np.float32(input_features),
"max_length": np.array([26], dtype=np.int32),
"min_length": np.array([1], dtype=np.int32),
"num_beams": np.array([2], dtype=np.int32),
"num_return_sequences": np.array([1], dtype=np.int32),
"length_penalty": np.array([1.0], dtype=np.float32),
"repetition_penalty": np.array([1.0], dtype=np.float32),
"decoder_input_ids": np.array([forced_decoder_ids], dtype=np.int32),
}
sess = ort.InferenceSession("whisper-tiny_beamsearch.onnx", providers=["CPUExecutionProvider"])
outputs = sess.run(None, inputs)
# Print tokens and decoded output
print(outputs[0][0][0])
print(processor.decode(outputs[0][0][0]))
```
If you don't want to provide specific decoder input ids or you want
Whisper to predict the output language and task, you can set
`forced_decoder_ids = [config.decoder_start_token_id]` instead.
### Motivation and Context
As seen in the figure below from the [OpenAI Whisper
paper](https://cdn.openai.com/papers/whisper.pdf), Whisper can be used
for multiple tasks and languages.

### Description
<!-- Describe your changes. -->
V100, b_4_s_128, max_output_len=64, beam=4
before:
t5_small: 101.28ms
t5_base: 200.07ms
after:
t5_small: 87.65ms
t5_base: 174.44ms
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
Register CPU OptionalGetElement, OptionalHasElement on DirectML
Graphs with OptionalGetElement and OptionalHasElement should work in a
DML graph without extra memcpy operation on and off the GPU.
CopyCpuTensor is swapped with DataTransferManager.CopyTensor() to make
the CPU operator usable by other providers.
---------
Co-authored-by: Dwayne Robinson <dwayner@microsoft.com>
### Description
<!-- Describe your changes. -->
Add registration for DML RoiAlign-16 and tests for new
coordinate_transform_mode attribute. PR
[7354](https://github.com/microsoft/onnxruntime/pull/7354) is still open
to fix the CPU EP version, which is why there are skipped tests right
now. That will be completed separately so that, for now, we can
officially support opset16 with the next release.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Linnea May <linneamay@microsoft.com>
Co-authored-by: Dwayne Robinson <dwayner@microsoft.com>
### Description
- Update DML version to 1.11.0
- Disable Gemm+Softmax fusion
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
1. Update VERSION_NUMBER for preparing the upcoming release. This PR's
commit will not be included in the 1.15 release branch
2. Delete package/rpm/onnxruntime.spec since it was not used in past
years.
### Motivation and Context
Preparing the release.
Fixed
[AB#15311](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/15311)
### Description
<!-- Describe your changes. -->
Add registration for DML reduce functions in opset 18.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Linnea May <linneamay@microsoft.com>
### Description
This PR changes an EmbedLayerNormalization node's mask index output to
be an optional output if a mask input is not provided.
### Motivation and Context
The documentation for EmbedLayerNormalization states
```
The last input mask is optional. If mask is provided, mask index (that is position of first 0 in mask, or number of words) will be calculated.
```
However, if the mask input is not provided, the mask index output is
still calculated and required.
### Description
The PR adds VPU support to OpenVINO Execution Provider
Bug fixes for GPU, CPU.
Changes to OpenVINO Backend in Serialized Model API for faster First
Inference Latency.
Deprecation to HDDL-VADM and MYRIAD, removed code
Support OpenVINO 2023.0
Dynamic Shapes Support for iGPU
### Motivation and Context
- VPU is an upcoming hardware that can provide AI Acceleration for
Client Systems through OpenVINO
- If it fixes an open issue, please link to the issue here. -->
---------
Signed-off-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
Co-authored-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
### Description
This PR contains fusion-level and kernel-level optimizations for
[OpenAI's Whisper](https://github.com/openai/whisper).
Some of the added optimizations include:
- Pruning of duplicate/unnecessary inputs and outputs
- Fusion support for Whisper models with or without these inputs/outputs
(e.g. with these inputs/outputs if exporting with an older official
Optimum version, without these inputs/outputs if exporting with Optimum
from source)
- Attention fusions
- For Whisper's encoder and decoder
- Modified symbolic shape inference for present output when no past
input exists (for decoder)
- Multi-head attention fusions
- For Whisper's decoder and decoder with past
- Packed MatMul for the 3 MatMuls excluded in multi-head attention
fusion
- Attention kernel changes
- CPU:
- Different Q and KV sequence lengths
- Parallel memset for large sequence lengths
- Convert broadcast add after MatMul of Q and K (add_qk) to element-wise
add
- Separate present key-value output into present key and present value
(for multi-head attention spec)
- CUDA:
- Use memory efficient attention compute kernel with present state (for
decoder)
- Multi-head attention kernel changes
- CPU:
- Introduction of multi-head attention CPU kernel (previously did not
exist)
- Use AddBiasReshape instead of AddBiasTranspose when sequence length =
1 (for decoder with past)
- Different Q, K, V input shapes
- Pass past key and past value directly as key and value
- CUDA:
- Use memory efficient attention compute kernel with past and/or present
state (for decoder with past)
### Usage
To use the optimizations, run the ORT transformer optimizer script as
follows:
```
$ cd onnxruntime/onnxruntime/python/tools/transformers/
$ python3 optimizer.py --input <filename>.onnx --output <filename>.onnx --model_type bart --num_heads <number of attention heads, depends on the size of the whisper model used> --hidden_size <attention hidden size, depends on the size of the whisper model used> --use_external_data_format --use_multi_head_attention
```
Once optimized, here's an example of how to run Whisper with [Hugging
Face's Optimum](https://github.com/huggingface/optimum):
```
from transformers.onnx.utils import get_preprocessor
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from optimum.pipelines import pipeline as ort_pipeline
import whisper # Installed from OpenAI's repo - setup instructions at https://github.com/openai/whisper/
directory = './whisper_opt' # Where the optimized ONNX models are located
model_name = 'openai/whisper-tiny'
device = 'cpu'
# Get pipeline
processor = get_preprocessor(model_name)
model = ORTModelForSpeechSeq2Seq.from_pretrained(
directory,
use_io_binding=(device == 'cuda'),
provider='CPUExecutionProvider',
).to(device)
pipe = ort_pipeline(
"automatic-speech-recognition",
model=model,
tokenizer=processor.tokenizer,
feature_extractor=processor.feature_extractor,
device=(-1 if device == 'cpu' else 0),
)
# Load audio file and run pipeline
audio = whisper.load_audio('tests/jfk.flac')
audio = whisper.pad_or_trim(audio)
outputs = pipe([audio])
print(outputs)
```
Note: In order to use these changes with Optimum, it is recommended to
use Optimum from source to have the following changes:
- https://github.com/huggingface/optimum/pull/872
- https://github.com/huggingface/optimum/pull/920
### Motivation and Context
This PR helps the following issues:
- https://github.com/microsoft/onnxruntime/issues/15100
- https://github.com/microsoft/onnxruntime/issues/15235
- https://github.com/huggingface/optimum/issues/869 (work in progress)
This PR can be used with the other currently merged Whisper PRs:
- https://github.com/microsoft/onnxruntime/pull/15247
- https://github.com/microsoft/onnxruntime/pull/15339
- https://github.com/microsoft/onnxruntime/pull/15362
- https://github.com/microsoft/onnxruntime/pull/15365
- https://github.com/microsoft/onnxruntime/pull/15427
This PR uses changes from the following merged PRs:
- https://github.com/microsoft/onnxruntime/pull/14198
- https://github.com/microsoft/onnxruntime/pull/14146
- https://github.com/microsoft/onnxruntime/pull/14201
- https://github.com/microsoft/onnxruntime/pull/14928 (this introduced
the new multi-head attention spec)
### Description
Bump ruff version in CI and fixed new lint errors.
- This change enables the flake8-implicit-str-concat rules which helps
detect unintended string concatenations:
https://beta.ruff.rs/docs/rules/#flake8-implicit-str-concat-isc
- Update gitignore to include common python files that we want to
exclude.
### Motivation and Context
Code quality
### Optimize SCE loss compute
Compute optimization based on label data sparsity:
- Insert ShrunkenGather before SCELoss node, to filter out invalid
labels for compute.
- Support ShrunkenGather upstream.
- Added test for the above.
- Added flag to enable label sparsity optimization with env var, by
default disabled now. Will enable after comprehensive benchmarking
later.
- Extract common logic into test_optimizer_utils.h/cc from
core/optimizer/compute_optimzier_test.cc, then the common functions can
be shared by both core/optimizer/compute_optimzier_test.cc and
orttraining/core/optimizer/compute_optimzier_test.cc
- Extract common logic into shared_utils.h/cc: `GetONNXOpSetVersion` and
`Create1DInitializerFromVector`
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Adding 'Add' functionality to FP16 Conv operator. It takes a tensor that
has the same shape of the output tensor, and add it to the result
tensor.
### Motivation and Context
Needed to run Resnet 50
### Description
Adjust various code paths to allow Whisper model to function with
BeamSearch op.
Approach: Add a new kModelType enum value in IGenerationParameters as
so:
#### Old: 0 = GPT2, 1 = T5
#### New: 0 = GPT2, 1 = T5, 2 = Whisper
When the user assigns this attribute value to 2, various shape and type
checks are changed to accommodate Whisper inputs.
### Motivation and Context
BeamSearch is currently designed to function with BERT-based models with
inputs as vocab tokens, and needs changes to function with Whisper
inputs (3-D float values processed from audio data).
---------
Co-authored-by: Peter McAughan <petermca@microsoft.com>
### Description
<!-- Describe your changes. -->
Add a tool to convert fused BERT like model to packing mode
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
`lintrunner` is a linter runner successfully used by pytorch, onnx and
onnx-script. It provides a uniform experience running linters locally
and in CI. It supports all major dev systems: Windows, Linux and MacOs.
The checks are enforced by the `Python format` workflow.
This PR adopts `lintrunner` to onnxruntime and fixed ~2000 flake8 errors
in Python code. `lintrunner` now runs all required python lints
including `ruff`(replacing `flake8`), `black` and `isort`. Future lints
like `clang-format` can be added.
Most errors are auto-fixed by `ruff` and the fixes should be considered
robust.
Lints that are more complicated to fix are applied `# noqa` for now and
should be fixed in follow up PRs.
### Notable changes
1. This PR **removed some suboptimal patterns**:
- `not xxx in` -> `xxx not in` membership checks
- bare excepts (`except:` -> `except Exception`)
- unused imports
The follow up PR will remove:
- `import *`
- mutable values as default in function definitions (`def func(a=[])`)
- more unused imports
- unused local variables
2. Use `ruff` to replace `flake8`. `ruff` is much (40x) faster than
flake8 and is more robust. We are using it successfully in onnx and
onnx-script. It also supports auto-fixing many flake8 errors.
3. Removed the legacy flake8 ci flow and updated docs.
4. The added workflow supports SARIF code scanning reports on github,
example snapshot:

5. Removed `onnxruntime-python-checks-ci-pipeline` as redundant
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Unified linting experience in CI and local.
Replacing https://github.com/microsoft/onnxruntime/pull/14306
---------
Signed-off-by: Justin Chu <justinchu@microsoft.com>
### Description
<!-- Describe your changes. -->
As synced offline, rename this op and will create another op for mha
that supports both self and cross attention.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
### Description
<!-- Describe your changes. -->
1. upgrade cutlass to 3.0 that containing attn_bias support.
2. extend Attention/MHA to use memory efficient attention when
rel_pos_bias with [1, num_head, s, s*] and 1d mask with [2 * batch_size
+ 1] are present.
new mask format introduction:
MASK_1D_KEY_SEQ_LEN_START,
[3 * batch_size + 2] with [key_len[0], ..., key_len[batch_size - 1],
query_start[0], ..., query_start[batch_size - 1], query_end[batch_size -
1], key_start[0], ..., key_start[batch_size - 1], key_end[batch_size -
1]]
e.g
2D mask with [[1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 0]] converts to this
1D mask is [3, 5, 0, 6, 12, 0, 6, 12]
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
It potentially benefits tnlrv6 and t5(encoder)
---------
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
Co-authored-by: Kunal Vaishnavi <kvaishnavi@microsoft.com>
Co-authored-by: Kunal Vaishnavi <kvaishnavi@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
### Statistics tool for ORTModule convergence parity
As ORTModule get more and more validated, it is pretty fast to
intergrade PyTorch based model with ORT.
The same time, we need make sure once there is convergence issue, we
don't spend months of time to investigate. As part of this efforts, this
PR is introducing a tool to dump activation statistics without much
involvement from users. The dumping results contains only some statistic
numbers plus sampled data, which is not big, compared with dumping all
the tensors, it is much faster and space efficient.
For us to use it, two single lines are needed before wrapping ORTModule.
For baseline run, need also apply the same trick.
```
+ from onnxruntime.training.utils.hooks import SubscriberManager, StatisticsSubscriber
+ SubscriberManager.subscribe(model, [StatisticsSubscriber("pt_out", override_output_dir=True)])
```
Once you run the steps, following command can be used to merge result
into per-step-summary respectively for ORT and baseline runs.
```bash
python -m onnxruntime.training.utils.hooks.merge_activation_summary --pt_dir pt_out --ort_dir ort_out --output_dir /tmp/output
```
Docs is added here as part of this PR [convergence investigation
notes](https://github.com/microsoft/onnxruntime/blob/pengwa/conv_tool/docs/ORTModule_Convergence_Notes.md)
Based on the generated merged files, we can compare them with tools.

### Design and Implementation
This PR introduced a common mechanism registering custom logic for
nn.Module's post forward hooks. And statistics for activation
(StatisticsSubscriber) is one of the implementations. If there is other
needs, we can define another XXSubscriber to do the customized things.
### Description
<!-- Describe your changes. -->
Transformer models can handle batch of inputs at once. However,
sequences in a batch usually have different length. Then we have to pad
the short one to have same length as the longest. This is not efficient
especially for large batch with high variance.
This PR introduces a PackedAttention operator which can take in packed
sequences (no padding) and also produces output in packing mode.
There will be another PR to use the PackedAttention to implement the
encoder in packing mode.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
1. support optional bias in Attention op (used in T5 encoder)
2. support broadcasting rel_pos_bias in attention_softmax.h
3. add scale in
MHA op's attributes
4. support past_key/past_value and present_key/present_value in MHA
5. UT and parity tests are added
6. fix an issue: https://github.com/microsoft/onnxruntime/issues/14920
note: the fusions will be in another PR since mt5 needs to be tested and
an issue from github will be investigated.
Future works:
1. support shared buffer for past/present
2. enable trt kernels when possible and investigate (trt/cutlass)kernels
with rel_pos_bias)
3. support KV/QKV packing with past/present
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
### Description
Implements the STFT operator for the DirectML execution provider. This
is implemented as a custom op, just like the DFT kernel, because it's
implemented as a composite of two operators (DML Mul/Identity + DFT). As
such, this inherits the same restrictions as the existing DFT kernel
(requires power-of-two window sizes for now).
This change also adds a native FP16 shader to DFT so that both DFT/STFT
kernels support float16 tensors. There is no typed UAV fallback or
emulation path, so the HW _needs_ to support native float16. It also
appears the stockham shader was compiled with all optimizations disabled
and debug symbols (tsk tsk, Sheil), and this has been fixed.
This is passing all existing STFT tests (i.e. all of 1). I'm adding some
additional collateral in the Windows AI conformance tests in parallel to
check some extra cases.
---------
Co-authored-by: Patrice Vignola <vignola.patrice@gmail.com>
### Description
<!-- Describe your changes. -->
I fixed some broken links in the C API documentation, but then did a
quick pass over all of the links I could find and then fixed those.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
I got some 404's when exploring the documentation and wanted to fix it.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
Enable Opset11 Sequence Ops on DirectML, and make the CPU
implementations agnostic to backend EP
Opset 11 introduced the following sequence related operators:
- SequenceAt
- SequenceConstruct
- SequenceEmpty
- SequenceLength
- SequenceErase
- SequenceInsert
- ConcatFromSequence
With the exception of ConcatFromSequence, all of the above operators
were implemented with CPU kernels that a) required all of the contained
tensors to also be on CPU, and b) would clone each tensor into a new
sequence as a side effect of each operator. The implementation of
sequences are backend agnostic, as they dont affect actual tensor layout
or manipulate the contents of the tensors. In addition, with the
exception of SequenceAt, the other operators need not make copies of the
underlying referenced tensors.
Consequently, this change does the following:
1) Sequence* operators (except SequenceAt) no longer copies the contents
of a sequence of tensors on every kernel execution.
2) SequenceAt uses the DataTransferManager to copy tensors agnostic to
backend.
3) The internal container implemented by TensorSeq has changed from
onnxruntime::Tensor to OrtValue. This is because onnxruntime::Tensor
does not support copy or assignment construction, so it must have a
singular owner. However, is same tensor participates in multiple
containers it would have multiple container "owners" and this would not
be possible.
4) Other code that accessed values from TensorSeq have associated
changes to extract Tensors from OrtValues now.
In addition, DirectML execution was very slow when the above Sequence
operators were added to a graph, as this caused MemcpyToHost and
MemcpyFromHost kernels to be inserted between the graph and the sequence
operators. To optimize DirectML,
1) The CPU implementations for the Sequence* ops were registered as DML
implementations. Since the above changes also includes making the CPU
kernel implementations EP agnostic, the CPU kernels can be added as is.
2) The ConcatFromSequence operator needed to be implemented on DirectML.
However, there was little DirectML EP operator framework support for
operators that accept/output sequences of tensors. This change has
modified the internal COM interfaces to include new apis to interrogate
for sequence shapes, and extract the needed tensors from TensorSeq.
---------
Co-authored-by: Patrice Vignola <vignola.patrice@gmail.com>
(1) Support packed QKV format in MultiHeadAttention. This format could
avoid add bias transpose when TRT fused kernel is used.
(2) Add cache for cumulated sequence length computation. For SD, it only
need computed once since sequence length is fixed.
(3) Do not allocate qkv workspace to save memory for packed KV or QKV.
(4) Add unit tests for packed kv and packed qkv format in
MultiHeadAttention
(5) Mark some fusion options for SD only
Performance tests show slight improvement in T4. Average latency reduced
0.15 seconds (from 5.25s to 5.10s) for 512x512 in 50 steps for SD 1.5
models. Memory usage drops from 5.1GB to 4.8GB.
The third part for stable diffusion CUDA optimizations
(1) Add BiasAdd operator to replace two Add (bias and residual); Add
fusion for BiasAdd
(2) Add Attention fusion for VAE decoder.
(3) Update float16 conversion to handle Resize and GroupNorm. This could
reduce two Cast nodes for each Resize op in fp16 model.
(4) Force inputs and outputs to be float16 to avoid data casts in the
pipeline.
(5) Add options --force_fp32_ops, --inspect etc in optimize script so that
user could force some operator to run in float32 to potentially get
better image quality (with cost of performance).
Performance tests show slight improvement in T4. Average latency reduced
0.1 seconds (from 5.35s to 5.25s) for 512x512 in 50 steps.
### Description
<!-- Describe your changes. -->
1. fix a bug in relative position bias kernel where seq_len > 32
2. rename extra_add_qk to relative_position_bias
3. support relative_position_bias in multihead attention (B, N, S, S*)
4. gru_gate support by Lei
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
Co-authored-by: Lei Zhang <zhang.huanning@hotmail.com>
### Description
Reduce the cuda library size by:
1. refactoring beam_search_top_k to reduce template instantiation. It
saves ~56MB
2. opt out TopK for type uint*, int8_t and int16_t. It saves ~50MB.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Add stable diffusion CUDA kernel optimizations.
The following are included:
(1) GroupNorm operator. This kernel is from TensorRT 8.5.
(2) BiasSplitGelu operator. This kernel is modified from SplitGelu of
TensorRT 8.5. We added bias to the SplitGelu.
(3) NhwcConv operator. This adds support of NHWC format (ONNX Conv
operator uses NCHW format).
(3) Update MultiHeadAttention (packed kv and no bias) for cross
attention. This could avoid transpose of kv for TRT fused cross
attention kernel.
(4) Optimization and benchmark script
Not included:
(1) Script to convert Conv to NhwcConv in onnx graph.
(2) Update symbolic shape inference for NhwcConv.
(3) Add SeqLen2Spatial operator
(4) Documents
Limitations: GroupNorm, BiasSplitGelu and NhwcConv kernels are
implemented based on stable diffusion usage. They might not be
applicable to any input size or dimensions. For example, BiasSplitGelu
requires hidden size to be 2560 | 5120 | 10240, and NhwcConv assumes 4D
input/weight.
There is minor increasement of binary size. For SM=75 only, python
package wheel size adds (33757K - 33640K) = 117 KB. It is possible to
move NHWC from template parameter to constructor to reduce binary size
(with slight cost of performance).
Note: for RTX 4090/4080/4070 Ti, need build with CUDA 11.8 and latest
cuDNN to get best performance.
This PR registers ScatterElements-16 to the DML EP
- CPU fallback is added if the reduction attribute is in use, as this is
not yet supported by DML.
---------
Co-authored-by: Numfor Mbiziwo-Tiapo <numform@microsoft.com>
Signed-off-by: Liqun Fu <liqfu@microsoft.com>
### Description
To support LpPool (18)
### Motivation and Context
for Ort 1.14 release
Signed-off-by: Liqun Fu <liqfu@microsoft.com>
### Description
<!-- Describe your changes. -->
Remove exclusions for ONNX model tests that now pass due to kernels
being implemented.
Update ONNX update doc to point to correct location for tests.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Run as many tests as possible.
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
https://github.com/microsoft/onnxruntime/issues/12843
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
### Description
<!-- Describe your changes. -->
1. add an optional input to pass in seed
2. two UTs. one for top_p=0.5, another for top_p=0.01(create greedy
search result, in convert_generation.py)
3. fix a bug in cpu kernel
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
This PR registers ScatterND-16 to the DML EP
- CPU fallback is added if the reduction attribute is in use, as this is
not yet supported by DML.
Co-authored-by: Numfor Mbiziwo-Tiapo <numform@microsoft.com>
### Description
Changes to incorporate OpenVINO EP 2022.3
### Motivation and Context
This change is required to incorportate OpenVINO EP 2022.3
- If it fixes an open issue, please link to the issue here. -->
Co-authored-by: mohsinmx <mohsinx.mohammad@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: Aravind <aravindx.gunda@intel.com>
Co-authored-by: mayavijx <mayax.vijayan@intel.com>
Co-authored-by: flexci <mohsinmx>
### Description
<!-- Describe your changes. -->
Opset 18 Split changes. Adds ability to specify num_outputs which also
allows uneven splitting.
https://github.com/onnx/onnx/releases/tag/v1.13.0
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Support ONNX opset 18.
### Description
<!-- Describe your changes. -->
rename the CrossAttention to MultiheadAttention since this op can also
be used as self attention
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
This PR Registers Identity-16 to the DML EP.
ONNX Backend tests and optional type tests were skipped pending future
additions.
Co-authored-by: Numfor Mbiziwo-Tiapo <numform@microsoft.com>
### Description
This cuda op implements the compute_bias() method in T5 Attention
including the permutation.
note:
1. bias_table needs to be saved in col-major. be careful when
implementing fusion script
2. second input(sequence length) is placed on cpu. (using Shape node's
output should be good)
3. the first dimension of output is 1, so extra_add_qk in attention
should support broadcasting
4. compute_bias() only used in self-attn in t5
TODO: docs change will be applied later
### Motivation and Context
It's part of the process of optimizing t5 attention as well as t5 based
generation model
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
Move separated Q, K and V (without input projection) from Attention to a
new operator CrossAttention.
The Attention operator is hard to maintain when we need support with and
without input projection in one class. Add a new operator according to
feedback.
Some change might need in the future, but not in this PR:
(1) bias could be optional (We will not proceed that route unless
experiments show that fusing Add bias with MatMul instead of this op
could improve performance).
(2) support packed KV. There are two ways to support it: when key and
value are same Tensor, they are packed; or we can make value as
optional, and use packed mode when value is empty and the key has packed
K/V.
(3) support cached key and value, and other (like relative position
bias), or more attention mask format. They can be added easily without
breaking backward compatible.
(4) ROCm/CPU implementation of this op.
### Description
1. The graph pattern search introduced in
https://github.com/microsoft/onnxruntime/pull/13914/ needs to be
enhanced so that SkipLayerNormalization is supported
2. Fix fp32 parity for GPT-2 while using `SkipLayerNormalization`
fusion. The optional output of SLN needs to also include the bias (if
present) and the added output should be a sum of `input + skip + (bias)`
### Motivation and Context
Fix some breaking tests
### Description
T5 uses a layer_norm which only scales and doesn't shift, which is also
known as Root Mean Square Layer Normalization.
ORT already have the simplified_layer_norm which is the RMS layer_norm.
This PR extends this T5 layer_norm with support of skip/bias and the
residual output.
This new op is named SkipSimplifiedLayerNorm and has similar interface
as SkipLayerNorm but removes the beta as input
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
### Description
1. Renames all references of on device training to training apis. This
is to keep the naming general. Nothing really prevents us from using the
same apis on servers\non-edge devices.
2. Update ENABLE_TRAINING option: With this PR when this option is
enabled, training apis and torch interop is also enabled.
3. Refactoring for onnxruntime_ENABLE_TRAINING_TORCH_INTEROP option:
- Removed user facing option
- Setting onnxruntime_ENABLE_TRAINING_TORCH_INTEROP to ON when
onnxruntime_ENABLE_TRAINING is ON as we always build with torch interop.
Once this PR is merged when --enable_training is selected we will do a
"FULL Build" for training (with all the training entry points and
features).
Training entry points include:
1. ORTModule
2. Training APIs
Features include:
1. ATen Fallback
2. All Training OPs includes communication and collectives
3. Strided Tensor Support
4. Python Op (torch interop)
5. ONNXBlock (Front end tools for training artifacts prep when using
trianing apis)
### Motivation and Context
Intention is to simply the options for building training enabled builds.
This is part of the larger work item to create dedicated build for
learning on the edge scenarios with just training apis enabled.
### Description
<!-- Describe your changes. -->
Sampling op for cpu and cuda
support huggingface case and custom case
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
### Optimize computation orders
In `Roberta/Electra`, when `ClassificationHead` is used, there is
slicing operation on features on sequence_length dimensions, then loss
calculations only depend on this sliced data. This is a slicing at axis
1. Before slicing the shape is [batch, sequence_length, hidden], after
slicing, it becomes [batch , hidden_stage]
We had opportunities to bring this slicing earlier as much as possible,
by passing through simple elementwise ops (like Add/Div), or
Layernorm/Softmax(if their reduce axis is after the slicing axis), or
even MatMul's the left operand (if only it did not affect the last
dims).
For operators like Reshape/Transpose, it is special since they have
either data specified (after slicing we need update), or they have perm
specified, which requires the input rank remain unchanged. So for those
kinds of operators, we can remain the original rank, but just leave the
sliced dim to be 1, after the compute completed, we do a Squeeze.
```
class RobertaClassificationHead(nn.Module):
"""Head for sentence-level classification tasks."""
def __init__(self, config):
super().__init__()
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
classifier_dropout = (
config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
)
self.dropout = nn.Dropout(classifier_dropout)
self.out_proj = nn.Linear(config.hidden_size, config.num_labels)
def forward(self, features, **kwargs):
x = features[:, 0, :] # take <s> token (equiv. to [CLS])
x = self.dropout(x)
x = self.dense(x)
x = torch.tanh(x)
x = self.dropout(x)
x = self.out_proj(x)
return x
```
src\transformers\models\roberta\modeling_roberta.py
src\transformers\models\electra\modeling_electra.py
#### Benchmark
A simple benchmark shows Robeta training latency dropped from 208ms ~
199ms. 4.5+% reduction.
More comprehensive tests are on the way.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Update CUDA ArgMin/ArgMax op kernels to have end version 11 since opset
12+ is not supported yet.
With the way these kernels are currently registered, the documentation
shows support for opset 11+. This is not accurate.
### Motivation and Context
Fix#13781
This PR Registers the following operators for opset 16 to the DML EP:
- LeakyRelu-16
- PRelu-16
- Where-16
- GreaterOrEqual-16
- LessOrEqual-16
Identity-16 was not added in this PR due to pipeline failures
Co-authored-by: Numfor Mbiziwo-Tiapo <numform@microsoft.com>
Implement reuse kv_cache past and present tensor in Attention Ops.
Unit test for abover feature.
Utilize the reuse kv_cache for past and present tensor in Greedy Search.
Correctness test for it.
Co-authored-by: Zhang Lei <phill.zhang@gmail.com>
### Description
This PR adds support for `float64` kernels in the latest versions of
operators: Floor, Ceil and IsNaN.
### Motivation and Context
The lack of these kernels is non-trivial to work around and easily lead
to performance losses when it is attempted. When equivalence with an
existing implementation is required, precision is easily lost when
casting to `float32` instead.
IsNaN is common when cleaning up data in an ML pipeline. Floor and Ceil
have uses for discretising values and single-precision floats are
insufficient to round well when values get larger than a few million.
According to my measurement this only increases the binary size by a few
kilobytes (on the Python wheel of RelWithDebInfo).
Closes#13673 (Round already has float64 support)
Partially solves #8791 (Looks like there's parallel issues/PR open for
Split, but it is also hard to work around and hence useful)
Signed-off-by: jbachurski <kbachurski@gmail.com>
Split copies data - we can add support for all data types without too much binary size impact by using data type size-based implementations. The DispatchStridedCopy() function used here does this.
### Description
Add the NonZero op for DML
### Motivation and Context
NonZero is used in a few transformer models, so having a DML
implementation will stop large tensors from being transferred to the CPU
and back to the GPU
### Description
Add mixed datatype support for DML's LayerNorm contrib op.
### Motivation and Context
The fusion logic removes casts around LayerNorm in the graph because the
contrib version of the op supports mixed datatypes. Scale, Bias and
Output's datatypes must match, but input's datatype can be different.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
fix https://github.com/microsoft/onnxruntime/issues/13508
### Description
Add a DML registration for Shape to avoid copying back to the CPU just
to get the shape of a GPU tensor.
### Motivation and Context
When using free dimensions, many Transformers models extensively use the
`Shape` operator. This causes hundreds of GPU->CPU copy that should be
completely avoidable. Note that this change also uses the same
heuristics as other providers (e.g. CUDA) to force some tensors on the
CPU in certain situations.
Co-authored-by: Patrice Vignola <pavignol@microsoft.com>
### Add guidelines for ORTModule
As title.
Feel free to let me know if I missed something.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
**Description**: Subgraph-level recompute
This PR adds an optional capability trading additional re-computation
for better memory efficiency. Specifically, a pre-defined operator list
used to iterate the Graph to find some subgraphs for recompute, to
reduce some stashed activations whose lifetime across forward and
backward pass.
When training with ORTModule, by default, the graph transformer will
scan the execution graph to find all eligible subgraph to recompute,
along with sizes that can save. An example looks like below.
If we want to enable some of them to recompute, we can define env
variable this way:
`export
ORTMODULE_ENABLE_MEMORY_ALLEVIATION="Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+:1:-1,BiasGelu+:1:-1,BitmaskDropout+Cast+:1:-1,FusedMatMul+:1:-1,Cast+:1:-1,Mul+Add+:1:-1,Mul+Sub+:1:-1"`
```
[1,0]<stderr>:2,022-10-12 14:47:39.302,954,530 [W:onnxruntime:, memory_alleviation.cc:595 PrintSummary]
[1,0]<stderr>:MemoryAlleviation Summary:
[1,0]<stderr>: User config:
[1,0]<stderr>: Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+:1,BiasGelu+:1,BitmaskDropout+Cast+:1,FusedMatMul+:1,Cast+:1,Mul+Add+:1,Mul+Sub+:1
[1,0]<stderr>: =================================
[1,0]<stderr>: Subgraph: BitmaskDropout+
[1,0]<stderr>: AlleviationType: Disabled
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:input_ids_dim0 x 1,024 x Frequency:1
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: BiasGelu+
[1,0]<stderr>: AlleviationType: Recompute
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:input_ids_dim0 x input_ids_dim1 x 4,096 x Frequency:24
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: Reshape[1,0]<stderr>:+
[1,0]<stderr>: AlleviationType: Disabled
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:labels_dim0 x Frequency:1
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: Unsqueeze+Unsqueeze+Cast+Sub+Mul+Mul+FusedMatMul+Cast+Add+BiasSoftmaxDropout+Cast+
[1,0]<stderr>: AlleviationType: Disabled
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x input_ids_dim1 x Frequency:23
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+
[1,0]<stderr>: AlleviationType: Recompute
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x input_ids_dim1 x Frequency:1
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: Mul+Add+
[1,0]<stderr>: AlleviationType: Recompute
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 1 x Frequency:24
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: FusedMatMul+Cast+Add+Reshape+Cast+
[1,0]<stderr>: AlleviationType: Disabled
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 2 x 4 x Frequency:24
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: Mul+Sub+
[1,0]<stderr>: AlleviationType: Recompute
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 1 x Frequency:24
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: Cast+
[1,0]<stderr>: AlleviationType: Recompute
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:1,024 x 1,024 x Frequency:97
[1,0]<stderr>: PatternShape:3 x 1,024 x Frequency:1
[1,0]<stderr>: PatternShape:8 x 64 x Frequency:24
[1,0]<stderr>: PatternShape:1,024 x 4,096 x Frequency:24
[1,0]<stderr>: PatternShape:4,096 x Frequency:24
[1,0]<stderr>: PatternShape:4,096 x 1,024 x Frequency:24
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: FusedMatMul+
[1,0]<stderr>: AlleviationType: Recompute
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:input_ids_dim0 x input_ids_dim1 x 4,096 x Frequency:24
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: =================================
```
"Type config:" whether recompute is enabled by users. 0 - disable, 1-
enable.
"Subgraph" means what kind of subgraph will be recomputed, in this case,
it is a single node "Gelu", and it will be "Recompute".
"Shape && Frequency" means, for this recompute, one tensor of size
(batch size, 500) will be saved because it will be recomputed.
**Baseline**
On a 1P model (DEBERTA V2), sequence length 256, training with 16 A100
GPUs. With latest main branch, we can run batch size 16, and the maximum
batch size < 32. So 16 is usually chosen by data scientists. 65% of 40GB
memory is used during training. The SamplesPerSec=479.2543353561354.

**With this PR**
Gelu is recomputed for saving memory peak, batch size 32 can be run. The
97% of 40GB A100 is used, the SamplesPerSec=562.041593991271 (**1.17X**
of baseline).

**Motivation and Context**
- Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here.
Some models have QuickGelu(x)=x*sigmoid(1.702x), which has 3 Ops for
forward and 5 Ops for backward. The PR is to fuse this to a single Op
named QuickGelu and its gradient QuickGeluGrad.
For CUDA, tested in V100 using input tensor with shape [64,128,2048] and
float16 type:
Before, FW takes 335us, BW takes 614us

After, FW takes 115us, BW takes 139us, which is much faster.

For CPU kernel, using same shape and float type:
Before, FW takes 10us, BW takes 49us
Mul: 3480[µs]
Sigmoid: 1996[µs]
Mul: 4789[µs]
Mul: 4642[µs]
Mul: 4195[µs]
SigmoidGrad: 18328[µs]
Mul: 2988[µs]
Sum: 18576[µs]
After, FW takes 4us, BW takes 5us, which is also much faster.
QuickGelu: 3939[µs]
QuickGeluGrad: 5089[µs]
Co-authored-by: Vincent Wang <weicwang@microsoft.com>
### Description
<!-- Describe your changes. -->
Fix document generation CI. It's not currently updating the docs as
we're skipping the tests, which is the invocation of build.py that would
have generated the documentation.
Setup specific task to generate documentation for greater clarity.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Operator kernel documentation is not getting updated and is now out of
date.
### Description
Allow separated Q, K and V inputs to support cross attention:
* Q: [batch_size, sequence_length, hidden_size]
* K: [batch_size, kv_sequence_length, hidden_size]
* V: [batch_size, kv_sequence_length, v_hidden_size]
* Output: [batch_size, sequence_length, v_hidden_size]
To use separated Q/K/V inputs, the input tensor is for query, and two
optional inputs are added for key and value. Weights for input
projection is not included for now, so the MatMul of input projection
shall be done out of Attention operator, but Add bias is included for
performance consideration.
### Description
Bumping up version number to 1.14.0
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
Fix a bug in GreedySearch Op when batch > 1
Support custom attention mask in GreedySearch and BeamSearch with GPT2
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
fix some typo in docs
### Motivation and Context
singed vs signed
succeding vs succeeding
fileter vs filter
kernal vs kernel
libary vs library
* Change block dimension type to Int from Ints.
* In response to feedback that the block dimension corresponds to the
reduction dimension of the consuming matrix multiplication. There is
always only 1 reduction dimension.
### Description
binraries ==> binaries
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
* `QuantizeBFP` and `DequantizeBFP` schemas - similar to
`QuantizeLinear` and `DeQuantizeLinear`.
* BFP datatype is represented as a `uint8` tensor with shape and stride
metadata. This is preferrable to adding a new datatype for BFP, which is
more disruptive and [discouraged by
PyTorch](https://discuss.pytorch.org/t/training-with-custom-quantized-datatype/152132/2).
Context:
The Microsoft Floating Point (BFP) datatype shares an exponent for every
n numbers called a “bounding box.” Each number still has its own
mantissa and sign bits. BFP has been shown to incur 3-4 less cost
(energy and area) than BFloat16 and INT8 counterparts without reductions
in accuracy for the ImageNet benchmark as described in [Rouhani
2020](https://proceedings.neurips.cc/paper/2020/file/747e32ab0fea7fbd2ad9ec03daa3f840-Paper.pdf).
Requirements:
* There are many variants of BFP (number of mantissa bits, number of
shared exponent bits, size of bounding box, custom bit fields, etc.)
* The size and layout of an BFP variant varies across hardware
* bounding box can be over arbitrary dimensions; for example, for the
channel "C" dimension in a N x C x H x W tensor for convolution
Goals of this PR:
* Add initial versions of QuantizeBFP and DequantizeBFP operators to
enable QDQ-style quantization with BFP. Once the schemas stabilize, we
can consider upstreaming to ONNX.
* Add some basic type and shape inferencing tests; tests that run on an
EP will be a follow-up.
# Motivation
Currently, ORT minimal builds use kernel def hashes to map from nodes to
kernels to execute when loading the model. As the kernel def hashes must
be known ahead of time, this works for statically registered kernels.
This works well for the CPU EP.
For this approach to work, the kernel def hashes must also be known at
ORT format model conversion time, which means the EP with statically
registered kernels must also be enabled then. This is not an issue for
the always-available CPU EP. However, we do not want to require that any
EP which statically registers kernels is always available too.
Consequently, we explore another approach to match nodes to kernels that
does not rely on kernel def hashes. An added benefit of this is the
possibility of moving away from kernel def hashes completely, which
would eliminate the maintenance burden of keeping the hashes stable.
# Approach
In a full build, ORT uses some information from the ONNX op schema to
match a node to a kernel. We want to avoid including the ONNX op schema
in a minimal build to reduce binary size. Essentially, we take the
necessary information from the ONNX op schema and make it available in a
minimal build.
We decouple the ONNX op schema from the kernel matching logic. The
kernel matching logic instead relies on per-op information which can
either be obtained from the ONNX op schema or another source.
This per-op information must be available in a minimal build when there
are no ONNX op schemas. We put it in the ORT format model.
Existing uses of kernel def hashes to look up kernels are replaced
with the updated kernel matching logic. We no longer store
kernel def hashes in the ORT format model’s session state and runtime
optimization representations. We no longer keep the logic to
generate and ensure stability of kernel def hashes.
* Fix bug in pybind get_all_operator_schema due to premature reference dropping
* Add updated operator kernels markdown table
* Update build.py to include documentation generation for DML operators too
* Update GPU pipeline to include DML in the build to so operators can be generated.
* Use a separate pipeline stage, feedback from Changming and Scott
* Appease annoying Python linter
* Add onnxruntime_BUILD_UNIT_TESTS=OFF and remove stale --use_dml in cuda stage
* Make ORT as Pytorch JIT backend
LORT likely doesn't work with aten fallback so we only test LORT in its own CI.
* Revert changes to enable external CUDA allocator. Will add it later.
Revert "Revert changes to enable external CUDA allocator. Will add it later."
This reverts commit d5487f2e193014c805505afae8fb577c53667658.
Fix external allocator
* Relax tolerance and remove commented code
* Print more information in CI
* Fix pointer
* Address comments.
1. Reuse ORT-eager mode's environment.
2. Remove unused ctor.
* Use Pytorch master branch as all PRs are merged
Fix
* Refine based on cpplint feedbacks
* Revert changes to allow custom CUDA allocator in public APIs
* Use torch.testing.assert_close
* Use unittest framework
* Switch docker repo
* Rename *.cpp to *.cc
* Address comments
* Add comment
* Use same pipeline file for eager and lort pipelines
* Address comments
* Add yaml comment
* Fix cmake files
* Address comments
* Rename flags, remove printing code, remove dead comment
* Update to handle multiline declarations for the kernels which are typical these days.
* Update to new path for the cpu contrib_op kernel registrations.
* Update tools/python/find_optimizer_opset_version_updates_required.py
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>