### Description
This change introduces the WebGPU EP into ONNX Runtime.
To make the PR as simple as possible, this PR excluded the following:
- C API changes for WebGPU EP
- actual implementation of WebGPU EP. Currently in this PR, WebGPU is a
stub implementation that does not register any kernel.
- Python IO Binding update
- Node.js IO Binding update
This PR now contains only 43 file changes (while the working branch
contains 130+) and hopefully this makes it easier to review.
There is going to be separated PRs for each mentioned above.
Current working branch: #21904
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Enabling python binding and gcc support for AIX.
### Motivation and Context
Code changes in this PR contains:
1. python binding enablement
2. gcc building support
Below are list of files and the description.
1. cmake/CMakeLists.txt
[gcc building support] -no-unused-function compiler flag addition for
IBMClang
2. cmake/external/eigen.cmake
[gcc building support] AIX check for applying the AIX patch
3. cmake/onnxruntime_python.cmake
[python binding ] putting NOT AIX check for -Xlinker
4. cmake/onnxruntime_unittests.cmake
[gcc building support] Fix for gtest behavior. Check the comment .
[python binding ] using -Wl,-brtl for linking
onnxruntime_providers_shared in test_execution_provider
5. cmake/patches/eigen/eigen-aix.patch
[gcc building support] In AIX gcc, we are hitting
__builtin_cpu_supports("mma") which is not supported yet. So patching
code for this method . Patched code will check for P10 Processor at
run-time and based on that routine will be set.
6. onnxruntime/python/onnxruntime_validation.py
[python binding ] Adding AIX check in check_distro_info()
7. onnxruntime/test/providers/cpu/generator/random_test.cc
[gcc building support] updating previous check for AIX , along with
clang. So in case of gcc, else block will hit.
8. onnxruntime/test/python/onnxruntime_test_python.py
[python binding ] powerpc check on platform.processor()
9. setup.py
[python binding ] Adding AIX check for list of libs.
### Description
Added CUDNN Frontend and used it for NHWC convolutions, and optionally
fuse activation.
#### Backward compatible
- For model existed with FusedConv, model can still run.
- If ORT is built with cuDNN 8, cuDNN frontend will not be built into
binary. Old kernels (using cudnn backend APIs) are used.
#### Major Changes
- For cuDNN 9, we will enable cudnn frontend to fuse convolution and
bias when a provider option `fuse_conv_bias=1`.
- Remove the fusion of FusedConv from graph transformer for CUDA
provider, so there will not be FusedConv be added to graph for CUDA EP
in the future.
- Update cmake files regarding to cudnn settings. The search order of
CUDNN installation in build are like the following:
* environment variable `CUDNN_PATH`
* `onnxruntime_CUDNN_HOME` cmake extra defines. If a build starts from
build.py/build.sh, user can pass it through `--cudnn_home` parameter, or
by environment variable `CUDNN_HOME` if `--cudnn_home` not used.
* cudnn python package installation directory like
python3.xx/site-packages/nvidia/cudnn
* CUDA installation path
#### Potential Issues
- If ORT is built with cuDNN 8, FusedConv fusion is no longer done
automatically, so some model might have performance regression. If user
still wants FusedConv operator for performance reason, they can still
have multiple ways to walkaround: like use older version of onnxruntime;
or use older version of ORT to save optimized onnx, then run with latest
version of ORT. We believe that majority users have moved to cudnn 9
when 1.20 release (since the default in ORT and PyTorch is cudnn 9 for 3
months when 1.20 release), so the impact is small.
- cuDNN graph uses TF32 by default, and user cannot disable TF32 through
the use_tf32 cuda provider option. If user encounters accuracy issue
(like in testing), user has to set environment variable
`NVIDIA_TF32_OVERRIDE=0` to disable TF32. Need update the document of
use_tf32 later.
#### Follow ups
This is one of PRs that target to enable NHWC convolution in CUDA EP by
default if device supports it. There are other changes will follow up to
make it possible.
(1) Enable `prefer_nhwc` by default for device with sm >= 70.
(2) Change `fuse_conv_bias=1` by default after more testing.
(3) Add other NHWC operators (like Resize or UpSample).
### Motivation and Context
The new CUDNN Frontend library provides the functionality to fuse
operations and provides new heuristics for kernel selection. Here it
fuses the convolution with the pointwise bias operation. On the [NVIDIA
ResNet50](https://pytorch.org/hub/nvidia_deeplearningexamples_resnet50/)
we get a performance boost from 49.1144 ms to 42.4643 ms per inference
on a 2560x1440 input (`onnxruntime_perf_test -e cuda -I -q -r 100-d 1 -i
'prefer_nhwc|1' resnet50.onnx`).
---------
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Maximilian Mueller <maximilianm@nvidia.com>
* Fix fallback setting (cuda still falls back to cuda).
* Fix cuda provider fallback inconsistent with/without CUDA_PATH
environment variable.
* Add cuda and cudnn major version requirement in error message.
Example result in Windows:
```
>>> import onnxruntime
>>> ort_session = onnxruntime.InferenceSession("model.onnx", providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
2024-07-19 17:43:44.2260019 [E:onnxruntime:Default, provider_bridge_ort.cc:1972 onnxruntime::TryGetProviderInfo_CUDA] D:\onnxruntime\onnxruntime\core\session\provider_bridge_ort.cc:1636 onnxruntime::ProviderLibrary::Get [ONNXRuntimeError] : 1 : FAIL : LoadLibrary failed with error 126 "" when trying to load "C:\Users\.conda\envs\py310\lib\site-packages\onnxruntime\capi\onnxruntime_providers_cuda.dll"
2024-07-19 17:43:44.2312351 [W:onnxruntime:Default, onnxruntime_pybind_state.cc:970 onnxruntime::python::CreateExecutionProviderInstance] Failed to create CUDAExecutionProvider. Require cuDNN 9.* and CUDA 12.*, and the latest MSVC runtime. Please install all dependencies as mentioned in the GPU requirements page (https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements), make sure they're in the PATH, and that your GPU is supported.
>>> ort_session
<onnxruntime.capi.onnxruntime_inference_collection.InferenceSession object at 0x0000016BB2DF7D60>
>>> ort_session.get_providers()
['CPUExecutionProvider']
```
Example result in Linux:
```
>>> import onnxruntime
>>> ort_session = onnxruntime.InferenceSession("resnet50-v2-7.onnx", providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
2024-07-20 20:33:26.486974543 [E:onnxruntime:Default, provider_bridge_ort.cc:1972 TryGetProviderInfo_CUDA] /work/onnxruntime/onnxruntime/core/session/provider_bridge_ort.cc:1636 onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_cuda.so with error: libcublasLt.so.12: cannot open shared object file: No such file or directory
2024-07-20 20:33:26.487034646 [W:onnxruntime:Default, onnxruntime_pybind_state.cc:961 CreateExecutionProviderInstance] Failed to create CUDAExecutionProvider. Require cuDNN 9.* and CUDA 12.*. Please install all dependencies as mentioned in the GPU requirements page (https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements), make sure they're in the PATH, and that your GPU is supported.
>>> ort_session.get_providers()
['CPUExecutionProvider']
```
### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/21424
### Description
As suggested by SciPy's doc, we will
`Build against NumPy 2.0.0, then it will work for all NumPy versions
with the same major version number (NumPy does maintain backwards ABI
compatibility), and as far back as NumPy 1.19 series at the time of
writing`
I think it works because in
[numpyconfig.h#L64](https://github.com/numpy/numpy/blob/main/numpy/_core/include/numpy/numpyconfig.h#L64)
there is a macro NPY_FEATURE_VERSION. By default it is set to
NPY_1_19_API_VERSION. And the NPY_FEATURE_VERSION macro controls ABI.
This PR only upgrade the build time dependency; When a user installs
ONNX Runtime, they still can use numpy 1.x.
### Motivation and Context
Recently numpy published a new version, 2.0.0, which is incompatible with the latest ONNX Runtime release.
### Description
Remove the "--enable_language_interop_ops" build flag, because the code
is incompatible with the latest numpy, and the build flag is not used
anywhere except a macOS CI pipeline. It does not seem to have a ship
plan.
### Motivation and Context
The build error was:
```
onnxruntime/core/language_interop_ops/pyop/pyop.cc:122:85: error: no member named 'elsize' in '_PyArray_Descr'
static_cast<int64_t>(PyArray_DescrFromType(type)->elsize),
~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^
```
### Description
<!-- Describe your changes. -->
- Add check for CoreML MLProgram supported ops
- Only check usability with ORT Mobile package if requested
- this package will be deprecated so info is a) of minimal value and b)
can be confusing.
- Output more things at INFO level
- a lot of meaningful info was only output at DEBUG level. The default
INFO level is more useful
- dump full partition info at DEBUG level
- Check subgraphs fully
- CoreML can handle a subgraph
- TBD if we want to add support for adding a subgraph to the parent
graph for Loop and If nodes
- most likely will be required for simple If nodes to be performant
- Check 5D CoreML limitation
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Improve helper tools
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
<!-- Describe your changes. -->
This PR supports a build of onnxruntime.xcframework for xros/xrsimulator
for visionos via the build command of
`python3 tools/ci_build/github/apple/build_apple_framework.py --config
Release/Debug
tools/ci_build/github/apple/default_vision_os_framework_build_settings.json`.
For officially include visionos in ios cocoapods package and testing in
CI, would require separate work for upgrading the Xcode version &
upgrade macOS CI agent to macos-13-arm64 or higher.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
visionos support:
https://github.com/microsoft/onnxruntime/discussions/19313
---------
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
### Description
Introducing a new class ORTPipelineModule to handle wrapping layers in
DeepSpeed pipeline parallel.
### Motivation and Context
To support pipeline parallelism on ORTModule.
This PR will include an initial support of deepspeed Pipeline
parallelism.
- [x] Support Pipeline parallel where layers are nn Modules in
Sequential.
- [ ] Support LayerSpec and TiedLayerSpec
- [ ] Enable partitioning to accept List
- [ ] Full-GPU Graph Consolidation
- [ ] Subgraph Merging for Inference
copy QNN deps when building python bindings as well.
tweak the wildcard to only copy QNN related files. latest sdk from
Qualcomm (>= 2.21) also include SNPE dll's which we don't want to
include.
### Description
Answers issue #19640
More details are in the issue, basically I am changing all the include
directory and link directory usage to CMake's `CUDA::*` targets
### Description
<!-- Describe your changes. -->
This PR adds
onnx conversion script for dynamo exported phi2,
optimization script,
and inference example script
A readme file is added as documentation.
https://github.com/microsoft/onnxruntime/tree/wangye/phi2_doc/onnxruntime/python/tools/transformers/models/phi2#readme
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
<!-- Describe your changes. -->
Refactor the VAIEP to use MSFT's standalone API
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Vitis ONNX RT VAI should switch to using the standalone API for ONNX EPs
in order to decouple the EP from onnxruntime.dll and the providers.dll.
This will help to simplify customer deployment of applications and use
cases that need to share their onnxruntime.dll with other applications.
---------
Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>
Co-authored-by: zz002 <zhenze.wang@amd.com>
### Description
- Adds graph fusions to preprocessing step that can be called before
creating a QDQ model for QNN EP.
- Fuse Erf sequence to Gelu (adapted from
[optimizer.py](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/fusion_gelu.py)).
Required by QNN EP.
- Fuse ReduceMean sequence to LayerNormaliation (adapted from
[optimizer.py](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/fusion_layernorm.py)).
Not required by QNN EP.
- Fuse ReduceL2 sequence to LpNormalization (new, specific to QNN EP).
Required by QNN EP.
Example use:
```python3
from quantization.execution_providers.qnn import get_qnn_qdq_config, qnn_preprocess_model
# Added by this PR:
model_updated = qnn_preprocess_model("model.fp32.onnx", "model.fp32.preprocessed.onnx", fuse_layernorm=True)
model_to_quantize = "model.fp32.preprocessed.onnx" if model_updated else "model.fp32.onnx"
# Quantize model ...
qnn_config = get_qnn_qdq_config(model_to_quantize, data_reader, activation_type=QuantType.QUInt16)
quantize(model_to_quantize, "model.qdq.onnx", qnn_config)
```
### Motivation and Context
Allow more models to be quantized for use with QNN EP
---------
Signed-off-by: adrianlizarraga <adlizarraga@microsoft.com>
### Description
#### 1. Adds `TensorQuantOverrides` extra option
Allows specifying a dictionary of tensor-level quantization overrides:
```
TensorQuantOverrides = dictionary :
Default is {}. Set tensor quantization overrides. The key is a tensor name and the value is a
list of dictionaries. For per-tensor quantization, the list contains a single dictionary. For
per-channel quantization, the list contains a dictionary for each channel in the tensor.
Each dictionary contains optional overrides with the following keys and values.
'quant_type' = QuantType : The tensor's quantization data type.
'scale' = Float : The scale value to use. Must also specify `zero_point` if set.
'zero_point' = Int : The zero-point value to use. Must also specify `scale` is set.
'symmetric' = Bool : If the tensor should use symmetric quantization. Invalid if also
set `scale` or `zero_point`.
'reduce_range' = Bool : If the quantization range should be reduced. Invalid if also
set `scale` or `zero_point`.
'rmax' = Float : Override the maximum real tensor value in calibration data.
Invalid if also set `scale` or `zero_point`.
'rmin' = Float : Override the minimum real tensor value in calibration data.
Invalid if also set `scale` or `zero_point`.
```
- All of the options are optional.
- Some combinations are invalid.
- Ex: `rmax` and `rmin` are unnecessary if the `zero_point` and `scale`
are also specified.
Example for per-tensor quantization overrides:
```Python3
extra_options = {
"TensorQuantOverrides": {
"SIG_OUT": [{"scale": 1.0, "zero_point": 127}],
"WGT": [{"quant_type": quantization.QuantType.QInt8, "symmetric": True, "reduce_range": True}],
"BIAS": [{"quant_type": quantization.QuantType.QInt8, "symmetric": True, "reduce_range": True}],
},
}
```
Example for per-channel quantization overrides (Conv weight and bias):
```Python3
extra_options = {
"TensorQuantOverrides": {
"WGT": [
{
"quant_type": quantization.QuantType.QUInt8,
"rmin": 0.0,
"rmax": 2.5,
"reduce_range": True,
},
{
"quant_type": quantization.QuantType.QUInt8,
"rmin": 0.2,
"rmax": 2.55,
"reduce_range": False,
},
],
"BIAS": [
{"zero_point": 0, "scale": 0.000621},
{"zero_point": 0, "scale": 0.23},
],
},
}
```
#### 2. Adds utilities to get the default QDQ configs for QNN EP
Added a `quantization.execution_providers.qnn.get_qnn_qdq_config` method
that inspects the model and returns suitable quantization
configurations.
Example usage:
```python3
from quantization import quantize, QuantType
from quantization.execution_providers.qnn import get_qnn_qdq_config
qnn_config = get_qnn_qdq_config(input_model_path,
data_reader,
activation_type=QuantType.QUInt16,
weight_type=QuantType.QUInt8)
quantize(input_model_path,
output_model_path,
qnn_config)
```
### Motivation and Context
Make it possible to create more QDQ models that run on QNN EP.
---------
Signed-off-by: adrianlizarraga <adlizarraga@microsoft.com>
### Description
Motivation for this PR is code cleanup.
1. Remove all deprecated python code related to orttrainer, old
checkpoint, related tests and utils
2. Cleanup orttraining_pybind_state.cc to remove all deprecated
bindings.
This PR is to support efficient attention and flash attention in
ORTModule, including:
- Use ATen to call efficient attention, which requires PyTorch 2.2.0 dev
or newer. ORTMODULE_USE_EFFICIENT_ATTENTION=1 to enable.
- Integrate Triton Flash attention, which requires
triton==2.0.0.dev20221202. Need A100 or H100.
ORTMODULE_USE_FLASH_ATTENTION=1 to enable.
- A python transformer tool to match sub-graph by config and write
transformer quickly.
Current transformers supports attention mask for both efficient attn and
flash attn, and dropout for efficient attn only. To support more
training scenarios (such as causal mask in GPT2), more transformers need
to be added.
The feature is guarded by system environment variables, it won't effect
any current behavior if not enabled. Since it requires specific
PyTorch/Triton versions, related tests is not added for now.
### Description
This PR adds the following scripts for LLaMA:
- LLaMA conversion (support for TorchScript and Dynamo exporters)
- LLaMA parity
- LLaMA benchmark
- LLaMA quantization
- LLaMA integration with [Hugging Face
Optimum](https://github.com/huggingface/optimum)
### Motivation and Context
This PR adds scripts for using LLaMA. There is a [follow-up
PR](https://github.com/microsoft/onnxruntime/pull/17043) for adding
scripts for Whisper.
### Motivation and Context
When we handle PyTorch models' inputs in different places (ORTModule or
others), it's common for us to flatten a structured data into a 1-D
tensor list (required by lib for example torch.onnx.export,
torch.autograd.Function.forward or ORT inference session), then do
subsequent work, then unflatten back to original hierarchy as returned
values.
DeepStage3 hooks support work also need such a lib to do similar things,
so I was proposing to extract this pair of APIs in training/utils/,
which can be more used more generally. Also a comprehensive set of test
data are used for testing unflatten/flatten in unit tests.
Let me know if you have any other suggestions.
### Refactor schema extraction and output unflattening
Move `_extract_schema` and `unflatten_user_output` in
`orttraining/orttraining/python/training/ortmodule/_io.py` . to
`extract_data_and_schema` and `unflatten_data_using_schema` in
`orttraining/orttraining/python/training/utils/torch_io_helper.py` as
shared libs, which can be used later by other features (deepspeed stage
3 hook rewrite).
While there are still a few duplicated logic handling flatten with
different task by recursively loop the data struct, will change them
step by step in case of heavy review efforts.
- Fix some warnings from Xcode build (`-Wshorten-64-to-32`).
- Enable `-Wshorten-64-to-32` warning if available. Currently it's not fully enabled for `onnxruntime_test_all` and `onnxruntime_providers_xnnpack` yet.
- Some clean up in build.py including setting CMake generator more consistently.
### Description
This PR enables execution of subgraphs in OVEP and currently, when OVEP
developers install the onnxruntime-openvino package on windows from
pypi, they would have to additionally download OpenVINO windows binaries
and run the setupvars.bat script which sets the environment PATH to
locate the OV dll's. Also this PR fixes issues of OVEP windows io buffer
sample.
### Motivation and Context
Fix: We want to make the user experience easy for OVEP Python developers
on windows platform.
This fix, introduces a function add_openvino_libs_to_path at the
location tools/python/util/add_openvino_win_libs.py.
The above function, can be called by OVEP python users in the
application code and that takes care of setting
the OpenVINO dll's to the path from the OpenVINO pypi packge (openvino)
which was installed.
This change also makes sure that add_openvino_libs_to_path() function is
added to onnxruntime python package
only when it is build for OpenVINO Execution Provider for ONNXRuntime
and not for default ORT python package builds.
New user experience for Python OVEP developers on windows platform:
step 1: pip install onnxruntime-openvino
step 2: pip install openvino
step 3: <Add these 2 lines in the application code>
import onnxruntime.tools.add_openvino_win_libs as utils
utils.add_openvino_libs_to_path()
---------
Signed-off-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
### Description
DML_PACKAGE_DIR cmake variable is not getting set properly when dml_path
build options is used.
### Motivation and Context
- Why is this change required? What problem does it solve?
It is required for DML Perf dashboard.
<!--- If it fixes an open issue, please link to the issue here. -->
### Description
Remove the "onnxruntime_BUILD_WEBASSEMBLY" cmake option. Use `if
(CMAKE_SYSTEM_NAME STREQUAL "Emscripten")` instead. It makes some code
look more nature.
For example,
```cmake
if (CMAKE_SYSTEM_NAME STREQUAL "iOS" OR CMAKE_SYSTEM_NAME STREQUAL "Android" OR onnxruntime_BUILD_WEBASSEMBLY)
```
becomes
```cmake
if (CMAKE_SYSTEM_NAME STREQUAL "iOS" OR CMAKE_SYSTEM_NAME STREQUAL "Android" OR CMAKE_SYSTEM_NAME STREQUAL "Emscripten")
```
### Description
This change will allow us building CUDA EP without installing CUDA SDK
on Windows.
### Motivation and Context
Nvidia's CUDA installer comes with a VS extension. In the past, we
require installing the extension. It is a little bit inconvenient since:
1. Visual Studio must be installed before CUDA SDK. CUDA's installer
will not install the extension if your machine doesn't have Visual
Studio.
2. We need to install CUDA SDK on our build machines, instead of just
downloading it and using it.
After this change, we will not need to install CUDA SDK on our build
machines. So it will be easier to add a support for a different CUDA
version.
Also, fix two PreFast warnings.
### Description
While building ORT for DML EP with `dml_EXTERNAL_PROJECT` flag, 2
variables (`DML_SHARED_LIB`, `DML_PACKAGE_DIR`) value is not set
properly.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
This PR adds the Whisper custom export scripts to the wheel.
### Motivation and Context
This enables access to the custom export scripts in the wheel.
### Description
This PR contains fusion-level and kernel-level optimizations for
[OpenAI's Whisper](https://github.com/openai/whisper).
Some of the added optimizations include:
- Pruning of duplicate/unnecessary inputs and outputs
- Fusion support for Whisper models with or without these inputs/outputs
(e.g. with these inputs/outputs if exporting with an older official
Optimum version, without these inputs/outputs if exporting with Optimum
from source)
- Attention fusions
- For Whisper's encoder and decoder
- Modified symbolic shape inference for present output when no past
input exists (for decoder)
- Multi-head attention fusions
- For Whisper's decoder and decoder with past
- Packed MatMul for the 3 MatMuls excluded in multi-head attention
fusion
- Attention kernel changes
- CPU:
- Different Q and KV sequence lengths
- Parallel memset for large sequence lengths
- Convert broadcast add after MatMul of Q and K (add_qk) to element-wise
add
- Separate present key-value output into present key and present value
(for multi-head attention spec)
- CUDA:
- Use memory efficient attention compute kernel with present state (for
decoder)
- Multi-head attention kernel changes
- CPU:
- Introduction of multi-head attention CPU kernel (previously did not
exist)
- Use AddBiasReshape instead of AddBiasTranspose when sequence length =
1 (for decoder with past)
- Different Q, K, V input shapes
- Pass past key and past value directly as key and value
- CUDA:
- Use memory efficient attention compute kernel with past and/or present
state (for decoder with past)
### Usage
To use the optimizations, run the ORT transformer optimizer script as
follows:
```
$ cd onnxruntime/onnxruntime/python/tools/transformers/
$ python3 optimizer.py --input <filename>.onnx --output <filename>.onnx --model_type bart --num_heads <number of attention heads, depends on the size of the whisper model used> --hidden_size <attention hidden size, depends on the size of the whisper model used> --use_external_data_format --use_multi_head_attention
```
Once optimized, here's an example of how to run Whisper with [Hugging
Face's Optimum](https://github.com/huggingface/optimum):
```
from transformers.onnx.utils import get_preprocessor
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from optimum.pipelines import pipeline as ort_pipeline
import whisper # Installed from OpenAI's repo - setup instructions at https://github.com/openai/whisper/
directory = './whisper_opt' # Where the optimized ONNX models are located
model_name = 'openai/whisper-tiny'
device = 'cpu'
# Get pipeline
processor = get_preprocessor(model_name)
model = ORTModelForSpeechSeq2Seq.from_pretrained(
directory,
use_io_binding=(device == 'cuda'),
provider='CPUExecutionProvider',
).to(device)
pipe = ort_pipeline(
"automatic-speech-recognition",
model=model,
tokenizer=processor.tokenizer,
feature_extractor=processor.feature_extractor,
device=(-1 if device == 'cpu' else 0),
)
# Load audio file and run pipeline
audio = whisper.load_audio('tests/jfk.flac')
audio = whisper.pad_or_trim(audio)
outputs = pipe([audio])
print(outputs)
```
Note: In order to use these changes with Optimum, it is recommended to
use Optimum from source to have the following changes:
- https://github.com/huggingface/optimum/pull/872
- https://github.com/huggingface/optimum/pull/920
### Motivation and Context
This PR helps the following issues:
- https://github.com/microsoft/onnxruntime/issues/15100
- https://github.com/microsoft/onnxruntime/issues/15235
- https://github.com/huggingface/optimum/issues/869 (work in progress)
This PR can be used with the other currently merged Whisper PRs:
- https://github.com/microsoft/onnxruntime/pull/15247
- https://github.com/microsoft/onnxruntime/pull/15339
- https://github.com/microsoft/onnxruntime/pull/15362
- https://github.com/microsoft/onnxruntime/pull/15365
- https://github.com/microsoft/onnxruntime/pull/15427
This PR uses changes from the following merged PRs:
- https://github.com/microsoft/onnxruntime/pull/14198
- https://github.com/microsoft/onnxruntime/pull/14146
- https://github.com/microsoft/onnxruntime/pull/14201
- https://github.com/microsoft/onnxruntime/pull/14928 (this introduced
the new multi-head attention spec)
### Statistics tool for ORTModule convergence parity
As ORTModule get more and more validated, it is pretty fast to
intergrade PyTorch based model with ORT.
The same time, we need make sure once there is convergence issue, we
don't spend months of time to investigate. As part of this efforts, this
PR is introducing a tool to dump activation statistics without much
involvement from users. The dumping results contains only some statistic
numbers plus sampled data, which is not big, compared with dumping all
the tensors, it is much faster and space efficient.
For us to use it, two single lines are needed before wrapping ORTModule.
For baseline run, need also apply the same trick.
```
+ from onnxruntime.training.utils.hooks import SubscriberManager, StatisticsSubscriber
+ SubscriberManager.subscribe(model, [StatisticsSubscriber("pt_out", override_output_dir=True)])
```
Once you run the steps, following command can be used to merge result
into per-step-summary respectively for ORT and baseline runs.
```bash
python -m onnxruntime.training.utils.hooks.merge_activation_summary --pt_dir pt_out --ort_dir ort_out --output_dir /tmp/output
```
Docs is added here as part of this PR [convergence investigation
notes](https://github.com/microsoft/onnxruntime/blob/pengwa/conv_tool/docs/ORTModule_Convergence_Notes.md)
Based on the generated merged files, we can compare them with tools.

### Design and Implementation
This PR introduced a common mechanism registering custom logic for
nn.Module's post forward hooks. And statistics for activation
(StatisticsSubscriber) is one of the implementations. If there is other
needs, we can define another XXSubscriber to do the customized things.
### Description
This is a follow-up of
https://github.com/microsoft/onnxruntime/pull/14428 for Stable Diffusion
CUDA optimizations:
(1) use NchwConv to replace Conv in onnx graph and add Tranpose nodes
accordingly
(2) reduce sequential Transpose nodes to at most one.
(3) symbolic shape infer of NchwConv
(4) fix add bias transpose which causes CUDA error (launching more than
1024 threads per block) in inferencing fp32 model.
(5) add models (bert, bart, stable_diffusion subdirectories) to package;
(6) remove option --disable_channels_last
Note that
(1) We can add a few graph transformations to reduce Transpose nodes
further. It is not done in this PR due to time limit.
(2) Stable diffusion 2.1 model outputs black images. It seems that
forcing Attention to float32 could avoid the issue. However it is much
slow to use float32 Attention.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
If a user installs the debug libraries from Python on Windows the ORT
python project file attempts to use the debug python lib, which
conflicts with a pragma in pyconfig.h that wants the release lib (due to
pybind11 undefining _DEBUG).
Explicitly use the release lib instead of Python::Module so the build
doesn't break.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fix obtuse build break.
### Description
1. Renames all references of on device training to training apis. This
is to keep the naming general. Nothing really prevents us from using the
same apis on servers\non-edge devices.
2. Update ENABLE_TRAINING option: With this PR when this option is
enabled, training apis and torch interop is also enabled.
3. Refactoring for onnxruntime_ENABLE_TRAINING_TORCH_INTEROP option:
- Removed user facing option
- Setting onnxruntime_ENABLE_TRAINING_TORCH_INTEROP to ON when
onnxruntime_ENABLE_TRAINING is ON as we always build with torch interop.
Once this PR is merged when --enable_training is selected we will do a
"FULL Build" for training (with all the training entry points and
features).
Training entry points include:
1. ORTModule
2. Training APIs
Features include:
1. ATen Fallback
2. All Training OPs includes communication and collectives
3. Strided Tensor Support
4. Python Op (torch interop)
5. ONNXBlock (Front end tools for training artifacts prep when using
trianing apis)
### Motivation and Context
Intention is to simply the options for building training enabled builds.
This is part of the larger work item to create dedicated build for
learning on the edge scenarios with just training apis enabled.
Implement CloudEP for hybrid inferencing.
The PR introduces zero new API, customers could configure session and
run options to do inferencing with Azure [triton
endpoint.](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-deploy-with-triton?tabs=azure-cli%2Cendpoint)
Sample configuration in python be like:
```
sess_opt.add_session_config_entry('cloud.endpoint_type', 'triton');
sess_opt.add_session_config_entry('cloud.uri', 'https://cloud.com');
sess_opt.add_session_config_entry('cloud.model_name', 'detection2');
sess_opt.add_session_config_entry('cloud.model_version', '7'); // optional, default 1
sess_opt.add_session_config_entry('cloud.verbose', '1'); // optional, default '0', meaning no verbose
...
run_opt.add_run_config_entry('use_cloud', '1') # 0 for local inferencing, 1 for cloud endpoint.
run_opt.add_run_config_entry('cloud.auth_key', '...')
...
sess.run(None, {'input':input_}, run_opt)
```
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
## Description
1. Convert some git submodules to cmake external projects
2. Update nsync from
[1.23.0](https://github.com/google/nsync/releases/tag/1.23.0) to
[1.25.0](https://github.com/google/nsync/releases/tag/1.25.0)
3. Update re2 from 2021-06-01 to 2022-06-01
4. Update wil from an old commit to 1.0.220914.1 tag
5. Update gtest to a newer commit so that it can optionally leverage
absl/re2 for parsing command line flags.
The following git submodules are deleted:
1. FP16
2. safeint
3. XNNPACK
4. cxxopts
5. dlpack
7. flatbuffers
8. googlebenchmark
9. json
10. mimalloc
11. mp11
12. pthreadpool
More will come.
## Motivation and Context
There are 3 ways of integrating 3rd party C/C++ libraries into ONNX
Runtime:
1. Install them to a system location, then use cmake's find_package
module to locate them.
2. Use git submodules
6. Use cmake's external projects(externalproject_add).
At first when this project was just started, we considered both option 2
and option 3. We preferred option 2 because:
1. It's easier to handle authentication. At first this project was not
open source, and it had some other non-public dependencies. If we use
git submodule, ADO will handle authentication smoothly. Otherwise we
need to manually pass tokens around and be very careful on not exposing
them in build logs.
2. At that time, cmake fetched dependencies after "cmake" finished
generating vcprojects/makefiles. So it was very difficult to make cflags
consistent. Since cmake 3.11, it has a new command: FetchContent, which
fetches dependencies when it generates vcprojects/makefiles just before
add_subdirectories, so the parent project's variables/settings can be
easily passed to the child projects.
And when the project went on, we had some new concerns:
1. As we started to have more and more EPs and build configs, the number
of submodules grew quickly. For more developers, most ORT submodules are
not relevant to them. They shouldn't need to download all of them.
2. It is impossible to let two different build configs use two different
versions of the same dependency. For example, right now we have protobuf
3.18.3 in the submodules. Then every EP must use the same version.
Whenever we have a need to upgrade protobuf, we need to coordinate
across the whole team and many external developers. I can't manage it
anymore.
3. Some projects want to manage the dependencies in a different way,
either because of their preference or because of compliance
requirements. For example, some Microsoft teams want to use vcpkg, but
we don't want to force every user of onnxruntime using vcpkg.
7. Someone wants to dynamically link to protobuf, but our build script
only does static link.
8. Hard to handle security vulnerabilities. For example, whenever
protobuf has a security patch, we have a lot of things to do. But if we
allowed people to build ORT with a different version of protobuf without
changing ORT"s source code, the customer who build ORT from source will
be able to act on such things in a quicker way. They will not need to
wait ORT having a patch release.
9. Every time we do a release, github will also publish a source file
zip file and a source file tarball for us. But they are not usable,
because they miss submodules.
### New features
After this change, users will be able to:
1. Build the dependencies in the way they want, then install them to
somewhere(for example, /usr or a temp folder).
2. Or download the dependencies by using cmake commands from these
dependencies official website
3. Similar to the above, but use your private mirrors to migrate supply
chain risks.
4. Use different versions of the dependencies, as long as our source
code is compatible with them. For example, you may use you can't use
protobuf 3.20.x as they need code changes in ONNX Runtime.
6. Only download the things the current build needs.
10. Avoid building external dependencies again and again in every build.
### Breaking change
The onnxruntime_PREFER_SYSTEM_LIB build option is removed you could think from now
it is default ON. If you don't like the new behavior, you can set FETCHCONTENT_TRY_FIND_PACKAGE_MODE to NEVER.
Besides, for who relied on the onnxruntime_PREFER_SYSTEM_LIB build
option, please be aware that this PR will change find_package calls from
Module mode to Config mode. For example, in the past if you have
installed protobuf from apt-get from ubuntu 20.04's official repo,
find_package can find it and use it. But after this PR, it won't. This
is because that protobuf version provided by Ubuntu 20.04 is too old to
support the "config mode". It can be resolved by getting a newer version
of protobuf from somewhere.
1. Update CK to its latest develop branch
2. `-mllvm -amdgpu-early-inline-all=true` is critical to CK's
performance, ensure it is properly configured.
- The flags are propagated from target `hip-lang::device`'s
`INTERFACE_COMPILE_OPTIONS`, we must not manually add the flags.
- Instead, we must ensure this target is properly configured by checking
_CMAKE_HIP_DEVICE_RUNTIME_TARGET is set.
TL,DR
`hip-lang::device` sometime will be not be properly configured if our
`CMAKE_PREFIX_PATH` is not configured carefully. In the CI docker, the
configuration is in good state, but on dev machine it is not, which then
silently result poor performance for kernels. We fixed it in this PR and
add a guard to avoid unsuccessful future editing and to prevent
convoluted debugging process.
`_CMAKE_HIP_DEVICE_RUNTIME_TARGET ` is shared in
`/opt/rocm/lib/cmake/hip-lang/hip-lang-config.cmake` and it is internal
to
[CMake](https://gitlab.kitware.com/cmake/cmake/-/merge_requests/6121/diffs),
the variable name will not be changed in the foreseeable future.
This PR enables ORT to execute graphs captured by TorchDynamo. Major compilation code is in `OrtBackend.compile` in ort_backend.py. `register_backend.py` is for plugging `OrtBackend` into TorchDynamo as a compiler.
`python setup.py develop` doesn't install PyTorch as a normal package in
site-packages anymore, and the user must stay at PyTorch's root
directory to call `import torch`. This will break LORT tests because
LORT tests contains `import torch` and are called outside PyTorch root
directory. To make PyTorch a normal package again, this PR build PyTorch
with `python setup.py install`.