### Use fully qualified name for PythonOp export
Originally, when `torch.autograd.Function` classes with the same name were
defined in different modules, for example
`a.b.c.Gelu` vs. `d.e.func.<locals>.Gelu`,
we threw an exception by default to make the user aware that we could not
distinguish the two `Gelu` classes, because model export did not take the
module path into account. As a workaround we introduced
`ORTMODULE_SKIPPED_AUTOGRAD_FUNCTIONS` to ignore the duplicate-named
`Gelu` that is not used by the model run. This has an obvious limitation:
it does not help when both `Gelu` classes are used in training.
This PR finds a way to construct a fully qualified name.
`def _export_pt_1_10(g, n, *args, **kwargs):`
1. In the exporter function, `kwargs` contains `name` and `module`. In the
above example:
`a.b.c.Gelu` --> name: `Gelu`, module: `a.b.c`
`d.e.func.<locals>.Gelu` --> name: `Gelu`, module: `d.e`
`name` and `module` are not enough to build a fully qualified name. In
the second case, `d.e` is the module path, the module contains a
function called `func`, and inside that function there is a local
`torch.autograd.Function` named `Gelu` (many of our unit tests look like this). We
can only get `d.e.Gelu`, which is not the correct fully qualified name.
The reason is that `kwargs[name]` or `n.name` only returns the class's
name, not the class's fully qualified name (note that `kwargs[module]` is
correct).
2. `n` is a torch.Node. We can access `pyobj` to get the
torch.autograd.Function's apply method instance, then use `._self` to
get the torch.autograd.Function class. From the class we can get the `module` and
the class's fully qualified name; joined together, they give the full name.
With the above change, we no longer need `kwargs[name]` and
`kwargs[module]`, and no longer need to check for naming conflicts or the
`ORTMODULE_SKIPPED_AUTOGRAD_FUNCTIONS` env var.
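
The naming problem can be reproduced without torch. The sketch below is an illustration only: `Function` is a hypothetical stand-in for `torch.autograd.Function`, and `full_qualified_name` is a made-up helper. It relies on two standard Python facts: a bound classmethod keeps its class in `__self__`, and `__qualname__` (unlike `__name__`) includes the enclosing function of a local class.

```python
class Function:
    """Hypothetical stand-in for torch.autograd.Function (no torch needed)."""

    @classmethod
    def apply(cls, *args):
        return args


class Gelu(Function):
    """Module-level class, comparable to `a.b.c.Gelu`."""


def func():
    class Gelu(Function):
        """Local class, comparable to `d.e.func.<locals>.Gelu`."""

    return Gelu


def full_qualified_name(apply_method):
    # A bound classmethod keeps the class it is bound to in `__self__`.
    func_class = apply_method.__self__
    # `__qualname__` includes `func.<locals>.` for the local class.
    return f"{func_class.__module__}.{func_class.__qualname__}"


top_level_name = full_qualified_name(Gelu.apply)
local_name = full_qualified_name(func().apply)
```

Both classes share the same `__name__`, yet the two computed names differ, which is what makes the skip list unnecessary.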
### Fix a few bugs
1. Symbolic shape inference: there was no `None` check before getting the length.
2. Rename PythonOp/PythonOpGrad's attribute `name` to `func_name`;
otherwise, when we use onnx.helper.make_node to create the node, `name`
conflicts with the node name.
3. Filter shape inference warnings for PythonOp for torch 2.0 or newer.
4. Close file descriptors for log suppression. Without the fix, two extra
fds are left open after the log suppression exits its context.
Before entering log suppression (left), before exiting log suppression (right)

With the fix, no fd is added after the context exits.
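
For illustration, here is a minimal sketch of fd-level output suppression that closes every descriptor it opens. `suppress_fd_output` is a hypothetical name and this is not the ORTModule helper itself; it only shows the pattern of duplicating a descriptor, redirecting it to the null device, then restoring and closing both temporary descriptors on exit.

```python
import contextlib
import os


@contextlib.contextmanager
def suppress_fd_output(fd=1):
    """Redirect `fd` to the null device and leave no extra fds behind."""
    saved_fd = os.dup(fd)  # extra fd no. 1: copy of the original target
    devnull_fd = os.open(os.devnull, os.O_WRONLY)  # extra fd no. 2
    try:
        os.dup2(devnull_fd, fd)
        # Yield the temporary fds so callers can inspect them if needed.
        yield saved_fd, devnull_fd
    finally:
        os.dup2(saved_fd, fd)  # restore the original target
        os.close(devnull_fd)  # without these two closes,
        os.close(saved_fd)  # both descriptors would leak


with suppress_fd_output(1) as temporary_fds:
    os.write(1, b"this line is discarded\n")
```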

### Description
1. Add valgrind to the existing ep_perf CI MemTest and parse ORT-TRT memory leak details
   1. General Valgrind logs and logs related to ORT-TRT will be parsed in [CI artifacts](https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=334122&view=artifacts&pathAsName=false&type=publishedArtifacts)
2. Logic:
   1. Run valgrind with `onnxruntime-perf-test -e tensorrt` and export the log to `valgrind.log`
   2. Identify whether any `definitely lost` memory leak happened
      1. For log paragraphs that show `definitely lost`, check whether they contain the keyword `TensorrtExecutionProvider`.
      2. If so, extract these details to `ort_trt_memleak_detail.log` and return `build failure` to the EP Perf CI
3. Fix the existing addressSanitizer and sync the squeezenet test case with the latest update from [ort-inference-example](https://github.com/microsoft/onnxruntime-inference-examples/blob/main/c_cxx/squeezenet/main.cpp)
   1. Updates in short: upgrade main.cpp to use OrtTensorRTProviderOptionsV2
4. Reorder the 7-min MemTest to run ahead of the 9-hr model tests, and enable MemTest by default
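
The parsing logic above could be sketched roughly as follows. This is not the CI script: the function names and the sample log are made up for illustration, and the sketch assumes that Valgrind prefixes each line with `==<pid>==` and separates paragraphs with prefix-only lines.

```python
import re

# Valgrind prefixes every line with "==<pid>==".
_PREFIX = re.compile(r"^==\d+==\s?")


def split_paragraphs(log_text):
    """Group log lines into paragraphs separated by prefix-only lines."""
    paragraphs, current = [], []
    for raw in log_text.splitlines():
        line = _PREFIX.sub("", raw).rstrip()
        if line:
            current.append(line)
        elif current:
            paragraphs.append("\n".join(current))
            current = []
    if current:
        paragraphs.append("\n".join(current))
    return paragraphs


def find_ort_trt_leaks(log_text, keyword="TensorrtExecutionProvider"):
    """Return the 'definitely lost' paragraphs that mention the keyword."""
    return [
        p
        for p in split_paragraphs(log_text)
        if "definitely lost" in p and keyword in p
    ]


# Made-up sample log, only shaped like Valgrind output.
sample_log = "\n".join(
    [
        "==42== 64 bytes in 1 blocks are definitely lost in loss record 1 of 3",
        "==42==    at 0x4C2: operator new(unsigned long)",
        "==42==    by 0x5D3: onnxruntime::TensorrtExecutionProvider::Compile(...)",
        "==42==",
        "==42== 16 bytes in 1 blocks are definitely lost in loss record 2 of 3",
        "==42==    by 0x6E4: some_other_library_init",
        "==42==",
        "==42== 8 bytes in 1 blocks are possibly lost in loss record 3 of 3",
        "==42==    by 0x7F5: onnxruntime::TensorrtExecutionProvider::GetCapability(...)",
    ]
)

leaks = find_ort_trt_leaks(sample_log)
exit_code = 1 if leaks else 0  # a non-zero code would fail the build
```

Only the first paragraph is reported: the second is a definite leak outside the TensorRT EP, and the third mentions the EP but is only `possibly lost`.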
### Description
Add an option to generate different formats of attention_mask for
testing transformers models:
1 - 1D mask index, actual sequence length excluding padding
2 - 2D attention mask. Value 0 means padding, 1 otherwise.
3 - 1D, key lengths and cumulated sequence lengths of query and key
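
As a rough illustration of the three formats, the sketch below builds each one from per-sequence lengths. `make_attention_mask` is a hypothetical helper, not the test script's function. For format 3 it assumes self-attention (query and key lengths are equal), offsets counted over unpadded tokens, and a concatenation order of key lengths, query offsets, key offsets; the real layout may differ.

```python
def make_attention_mask(sequence_lengths, max_sequence_length, mask_format):
    """Build an attention mask in one of three formats (illustrative only)."""
    if mask_format == 1:
        # 1D mask index: actual length of each sequence, excluding padding.
        return list(sequence_lengths)
    if mask_format == 2:
        # 2D mask: 1 for real tokens, 0 for padding.
        return [
            [1 if i < n else 0 for i in range(max_sequence_length)]
            for n in sequence_lengths
        ]
    if mask_format == 3:
        # 1D: key lengths, then cumulated query and key sequence lengths.
        # Assumption: query lengths equal key lengths (self-attention).
        offsets = [0]
        for n in sequence_lengths:
            offsets.append(offsets[-1] + n)
        return list(sequence_lengths) + offsets + offsets
    raise ValueError(f"unsupported mask_format: {mask_format}")


lengths = [2, 3]
mask_1d = make_attention_mask(lengths, 4, mask_format=1)
mask_2d = make_attention_mask(lengths, 4, mask_format=2)
mask_cumulated = make_attention_mask(lengths, 4, mask_format=3)
```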
### Description
Update scripts for converting a model with MultiHeadAttention to packing
mode.
- [x] Update symbolic shape inference for PackedMultiHeadAttention and
GatedRelativePositionBias
- [x] Update convert_to_packing_mode to handle models with
MultiHeadAttention
### Motivation and Context
Being able to leverage I/O binding for DML and registering `If` for the
DML EP allows us to avoid copying the past/present key/values back and
forth between the CPU and the GPU after every token.
This gives us a 25% performance increase for Dolly V2 with 128 tokens on
an RTX 4090.
### Description
This PR adds support for saving model optimizations after loading a
model that contains external data into an `InferenceSession`.
### Motivation and Context
This PR is a follow-up to a [previous
PR](https://github.com/microsoft/onnxruntime/pull/16716) for saving a
model optimized by an `InferenceSession`.
### Description
Support SmoothQuant for ORT static quantization via Intel Neural
Compressor.
> Note:
Please use neural-compressor==2.2 to try the SmoothQuant function.
### Motivation and Context
For large language models (LLMs) with a very large number of parameters, the
systematic outliers make quantization of activations difficult. As a
training-free post-training quantization (PTQ) solution, SmoothQuant
migrates this difficulty offline from activations to weights with a
mathematically equivalent transformation. Integrating SmoothQuant into
ORT quantization can benefit the accuracy of INT8 LLMs.
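
The "mathematically equivalent transformation" can be illustrated in a few lines of plain Python. This is a sketch of the idea only, not the Intel Neural Compressor API: `matmul` and `smooth` are hypothetical helpers, and the per-channel scale `s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)` assumes non-zero activations and weights in every channel. Dividing activations by `s` and multiplying the matching weight rows by `s` leaves `X @ W` unchanged while shrinking the activation outliers.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def smooth(x, w, alpha=0.5):
    """Move quantization difficulty from activations x to weights w.

    x has shape [tokens, channels]; w has shape [channels, outputs].
    """
    channels = range(len(w))
    act_max = [max(abs(row[j]) for row in x) for j in channels]
    w_max = [max(abs(v) for v in w[j]) for j in channels]
    scale = [act_max[j] ** alpha / w_max[j] ** (1 - alpha) for j in channels]
    x_smoothed = [[row[j] / scale[j] for j in channels] for row in x]
    w_smoothed = [[v * scale[j] for v in w[j]] for j in channels]
    return x_smoothed, w_smoothed, scale


# The second activation channel holds the outliers.
x = [[1.0, 100.0], [-2.0, 50.0]]
w = [[0.5, -0.25], [0.1, 0.2]]

x_smoothed, w_smoothed, scale = smooth(x, w)
original_output = matmul(x, w)
smoothed_output = matmul(x_smoothed, w_smoothed)
```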
---------
Signed-off-by: Mengni Wang <mengni.wang@intel.com>
This will remove transposes that are not needed in the DML kernel. To
keep backward compatibility, the default behavior is to set NHWC when no
attribute is set.
### Description
Disable two PERF* rules in ruff to allow better readability. The rationale
is commented inline. This change also removes the noqa directives that
became unused because of the rule change.
### Motivation and Context
Readability
/builds/devtechproviz/dl/ort-builder/onnxruntime/onnxruntime/python/onnxruntime_pybind_state.cc:388:14:
error: missing initializer for member
'OrtTensorRTProviderOptionsV2::trt_cuda_graph_enable'
[-Werror=missing-field-initializers]
388 | 0};
|
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at
bottom):
* __->__ #16789
Bump ruff to 0.0.278 and fix new lint errors. I added noqa to all
existing RUF012 errors (the rule requires mutable class variables to be
annotated with `ClassVar`), as well as to all PERF issues.
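
For context, this is the kind of annotation RUF012 asks for. `OpRegistry` is a hypothetical class used only for illustration; it is not taken from the code base.

```python
from typing import ClassVar, Dict, List


class OpRegistry:
    # Mutable and intentionally shared by all instances, so RUF012
    # expects the ClassVar annotation.
    registered_ops: ClassVar[Dict[str, int]] = {}

    def __init__(self) -> None:
        # Per-instance mutable state is created in __init__ instead.
        self.pending: List[str] = []

    def register(self, name: str, version: int) -> None:
        self.registered_ops[name] = version
        self.pending.append(name)


first, second = OpRegistry(), OpRegistry()
first.register("Gelu", 1)
```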
Signed-off-by: Justin Chu <justinchu@microsoft.com>
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at
bottom):
* #16789
* __->__ #16788
This change fixes the N802 lint errors by renaming the test case to use
snake case.
### Description
A [previous PR](https://github.com/microsoft/onnxruntime/pull/16531)
added a temporary directory to save the model optimizations after
loading a model into an `InferenceSession`. Many models that have an
external data file, however, require the data file to be in the same
directory as the ONNX model file. Because the model is saved in a
temporary directory and the data is saved in another directory, this
causes a `FileNotFoundError` when trying to load the model in the
temporary directory.
This PR fixes this error by saving the external data file in the same
directory that the optimized model is located in.
### Motivation and Context
This PR fixes a bug with using a temporary directory while running the
optimizer for models that have an external data file.
### Description
Fix some issues found in GPT-NeoX graph fusion:
(1) GPT-NeoX uses float16 weights. The step that runs onnxruntime with
opt_level==1 uses the CPU provider. Since most operators do not have fp16
support in the CPU EP, extra Cast nodes are added to upcast to fp32.
(2) When an Add is shared by two LayerNormalization children,
SkipLayerNormalization fusion might cause an invalid graph.
(3) Reshape fusion might be missed since some parts only check for an initializer but
not for Constant.
This PR adds a check for whether the model uses FP16, outputs a warning when
use_gpu is not True, and uses the GPU provider for graph optimization when use_gpu=True.
GemmSoftmaxGemmTunable occasionally breaks with large numerical error.
The root cause is that CK's Strided Batched Gemm has a larger
error under a specific initialization distribution
`(multinormal_distribution)`.
The Generic (Gemm1 + Softmax + Gemm2) implementation is one instance of
GemmSoftmaxGemmTunable. Gemm1 and Gemm2 in the Generic implementation are
TunableOps when tuning is enabled. In some cases GemmSoftmaxGemmTunable
selects the Generic implementation while Gemm1 or Gemm2 selects the CK
implementation, so the result of GemmSoftmaxGemmTunable is affected by CK.
- Loosen the tolerance.
- Add `GemmSoftmaxGemmPermuteGenericNestedTunable` to test the Generic
implementation with tuning enabled.
(1) Fix a bug in https://github.com/microsoft/onnxruntime/pull/16560:
the fp16 flag shall be set for UNet.
(2) Remove wget from requirements since it is no longer needed.
(3) Add benchmark numbers for A100-PCIE-80GB. Note that the CUDA EP has an
issue running with batch size 4, so that number is not added.
### Description
- Add hipBLASLt tuning logic in place of the default hipBLASLt
implementation;
- Add kernel explorer for hipBLASLt.
Related operators: Gemm, StridedBatchedGemm, and GemmFastGelu.
Temporarily mark algos that require extra workspace as unsupported.
Workspace support will be added in a later PR, which will change the Gemm Params
definition and affect multiple files.
### Description
- Update existing rocBLAS get_solutions API using
`*_get_solutions_by_type` (supported from ROCm5.6); remove the original
nested TunableOp logic.
- Update kernel_explorer.
This PR mainly optimizes the ROCm CI tests to reduce time and CPU utilization.
- Use a smaller batch size in the strided_batched_gemm/batched_gemm tests
- Disable the CPU training test
- Fix occasional failures of test_e2e_padding_elimination on ROCm.
### Description
This PR adds support for rotary embeddings in decoder masked
self-attention
### Motivation and Context
---------
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
### Description
Add Stable Diffusion Text2Image pipelines for the TensorRT EP and the CUDA EP.
They can automatically export and optimize the ONNX model, and create an
ONNX Runtime session that uses the TensorRT or CUDA execution provider.
Add support for benchmarking TensorRT.
Add support for cuda graph. The feature is only supported in the nightly
package right now.

Engine/Provider to test | command line
---- | ---
CUDA EP | `python benchmark.py -v 1.5`
CUDA EP with cuda graph | `python benchmark.py -v 1.5 --enable_cuda_graph`
TensorRT EP | `python benchmark.py -v 1.5 -r tensorrt`
TensorRT EP with cuda graph | `python benchmark.py -v 1.5 -r tensorrt --enable_cuda_graph`
TensorRT | `python benchmark.py -v 1.5 -e tensorrt`

Add benchmark numbers of T4 GPU using CUDA 11.7, cuDNN 8.5, PyTorch
1.13.1+cu11.7, TensorRT 8.6.1, onnxruntime-gpu 1.15.1 (or
ort-nightly-gpu 1.16 for cuda graph).
TODO: add benchmark numbers of A100-80GB
Kernel explorer has lots of tests and needs numpy to verify the results
of GPU kernels, which makes CPU utilization very high. This PR uses
`cupy` in place of `numpy` to do the computation on the GPU and reduce CPU
utilization.
Set `KERNEL_EXPLORER_TEST_USE_CUPY=1` to enable cupy.
The `GemmSoftmaxGemmPermuteTunableOp<HipT>` is expensive to construct;
avoiding the ctor invocation substantially improves the launch time and
gives better performance during decoding. This gives a <7% e2e time
reduction for Whisper large.
- Fix some warnings from Xcode build (`-Wshorten-64-to-32`).
- Enable `-Wshorten-64-to-32` warning if available. Currently it's not fully enabled for `onnxruntime_test_all` and `onnxruntime_providers_xnnpack` yet.
- Some clean up in build.py including setting CMake generator more consistently.
### Description
Add a greedy option to the initializer deduplication process in the
Whisper export.
Currently, to detect shared initializers, ORT compares every initializer
against every other initializer (n^2). In the comparison operator, if
the two initializers have different data types (e.g. raw_data and
int_64), both initializers are converted to a numpy array and the cast
result is compared. This cast happens in every comparison, so its cost
grows quadratically (not linearly) with the number of initializers. This
cast operation is the bottleneck for the current Whisper export script.
The conversion to the numpy array is useful for detecting equal
initializer values across nodes of different data types (e.g.
recognizing a bias value of 0.0 is the same as a slice index of 0) but
isn't triggered when comparing initializers of the same data type (e.g.
weight value of 0.6 == weight value of 0.6). The latter case is where
the majority of utility is for Whisper, and so by eliminating our path
for comparing numpy arrays for initializers we save a lot of time for
minimal cost.
In other words, this PR adds an option to remove the ability to detect
shared initializers of different types (e.g. Slice Index and MatMul
Constant) while retaining the ability to deduplicate weights.
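
The trade-off can be sketched as follows. This is an illustration and not the export script's implementation: `Initializer` and `greedy_deduplicate` are hypothetical, and hashing on `(data_type, dims, raw bytes)` is just one simple way to show "match only initializers of the same type", without any numpy conversion.

```python
import struct
from collections import namedtuple

# Hypothetical, simplified view of an ONNX initializer.
Initializer = namedtuple("Initializer", ["name", "data_type", "dims", "raw_data"])


def greedy_deduplicate(initializers):
    """Map each duplicate name to the first initializer with the same
    data type, shape and bytes. Values stored under different data types
    are never considered equal."""
    first_seen = {}
    replacements = {}
    for init in initializers:
        key = (init.data_type, tuple(init.dims), bytes(init.raw_data))
        if key in first_seen:
            replacements[init.name] = first_seen[key]
        else:
            first_seen[key] = init.name
    return replacements


initializers = [
    Initializer("weight_a", "float32", (1,), struct.pack("<f", 0.6)),
    Initializer("weight_b", "float32", (1,), struct.pack("<f", 0.6)),
    Initializer("bias", "float32", (1,), struct.pack("<f", 0.0)),
    Initializer("slice_index", "int64", (1,), struct.pack("<q", 0)),
]

replacements = greedy_deduplicate(initializers)
```

The two equal weights are merged, while the float `0.0` bias and the int64 `0` slice index stay separate, which is the capability this option gives up.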
### Motivation and Context
- Current time to export Whisper-large is prohibitive.
---------
Co-authored-by: Peter McAughan <petermca@microsoft.com>
`optimize_model` generates a temporary model in the current model
folder. Most of the time, this is fine.
However, the scenario breaks when the function runs against an input
model mounted from AzureML. In that case, the mounted folder is read-only.
We have to copy the model to another temp folder before calling optimize_model
to work around this issue. Otherwise, optimize_model fails when
creating the optimized model in the read-only folder. However, copying the model
is painful, especially when the model is huge.
This PR exposes optimized_model_path at the optimize_model level so
that the caller can decide where to save the temp model.
### Description
Remove AllocatorManager class
### Motivation and Context
After the refactor PR #15833 is in, AllocatorManager class is not
referenced anymore.
### Description
Before transformers 4.27, the causal mask used the uint8 data type, so there
was an extra Cast node to convert it to bool. This adds a pattern
without the Cast node to support attention fusion for GPT-2 models exported
with transformers >= 4.27.
### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/16453
Python script to modify an ONNX model to align it with the converted QNN
model
### Description
The ONNX Runtime QNN EP supports context binary files generated by the QNN tool chain. However, a QNN-generated context binary file uses channel-last layout and 8-bit or 16-bit data for inputs and outputs. This script gets the QNN model's input and output information from the QNN-converted model_net.json file and inserts Cast and Transpose nodes into the ONNX model if required.
### Description
protobuf CopyFrom doesn't work for models > 2GB with version 4.23. This PR
removes the copy in Calibrator.
Currently Calibrator copies the ModelProto to avoid changing it. The
reason is that quantize_static passes a ModelProto to Calibrator to
calibrate quantization parameters, and then uses it for quantization. If
the calibrator changes the ModelProto, quantization won't work.
This PR changes quantize_static to pass a model path to Calibrator
instead of a ModelProto, and makes Calibrator take only a model path as
input, which is how it is used in most cases.
This PR also removes the optimization from quantization. Users need to
call pre-process to optimize the model.
### Description
This PR fixes a typo with assigning the `repetition_penalty` input in
the Whisper export with beam search model. It is a follow-up to the
[export stabilization
PR](https://github.com/microsoft/onnxruntime/pull/16297).
### Motivation and Context
The `repetition_penalty` input should be set to `repetition_penalty`
instead of `input_features`.
The CUDA EP already supports [CUDA
graph](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs),
and we have observed that some models benefit from using CUDA graph with
`trtexec`. Therefore, this PR enables CUDA graph support for the TRT EP.
The implementation is based on
https://github.com/microsoft/onnxruntime/pull/9978 with the same
[constraints](https://github.com/microsoft/onnxruntime/pull/9978) as
below:
- Models with control-flow ops (i.e. If, Loop and Scan ops) are not
supported.
- Usage of CUDA Graphs is limited to models in which all the model ops
(graph nodes) can be partitioned to the TRT EP.
- The input/output types of models need to be tensors.
- Shapes of inputs/outputs cannot change across inference calls.
- I/O binding is required.
### Description
This PR is to refactor ExecutionProvider API for memory management,
which is to move allocators from EP level to SessionState level and
indexed by OrtDevice
### Motivation and Context
This PR refactors the ExecutionProvider API for memory management by
moving allocators from the EP level to the SessionState level,
indexed by OrtDevice. With this change, EPs no longer carry the burden of
maintaining allocators, which is friendlier for EP developers.
---------
Co-authored-by: Lei Cao <leca@microsoft.com@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
### Description
This PR stabilizes the Whisper export with beam search by adding the
following:
- Remove unused ONNX models and extra folders generated during the
export process
- Specify the Whisper with beam search model's IR version for E2E
integration
- Parity check for Whisper with beam search model between PyTorch and
ORT
- Remove previously exported Whisper with beam search model before
saving newly exported model
### Motivation and Context
- Removing the unused ONNX models and extra folders frees up disk space
after exporting and makes it easier to copy and move the output folder
to other environments.
- Specifying the IR version fixes an issue with generating the ONNX E2E
model
- Adding a parity check helps detect runtime issues during the export
process
- Removing the previously exported Whisper with beam search model
prevents the data file size from doubling when the newly exported model
is saved with the same filename
The Whisper model contains five different types of attention. The q, k, v bias
was fused into the Attention/MHA/DMHA ops.
encoderdecoderinit subgraph
- Attention: encoder attention
- Attention: decoder self attention + present k, v
- MultiHeadAttention: decoder cross attention + present k and v. q and v
have bias.
decoder subgraph
- DecoderMultiHeadAttention: decoder cross attention + past k, v. q has
bias
- DecoderMultiHeadAttention: decoder self attention + past/present k, v.
q, k, v have bias.
For the ROCm EP, MHA/DMHA doesn't support additional bias. This PR adds a
fusion option `disable_multi_head_attention_bias` to split the q, k, v bias
from MHA/DMHA.
### Description
This PR adds flags for exporting Whisper with vocab masks for logits
processing. This PR also sets `input_features` back to FP32 precision
for the user and casts `input_features` to FP16 precision when needed.
### Motivation and Context
This helps enable specific logits processing for the exported Whisper
model.