### Description
Assign MemcpyToHost to a separate stream for a performance gain in the default
DeviceBasedPartitioner.
### Motivation and Context
Our experiments show that putting MemcpyToHost on a separate stream lets
it run in parallel with other kernels, especially compute-intensive
ones.
---------
Co-authored-by: Lei Cao <leca@microsoft.com>
### Description
I fixed some broken links in the C API documentation, then did a
quick pass over all of the links I could find and fixed the broken ones.
### Motivation and Context
I got some 404s when exploring the documentation and wanted to fix them.
### Description
`_infer_Slice()` is a function (arguably the most complex one) in
`symbolic_shape_infer.py` that infers the shape of the output of a
`Slice` node. This commit fixes an edge case in `_infer_Slice()` caused
by a SymPy quirk.
When both the end of the slice (let's call it `e`) and the corresponding
dimension of the sliced tensor (let's call it `dim`) are arbitrary
symbolic expressions, `symbolic_shape_infer.py`
[checks](de7a868d5f/onnxruntime/python/tools/symbolic_shape_infer.py (L1728))
if `e <= dim`. Comparing symbolic expressions is hard in general, so if
the comparison fails, `symbolic_shape_infer.py` [gives
up](de7a868d5f/onnxruntime/python/tools/symbolic_shape_infer.py (L1734))
and assumes that `e` is equal to `dim`.
A failure of this sort currently happens for expressions of the form `Y
- X >= 0` where `Y` contains a `sympy.Min()` (`symbolic_shape_infer.py`
tries to rewrite `X <= Y` comparisons in various ways, and `Y - X >= 0`
is [one of
them](de7a868d5f/onnxruntime/python/tools/symbolic_shape_infer.py (L1664))).
A simple example to illustrate this:
```python
>>> import sympy
>>> X = sympy.Symbol('X', positive=True, integer=True)
>>>
>>> y1 = 9999
>>> Y1 = X + y1 - 5000
>>> bool(Y1 - X >= 0)
True
>>>
>>> y2 = X + 4999
>>> Y2 = X + y2 - 5000
>>> bool(Y2 - X >= 0)
True
>>>
>>> y3 = sympy.Min(y1, y2)
>>> Y3 = X + y3 - 5000
>>> bool(Y3 - X >= 0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../venv/lib/python3.9/site-packages/sympy/core/relational.py", line 511, in __bool__
    raise TypeError("cannot determine truth value of Relational")
TypeError: cannot determine truth value of Relational
```
If you assume that `X` is a positive symbol (`symbolic_shape_infer.py` [does
assume](de7a868d5f/onnxruntime/python/tools/symbolic_shape_infer.py (L2129))
this for graph inputs), then both `Y1 >= X` and `Y2 >= X` hold, and
SymPy can prove this. This means that `Y3 >= X` also holds (since `Y3`
is essentially equal to either `Y1` or `Y2`, depending on the value of
`X`), but this is too hard for SymPy to prove. I confirmed that this is
still the case for the latest SymPy version (`1.11.1`).
This commit tries to fix this edge case by slightly rewriting the
expression containing `sympy.Min()`. I explain the details in the
comments in `symbolic_shape_infer.py`, so I won't duplicate them in the
PR description.
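To make the argument above concrete, here is a minimal sketch of one way to prove such an inequality by case-splitting on the arguments of `sympy.Min()`. This is only an illustration of the reasoning ("`Y3` is equal to either `Y1` or `Y2`"), not necessarily the rewrite used in `symbolic_shape_infer.py`; the helper name `provably_nonnegative` is hypothetical.

```python
import sympy


def provably_nonnegative(expr):
    """Try to prove expr >= 0, splitting on the arguments of any Min() inside it."""
    mins = sorted(expr.atoms(sympy.Min), key=str)
    if not mins:
        try:
            return bool(expr >= 0)
        except TypeError:
            # SymPy could not decide the comparison.
            return False
    m = mins[0]
    # Min(a, b, ...) always equals one of its arguments, so the inequality
    # holds if it holds with the Min replaced by each argument in turn.
    return all(provably_nonnegative(expr.subs(m, arg)) for arg in m.args)


X = sympy.Symbol("X", positive=True, integer=True)
Y3 = X + sympy.Min(9999, X + 4999) - 5000
print(provably_nonnegative(Y3 - X))
```

Each branch reduces to a `Min`-free comparison of the kind SymPy already handles in the `Y1` and `Y2` examples.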
### Motivation and Context
This sounds like a very contrived example, but it actually appeared in
the wild when we tried to infer shapes for an ONNX graph exported from
PyTorch that used relative-position multihead attention from Fairseq.
The problematic line is
[here](7d050ada7d/fairseq/modules/espnet_multihead_attention.py (L192)).
In our codebase, we have something like `matrix_bd = matrix_bd[:, :, :,
: matrix_ac.size(-1)]` before we add `matrix_ac` and `matrix_bd`.
`matrix_bd` is itself a result of another slice, hence its shape
contains `sympy.Min()`, and the SymPy weirdness described above prevents
`symbolic_shape_infer.py` from correctly inferring the final shape of
`matrix_bd`. Then `symbolic_shape_infer.py` explodes when we try to add
`matrix_ac` and `matrix_bd`, because their shapes are not compatible.
I added a small self-contained unit test to illustrate the problem.
*Without* the fix, `slice_out_cropped` has shape `[N + Min(42, N + 21) -
22]`, and `input` has shape `[N]`, and we get this:
```
> python onnxruntime_test_python_symbolic_shape_infer.py
..................Cannot determine if 22 - N < 0
Unable to determine if N <= N + Min(42, N + 21) - 22, treat as equal
E....
======================================================================
ERROR: test_slice_of_min (__main__.TestSymbolicShapeInferenceForSlice)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/dfyz/onnxruntime/onnxruntime/test/python/onnxruntime_test_python_symbolic_shape_infer.py", line 460, in test_slice_of_min
    model = SymbolicShapeInference.infer_shapes(onnx.helper.make_model(graph_def))
  File "/home/dfyz/onnxruntime/onnxruntime/test/python/../../python/tools/symbolic_shape_infer.py", line 2461, in infer_shapes
    raise Exception("Incomplete symbolic shape inference")
Exception: Incomplete symbolic shape inference
----------------------------------------------------------------------
Ran 23 tests in 0.486s
FAILED (errors=1)
```
*With* the fix, both tensors have shape `[N]`, and the test passes.
---------
Co-authored-by: Ivan Komarov <dfyz@yandex-team.ru>
### Fix memory profiler
A follow-up fix for PR
https://github.com/microsoft/onnxruntime/pull/13495
In ORTModule training, `PartialExecuteThePlan` is called twice, so we need
to create the log event after the backward graph run completes in order to
collect activation info for the whole training graph.
Also change some log levels to verbose, to avoid too many logs at levels
above verbose.
### Motivation and Context
### Description
Migrate the extensions dependency from a Git submodule to CMake FetchContent.
### Motivation and Context
---------
Co-authored-by: Changming Sun <chasun@microsoft.com>
Co-authored-by: Jian Chen <jchen351@MacBook-Pro.local>
### Description
### Motivation and Context
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
(1) Add two environment variables to configure the CUDA dumper:
`ORT_TENSOR_SNIPPET_THRESHOLD` and `ORT_TENSOR_SNIPPET_EDGE_ITEMS`
(2) Move the IConsoleDumper definition to a separate file, console_dumper.h.
### Description
This PR addresses the case where an optional Gather node is in the
subgraph pattern. The optional node is now fused with the other nodes
matched in the pattern to create an EmbedLayerNormalization node.
### Motivation and Context
The original subgraph pattern is
```
  Gather   Gather
      \     /
        Add
         |
 LayerNormalization
         |
     Attention
         |
        ...
```
and the new subgraph pattern is
```
                  Gather   Gather
                      \     /
Gather (optional)       Add
        \                |
         LayerNormalization
                 |
             Attention
                 |
                ...
```
### Description
The ONNX file is about 5 MB for a LightGBM model with 500 trees.
onnxruntime uses an additional 10 MB to compute the inference and keeps the
ONNX structure in memory. This PR reduces the memory usage by almost 50%. The
memory used by the ONNX nodes could be freed if there is no optimized
graph to save, but that is not covered by this PR.
### Motivation and Context
Reduce memory usage.
Update stable diffusion benchmark script:
(1) Test GPU memory usage
(2) Change diffusers version to 0.13, and add support for PyTorch 2.0,
including torch.compile
(3) Add support for xformers
(4) Output results to a CSV file
Example of running PyTorch 2.0 with torch.compile:
```
pip3 install numpy --pre torch --force-reinstall --extra-index-url https://download.pytorch.org/whl/nightly/cu117
export TRITON_PTXAS_PATH=/usr/local/cuda-11.7/bin/ptxas
python benchmark.py -e torch -v 1.5 -c 5 -n 1 -b 1 --enable_torch_compile
```
Enable Opset11 Sequence Ops on DirectML, and make the CPU
implementations agnostic to backend EP
Opset 11 introduced the following sequence related operators:
- SequenceAt
- SequenceConstruct
- SequenceEmpty
- SequenceLength
- SequenceErase
- SequenceInsert
- ConcatFromSequence
With the exception of ConcatFromSequence, all of the above operators
were implemented with CPU kernels that a) required all of the contained
tensors to also be on CPU, and b) would clone each tensor into a new
sequence as a side effect of each operator. Sequences themselves are
backend agnostic, as they don't affect the actual tensor layout
or manipulate the contents of the tensors. In addition, with the
exception of SequenceAt, the other operators need not make copies of the
underlying referenced tensors.
Consequently, this change does the following:
1) Sequence* operators (except SequenceAt) no longer copy the contents
of a sequence of tensors on every kernel execution.
2) SequenceAt uses the DataTransferManager to copy tensors in a
backend-agnostic way.
3) The internal container implemented by TensorSeq has changed from
onnxruntime::Tensor to OrtValue. This is because onnxruntime::Tensor
does not support copy construction or assignment, so it must have a
single owner. However, if the same tensor participates in multiple
containers, it would have multiple container "owners", which would not
be possible.
4) Other code that accessed values from TensorSeq has associated
changes to extract Tensors from OrtValues now.
In addition, DirectML execution was very slow when the above Sequence
operators were added to a graph, as this caused MemcpyToHost and
MemcpyFromHost kernels to be inserted between the graph and the sequence
operators. To optimize DirectML,
1) The CPU implementations for the Sequence* ops were registered as DML
implementations. Since the above changes also include making the CPU
kernel implementations EP agnostic, the CPU kernels can be added as is.
2) The ConcatFromSequence operator needed to be implemented on DirectML.
However, there was little DirectML EP operator framework support for
operators that accept/output sequences of tensors. This change
modifies the internal COM interfaces to include new APIs to query
sequence shapes and extract the needed tensors from TensorSeq.
---------
Co-authored-by: Patrice Vignola <vignola.patrice@gmail.com>
The current CUDA kernel for Sigmoid uses the target data type for all
computation. With FP16, it loses precision for some small negative numbers.
For example, for input [-7.8477, 7.3320, -7.8008, 6.6016], the expected
output is [3.9047e-04, 9.9935e-01, 4.0919e-04, 9.9864e-01], but the current
kernel generates [0.0000, 0.9990, 0.0000, 0.9990]. If a
sub-graph contains Sigmoid, such as BinaryCrossEntropyWithLogits, it is
likely to produce NaN as the result.
This PR fixes this by using FP32 for the kernel's internal computation. Note
that the fix does not cause a perf regression: CUDA's _Exp already casts
between half and float, so the fix doesn't introduce an extra cast. We move
the casts to the very beginning and end of the kernel so that the other parts
of the computation are also in FP32 (instead of only Exp).
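The effect can be reproduced on the CPU with a small sketch that emulates FP16 rounding using only the Python standard library. It assumes the kernel uses a common two-branch sigmoid formula (the actual CUDA code is not shown here), and Python floats (double precision) stand in for FP32; the function names are hypothetical.

```python
import math
import struct


def to_half(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def sigmoid_all_half(x):
    """Two-branch sigmoid with every intermediate result rounded to FP16."""
    x = to_half(x)
    if x > 0:
        e = to_half(math.exp(-x))
        return to_half(1.0 / to_half(1.0 + e))
    e = to_half(math.exp(x))
    # For small e, 1 + e rounds to exactly 1 in FP16, so the result collapses to 0.
    return to_half(1.0 - to_half(1.0 / to_half(1.0 + e)))


def sigmoid_wide_inside(x):
    """Same formula in higher precision, rounded to FP16 only once at the end."""
    x = to_half(x)
    if x > 0:
        return to_half(1.0 / (1.0 + math.exp(-x)))
    return to_half(1.0 - 1.0 / (1.0 + math.exp(x)))


for value in [-7.8477, 7.3320, -7.8008, 6.6016]:
    print(value, sigmoid_all_half(value), sigmoid_wide_inside(value))
```

Under these assumptions the all-FP16 path returns exactly zero for the two negative inputs, while casting only at the start and end keeps the small values.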
### Description
Fix the broken link in header file onnxruntime_c_api.h w.r.t. the graph
optimization levels (line 300).
### Motivation and Context
This fix solves open issue #14741
### Description
Refactor the cost check used by the transpose optimizer to separate out
ORT specific logic.
Change the post-layout transform optimization to only skip the cost
check when moving the layout transform nodes around. Fall back to the
normal cost check for all other transpose nodes.
Clean up some const correctness.
Refactor usage of ResizeHandler slightly so the clang-formatting is
nicer.
### Motivation and Context
Address performance issue seen in SNPE model where a non-layout
transpose node was moved. See
https://github.com/microsoft/onnxruntime/pull/14547 for more details.
Improve separation between generic transpose optimization code and any
ORT specific code.
- Fix CoreML API usage memory leak by putting CoreML API prediction call in an `@autoreleasepool` block as suggested in #14455 and [here](https://developer.apple.com/forums/thread/692425). Conservatively wrapping all CoreML API usage.
- Use MLModelConfiguration.computeUnits instead of deprecated MLPredictionOptions.usesCPUOnly (originally in #11382).
We use a customized libc++ that uses raw pointers as `std::vector` iterators.
As per [expr.pre.incr](https://eel.is/c++draft/expr.compound#expr.pre.incr), the built-in `operator++` can only be applied to an lvalue, while `std::vector::begin()` returns an rvalue.
See [this](https://godbolt.org/z/d3a1aKTWP) godbolt snippet for the details.
There is an accuracy regression in the GPT-2 model: the top-1 match rate (vs.
the PyTorch model) drops by about 1%. The cause is that the fused causal
attention uses fp16 accumulation. Disable it by default and add an environment
variable ORT_ENABLE_FUSED_CAUSAL_ATTENTION=1 to turn it on manually.
This change also updates the GPT-2 parity test script to generate left-side
padding to reflect actual usage.
To test:
```
python -m onnxruntime.transformers.models.gpt2.convert_to_onnx -m gpt2 --output gpt2.onnx -o -p fp16 --use_gpu
```
The top1-match-rate in the output is on par with ORT 1.13.1.
### Description
Create a new stream for data copy in the IOBinding input scenario
### Motivation and Context
Previously in bindInput(), a nullptr Stream was passed to copy data across
devices. This caused the default stream to be used, which hurt
performance.
This PR is to fix https://github.com/microsoft/onnxruntime/issues/14484
---------
Co-authored-by: Lei Cao <leca@microsoft.com>
### Description
Reenable disabled opset18 tests
### Motivation and Context
### Fix random failure of ortmodule_api.py::test_unused_parameters
Fix FAILED
orttraining_test_ortmodule_api.py::test_unused_parameters[model1-none_pt_params1]
for orttraining-linux-gpu-ci-pipeline CI pipeline
```
=================================== FAILURES ===================================
________________ test_unused_parameters[model1-none_pt_params1] ________________
model = UnusedMiddleParameterNet(
  (fc1): Linear(in_features=784, out_features=500, bias=True)
  (relu): ReLU()
  (fc2): Linear(in_features=500, out_features=400, bias=True)
  (fc3): Linear(in_features=500, out_features=10, bias=True)
)
none_pt_params = ['fc2.weight', 'fc2.bias']

    @pytest.mark.parametrize(
        "model, none_pt_params",
        [
            (UnusedBeginParameterNet(784, 500, 400, 10), ["fc1.weight", "fc1.bias"]),
            (UnusedMiddleParameterNet(784, 500, 400, 10), ["fc2.weight", "fc2.bias"]),
            (UnusedEndParameterNet(784, 500, 400, 10), ["fc2.weight", "fc2.bias"]),
        ],
    )
    def test_unused_parameters(model, none_pt_params):
        device = "cuda"
        N, D_in, H1, H2, D_out = 64, 784, 500, 400, 10
        model = model.to(device)
        ort_model = ORTModule(copy.deepcopy(model))
        # Make sure model runs without any exception
        for _ in range(5):
            x = torch.randn(N, D_in, device=device)
            y = copy.deepcopy(x)
            out_pt = model(x)
            out_ort = ort_model(y)
            loss_pt = out_pt.sum()
            loss_pt.backward()
            loss_ort = out_ort.sum()
            loss_ort.backward()
            _test_helpers.assert_values_are_close(out_ort, out_pt)
>           _test_helpers.assert_gradients_match_and_reset_gradient(ort_model, model, none_pt_params=none_pt_params)

orttraining_test_ortmodule_api.py:4050:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
_test_helpers.py:216: in assert_gradients_match_and_reset_gradient
    assert_values_are_close(ort_param.grad, pt_param.grad, rtol=rtol, atol=atol)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
```
Initially the test ran fine. As more and more tests were inserted
before ortmodule_api.py::test_unused_parameters, the randomly
generated data changed, and it is now more likely to generate
input data whose result breaks the existing rtol and atol.
For the example value, 0.1041, there is only a very minor diff, e.g. abs_diff:
2.2649765014648438e-06.
> torch.allclose judges the values as not equal because abs_diff > 0.1041 *
rtol + atol = 1.041e-1 * 1e-5 + 1e-6 = 2.041e-6.
> Additionally, according to the math
[here](7b31bcda2e/orttraining/orttraining/test/python/_test_helpers.py (L230)),
the maximum atol is 1.2238311910550692e-06 > current atol (1e-6), and the
maximum rtol is 1.2149855137977283e-05 > current rtol (1e-5).
This PR loosens atol to 1e-5 and rtol to 1e-4.
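The tolerance arithmetic can be checked with a few lines of plain Python. This sketch applies the closeness condition documented for `torch.allclose`, `|actual - expected| <= atol + rtol * |expected|`, to the example value; the helper name `is_close` is hypothetical.

```python
def is_close(actual, expected, rtol, atol):
    """Scalar form of the closeness condition documented for torch.allclose."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


expected = 0.1041
abs_diff = 2.2649765014648438e-06
actual = expected + abs_diff

# Old tolerances: the bound is 1e-6 + 1e-5 * 0.1041 = 2.041e-6, below abs_diff.
print(is_close(actual, expected, rtol=1e-5, atol=1e-6))  # False
# New tolerances: the bound is 1e-5 + 1e-4 * 0.1041 = 2.041e-5, above abs_diff.
print(is_close(actual, expected, rtol=1e-4, atol=1e-5))  # True
```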
### Description
Update java/build.gradle to not use deprecated features that were
removed in gradle 8.0.
Also move gradle wrapper setup from a script into a step template.
### Motivation and Context
Fix builds which use hosted Mac agents and gradle.
Recently the system version of gradle got upgraded to 8.0. Even though
we use an older gradle wrapper version, java/build.gradle is still
processed with gradle 8.0 in the initial call to `gradle wrapper`.
### Description
Update oneDNN version from 2.7 to 3.0
### Motivation and Context
Add a fusion to remove transpose in subgraph like
```
--> Gemm --> Unsqueeze(axes=[2]) --> Unsqueeze(axes=[3]) --> Add --> Transpose([0,2,3,1]) --> GroupNorm
```
With this fusion, we can remove 22 Transpose nodes in UNet, and reduce
latency by 0.1 second per image on T4.
1. Add Softmax warpwise_forward into SoftmaxTunableOp.
2. Make use of TunableOp optional for the Softmax op and use the original
implementation by default.
3. Some other operators use `dispatch_warpwise_softmax_forward
/dispatch_warpwise_softmax_forward/ SoftMaxComputeHelper ` directly, but
they only have files under the cuda directory, so adding `RocmTuningContext`
for these files would require copying and modifying the hipified files. For now,
RocmTuningContext is only set to nullptr by default and the other
operators are not hipified separately.
Related PR: https://github.com/microsoft/onnxruntime/pull/14541
---------
Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
The current configuration gives CPU the highest priority, according to the specification at
https://onnxruntime.ai/docs/api/python/api_summary.html#inferencesession:
> providers – Optional sequence of providers in order of decreasing precedence.
Values can either be provider names or tuples of (provider name, options dict). If not provided,
then all available providers are used with the default precedence.

This change sets the correct precedence for the EPs in the parity utilities for test runs,
ruling out any out-of-order issues when setting up tests for multiple EPs.
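As an illustration of "decreasing precedence", the sketch below builds a provider list that keeps the CPU provider as the final fallback. The helper `order_providers` is hypothetical and is not the code in the parity utilities; only the provider name strings and the `providers` argument of `InferenceSession` come from ONNX Runtime.

```python
CPU_PROVIDER = "CPUExecutionProvider"


def order_providers(available, preferred):
    """Return providers in decreasing precedence, with CPU as the last fallback."""
    ordered = [p for p in preferred if p in available and p != CPU_PROVIDER]
    if CPU_PROVIDER in available:
        ordered.append(CPU_PROVIDER)
    return ordered


# Example: the accelerator EP is listed first, CPU last.
available = ["CPUExecutionProvider", "MIGraphXExecutionProvider", "ROCMExecutionProvider"]
providers = order_providers(available, ["MIGraphXExecutionProvider", "CPUExecutionProvider"])
print(providers)
```

The resulting list could then be passed as `providers=providers` when creating an `onnxruntime.InferenceSession`.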
Co-authored-by: Ted Themistokleous <tthemist@amd.com>
(1) Support packed QKV format in MultiHeadAttention. This format can
avoid the add-bias-transpose step when the TRT fused kernel is used.
(2) Add a cache for the cumulated sequence length computation. For SD, it
only needs to be computed once since the sequence length is fixed.
(3) Do not allocate a qkv workspace for packed KV or QKV, to save memory.
(4) Add unit tests for packed KV and packed QKV format in
MultiHeadAttention
(5) Mark some fusion options as SD only
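The caching idea in item (2) can be sketched in a few lines. This is an illustrative CPU-side analogue, not the CUDA implementation; it assumes that "cumulated sequence lengths" means the start offsets of each sequence in a packed batch, and the function name is hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def cumulated_sequence_lengths(batch_size, sequence_length):
    """Return offsets (0, s, 2s, ..., batch_size * s) for fixed-length sequences."""
    return tuple(i * sequence_length for i in range(batch_size + 1))


# With a fixed sequence length, repeated calls hit the cache instead of recomputing.
first = cumulated_sequence_lengths(2, 77)
second = cumulated_sequence_lengths(2, 77)
print(first, first is second)
```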
Performance tests show a slight improvement on T4. Average latency is reduced
by 0.15 seconds (from 5.25s to 5.10s) for 512x512 in 50 steps for SD 1.5
models. Memory usage drops from 5.1GB to 4.8GB.
### Description
Add FP16 GEMM, including hardware-agnostic driver code, a slow C++ kernel,
and an ARM64 NEON kernel.
### Motivation and Context
First step in creating native support of fp16 model inferencing on ARM64
and AMD64 platforms.
---------
Co-authored-by: Chen Fu <fuchen@microsoft.com>
When running inference on a real GPT-2-based model, we found some gaps between
the CUDA and ROCm codebases.
The fixes include:
1. Minimal code change to fix the tensor shape in the Attention op
2. Support an optional output tensor with SkipLayerNorm
3. Fix a build error found on MI200
---------
Co-authored-by: Ubuntu <ettao@ettao-amd-dev1.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
Add the ability to get and set tuning results of an inference session.
Also add a tool to manipulate an ONNX file to embed the results into the
model file so they are loaded automatically on session initialization.
The third part for stable diffusion CUDA optimizations
(1) Add BiasAdd operator to replace two Add (bias and residual); Add
fusion for BiasAdd
(2) Add Attention fusion for VAE decoder.
(3) Update float16 conversion to handle Resize and GroupNorm. This could
reduce two Cast nodes for each Resize op in fp16 model.
(4) Force inputs and outputs to be float16 to avoid data casts in the
pipeline.
(5) Add options --force_fp32_ops, --inspect etc. in the optimize script so that
users can force some operators to run in float32 to potentially get
better image quality (at the cost of performance).
Performance tests show a slight improvement on T4. Average latency is reduced
by 0.1 seconds (from 5.35s to 5.25s) for 512x512 in 50 steps.
### Description
The custom thread entry point was declared `__stdcall` even though the
API dictated a different type. The cast caused improper stack cleanup,
and the resulting crash manifested only in 32-bit Debug builds.
### Motivation and Context
This addresses https://github.com/microsoft/onnxruntime/issues/14613
### Description
1. fuse rel_pos_bias in T5.
2. remove extended masks in T5 decoder and decoder_init since they
generate all zeros
3. fix a bug in onnx_model.py
### Motivation and Context
---------
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
This has been available since July 25th 2022 in MIGraphX. It appeared to be
missing from the list of supported ops.
https://github.com/ROCmSoftwarePlatform/AMDMIGraphX/pull/1302
### Description
Add in node name for Mod Operator to be supported by MIGraphX
### Motivation and Context
Expand available functionality to Onnxruntime for the MIGraphX EP
Co-authored-by: Ted Themistokleous <tthemist@amd.com>
### Description
Fixes the DML release build for 1.14.1. This was initially fixed by
https://github.com/microsoft/onnxruntime/pull/13417 for 1.13.1, but the
changes didn't make their way back to the main branch.