### Description
Update the MIGraphX version used in ORT to rocm-5.4.0
### Motivation and Context
The previous branch migraphx_for_ort has stopped being updated and has drifted too far
from the latest MIGraphX release branch. More discussion here:
https://github.com/microsoft/onnxruntime/issues/14126#issuecomment-1373201049
Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
### Description
1. Set the WithCache default value to false in the macOS CI workflow as well.
2. Add today's date to the cache key so the cache size doesn't keep growing.
With WithCache, the pipeline duration dropped from over 70 minutes to about 10
minutes.
### Description
Fix unconnected node removal logic
### Motivation and Context
The edges need to be removed before the nodes themselves, otherwise the
indices will reference the wrong nodes.
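The ordering constraint can be sketched with a toy index-based graph (a hypothetical representation for illustration, not the actual DML EP data structures): because edges reference nodes by index, the edges touching a node must be dropped, and the surviving edge indices remapped, before the node itself is erased.

```python
# Hypothetical sketch: nodes are stored in a list and edges reference them
# by index, so deleting a node shifts the indices of every node after it.
# Removing the edges that touch a node *before* erasing the node (and then
# remapping the surviving edge indices) keeps the remaining edges valid.
def remove_node(nodes, edges, dead):
    # Drop every edge that touches the node being removed.
    edges = [(s, d) for (s, d) in edges if s != dead and d != dead]
    # Erase the node, then shift the indices of all edges past it.
    del nodes[dead]
    return [(s - (s > dead), d - (d > dead)) for (s, d) in edges]

nodes = ["A", "B", "C", "D"]
edges = [(0, 1), (1, 3), (2, 3)]     # A->B, B->D, C->D
edges = remove_node(nodes, edges, 2) # remove "C"
# nodes == ["A", "B", "D"]; edges == [(0, 1), (1, 2)]
```

Deleting the node first would leave `(2, 3)` pointing at a shifted slot, which is exactly the wrong-node bug described above.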
### Description
The DML EP was using a very old feature level (2.0), which can cause execution
failures for models using the latest operators when they run against an old
DirectML.dll.
### Motivation and Context
### Description
Remove unconnected nodes from the DML EP graph.
### Motivation and Context
Some operators like `EmbedLayerNorm` have many outputs, and some of those
outputs are non-optional. In practice, however, they act like optional outputs
because they can have a value of 0, meaning the rest of the model doesn't need
to depend on them. The problem is that DML will implicitly remove those outputs
from the graph, but the nodes that feed into them will stay and become
unconnected from the rest of the graph, which is illegal in DML. Removing
unconnected nodes as a last pass makes sure those nodes get removed and
simplifies the logic of individual operators by not having to account for these
special cases.
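A last-pass pruning of this kind can be sketched as follows (a toy data model with assumed names, not the actual implementation): drop every node whose outputs neither feed another node nor are graph outputs, and iterate, since removing one node can orphan its producers.

```python
# Minimal sketch (hypothetical data model, not the DML EP's): prune, as a
# last pass, every node whose outputs are neither consumed by another node
# nor graph outputs. Iterate to a fixed point because deleting a node can
# orphan the nodes that fed into it.
def prune_unconnected(nodes, graph_outputs):
    # nodes: dict name -> (input_names, output_names)
    changed = True
    while changed:
        changed = False
        consumed = {i for ins, _ in nodes.values() for i in ins}
        for name in list(nodes):
            _, outs = nodes[name]
            if not any(o in consumed or o in graph_outputs for o in outs):
                del nodes[name]
                changed = True
    return nodes

nodes = {
    "embed": (["x"], ["h", "mask_out"]),   # mask_out is effectively unused
    "mask_chain": (["mask_out"], ["m2"]),  # feeds nothing downstream
    "head": (["h"], ["y"]),
}
prune_unconnected(nodes, {"y"})
# "mask_chain" is removed; "embed" and "head" stay.
```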
### Description
This CUDA op implements the compute_bias() method in T5 Attention,
including the permutation.
Notes:
1. bias_table needs to be saved in column-major order; be careful when
implementing the fusion script.
2. The second input (sequence length) is placed on the CPU (using a Shape
node's output should be fine).
3. The first dimension of the output is 1, so extra_add_qk in Attention
should support broadcasting.
4. compute_bias() is only used in self-attention in T5.
TODO: docs change will be applied later.
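The broadcasting point in note 3 can be shown shape-wise (shapes assumed for illustration; this is not the actual bucketing logic of T5's compute_bias): the op emits a bias with leading dimension 1, so adding it to the attention scores relies on broadcasting over the batch dimension.

```python
import numpy as np

# Sketch of note 3 only: the bias has shape [1, num_heads, seq_len, seq_len],
# so adding it to scores of shape [batch, num_heads, seq_len, seq_len]
# broadcasts over the leading batch dimension.
batch, num_heads, seq_len = 2, 8, 16
scores = np.random.rand(batch, num_heads, seq_len, seq_len).astype(np.float32)
bias = np.random.rand(1, num_heads, seq_len, seq_len).astype(np.float32)

out = scores + bias  # the same bias is applied to every batch entry
assert out.shape == (batch, num_heads, seq_len, seq_len)
```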
### Motivation and Context
It's part of the effort to optimize T5 attention as well as T5-based
generation models.
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
Move separated Q, K and V (without input projection) from Attention to a
new operator, CrossAttention.
The Attention operator is hard to maintain when we need to support both with
and without input projection in one class, so a new operator is added
according to feedback.
Some changes might be needed in the future, but not in this PR:
(1) bias could be optional (we will not proceed down that route unless
experiments show that fusing the bias Add with MatMul instead of this op
improves performance).
(2) support packed KV. There are two ways to support it: when key and
value are the same tensor, they are packed; or we can make value
optional, and use packed mode when value is empty and the key has packed
K/V.
(3) support cached key and value, and other inputs (like relative position
bias), or more attention mask formats. They can be added easily without
breaking backward compatibility.
(4) ROCm/CPU implementation of this op.
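The core computation of such an op can be sketched in NumPy (names and shapes are assumptions for illustration, not the operator's actual signature): scaled dot-product cross-attention over already-projected Q, K and V, i.e. with no input projection inside the op.

```python
import numpy as np

# Illustrative sketch, not the operator's actual implementation:
# cross-attention over already-projected Q, K, V (no input projection).
def cross_attention(q, k, v):
    # q: [batch, heads, q_len, head_dim]; k, v: [batch, heads, kv_len, head_dim]
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)  # softmax over kv_len
    return weights @ v

q = np.random.rand(2, 4, 5, 8)
k = np.random.rand(2, 4, 7, 8)
v = np.random.rand(2, 4, 7, 8)
out = cross_attention(q, k, v)
assert out.shape == (2, 4, 5, 8)  # q_len rows attend over kv_len entries
```

Packed-KV mode (point 2 above) would only change how k and v are unpacked before this computation, not the attention math itself.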
…threadpools' options of The Env.
### Description
Add a C++ class ThreadingOptions that wraps OrtThreadingOptions, as
described in issue #13710.
### Motivation and Context
Closes #13710
Co-authored-by: zengxiangneng <zengxiangneng@360.cn>
### Description
In the case where a Q node has multiple DQ children, we want to keep only
one DQ. The single remaining DQ channels its output to the outputs of the
deleted DQ children.
Example:
Q->N(DQ) => Q->DQ
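The rewiring can be sketched on a toy node representation (hypothetical structure, not the actual ORT graph API): keep the first DQ child and redirect every consumer of the other DQs' outputs to the survivor's output.

```python
# Toy sketch of the Q->N(DQ) => Q->DQ rewrite (hypothetical node dicts):
# keep one DQ child of the Q node and point consumers of the removed DQs'
# outputs at the surviving DQ's output.
def dedup_dq(nodes, q_output):
    dq = [n for n in nodes if n["op"] == "DequantizeLinear" and n["input"] == q_output]
    keep, drop = dq[0], dq[1:]
    dropped_outputs = {n["output"] for n in drop}
    nodes = [n for n in nodes if n not in drop]
    for n in nodes:  # rewire consumers of the dropped DQs
        if n["input"] in dropped_outputs:
            n["input"] = keep["output"]
    return nodes

nodes = [
    {"op": "DequantizeLinear", "input": "q_out", "output": "dq0"},
    {"op": "DequantizeLinear", "input": "q_out", "output": "dq1"},
    {"op": "Relu", "input": "dq1", "output": "r0"},
]
nodes = dedup_dq(nodes, "q_out")
# One DQ remains and Relu now reads from its output "dq0".
```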
### Motivation and Context
### Description
1. The graph pattern search introduced in
https://github.com/microsoft/onnxruntime/pull/13914/ needs to be
enhanced so that SkipLayerNormalization is supported
2. Fix fp32 parity for GPT-2 while using `SkipLayerNormalization`
fusion. The optional output of SLN needs to also include the bias (if
present) and the added output should be a sum of `input + skip + (bias)`
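The fixed semantics of item 2 can be written out as a reference sketch (assumed shapes, NumPy for clarity): the optional output must be the full pre-norm sum `input + skip + bias` (bias included when present), and the main output is LayerNorm applied to that same sum.

```python
import numpy as np

# Reference sketch of the corrected SkipLayerNormalization semantics:
# the optional second output is the pre-norm sum input + skip + (bias).
def skip_layer_norm(x, skip, gamma, beta, bias=None, eps=1e-12):
    s = x + skip + (bias if bias is not None else 0.0)
    mean = s.mean(-1, keepdims=True)
    var = s.var(-1, keepdims=True)
    y = (s - mean) / np.sqrt(var + eps) * gamma + beta
    return y, s  # s is the optional sum output

x = np.random.rand(2, 4, 8).astype(np.float32)
skip = np.random.rand(2, 4, 8).astype(np.float32)
bias = np.random.rand(8).astype(np.float32)
gamma, beta = np.ones(8, np.float32), np.zeros(8, np.float32)
y, s = skip_layer_norm(x, skip, gamma, beta, bias)
assert np.allclose(s, x + skip + bias)  # bias must be part of the sum
```

Omitting the bias from `s` while the kernel includes it in the normalization is exactly the kind of fp32 parity break described above.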
### Motivation and Context
Fix some breaking tests
### Description
Fix the error https://github.com/microsoft/onnxruntime/issues/14126
### Motivation and Context
Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
### Description
Introduce a runtime flag in SessionState indicating whether any EP in the
current session uses the stream feature; if none does, avoid taking the
lock. This avoids impacting the CPU build.
### Motivation and Context
Currently we take a lock in SessionState when retrieving the device stream
collection. This is mainly for reusing the device stream for EPs like the
GPU EPs, so it shouldn't impact builds that don't use the stream feature,
like the CPU build. Instead of playing with build flags, this PR introduces
a runtime flag in SessionState to indicate whether the current session has
any EP that uses the stream feature; if not, we don't need to take the
lock.
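The idea can be sketched as follows (a hedged Python illustration, not the actual SessionState code): record at initialization whether any EP uses streams, and only take the lock on the retrieval path when that flag is set.

```python
import threading
from contextlib import nullcontext

# Sketch of the runtime-flag idea: a session with no stream-using EP
# never touches the lock on the device-stream retrieval path.
class SessionStateSketch:
    def __init__(self, eps):
        # Decide once, at session-initialization time.
        self.has_stream_ep = any(ep.get("uses_streams") for ep in eps)
        self._lock = threading.Lock()
        self._streams = {}

    def get_device_stream(self, device):
        guard = self._lock if self.has_stream_ep else nullcontext()
        with guard:  # lock only when streams are actually in play
            return self._streams.setdefault(device, object())

cpu_session = SessionStateSketch([{"name": "CPU", "uses_streams": False}])
gpu_session = SessionStateSketch([{"name": "CUDA", "uses_streams": True}])
assert not cpu_session.has_stream_ep and gpu_session.has_stream_ep
```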
Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
### Description
Add GemmFastGelu CK implementation.
TODO
1. The performance of CK GemmFastGelu in ORT is not as good as using CK
directly; we still need to investigate the reason and improve the CK
integration in ORT.
```
GemmFastGeluUnfused float16 NN m=49152 n=3072 k=768 2298.8064 us 100.89 tflops
withbias DeviceGemmMultipleD_Xdl_CShuffle<256, 256, 128, 32, 8, 8, Default> LoopScheduler: Default, PipelineVersion: v1 float16 NN m=49152 n=3072 k=768 2401.9799 us 96.56 tflops
```
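For reference, the fused op's math can be sketched in NumPy (the tanh-based FastGelu approximation; the CK kernel details are out of scope here): GemmFastGelu(X, W, b) = FastGelu(X @ W + b).

```python
import numpy as np

# Reference sketch of what GemmFastGelu computes (tanh FastGelu
# approximation applied to the GEMM-plus-bias result).
def fast_gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def gemm_fast_gelu(x, w, bias):
    return fast_gelu(x @ w + bias)

x = np.random.rand(4, 768).astype(np.float32)
w = np.random.rand(768, 3072).astype(np.float32)
b = np.random.rand(3072).astype(np.float32)
out = gemm_fast_gelu(x, w, b)
assert out.shape == (4, 3072)
```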
### Motivation and Context
Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
### Description
Fixes Prefast C26814
```shell
onnxruntime::contrib::cuda::QAttention<onnxruntime::MLFloat16,signed char>::ComputeInternal
onnxruntime/contrib_ops/cuda/quantization/attention_quantization.cc
The const variable 'element_size' can be computed at compile-time. Consider using constexpr (con.5).
```
### Description
- Adds a new C API `OrtApi::RegisterCustomOpsLibrary_V2` that manages
the lifetime of dynamic library handles (i.e., calls `dlclose` or
`FreeLibrary`).
- Deprecates C API `OrtApi::RegisterCustomOpsLibrary`.
- Adds C++ API wrapper for convenient registering of custom op
libraries.
- `PySessionOptions` is now an alias of `OrtSessionOptions`
### Motivation and Context
The current API for registering custom op libraries loads dynamic
libraries but requires users to handle the release of the corresponding
library handles. Additionally, the user has to make sure to release the
library handle _after_ the session has been destroyed (or the program
segfaults).
The new API automatically cleans up the library and allows the user to
write more straightforward code.
### Description
T5 uses a layer_norm that only scales and doesn't shift, also known as
Root Mean Square Layer Normalization.
ORT already has simplified_layer_norm, which is the RMS layer_norm.
This PR extends this T5 layer_norm with support for skip/bias and the
residual output.
The new op is named SkipSimplifiedLayerNorm and has a similar interface
to SkipLayerNorm but removes the beta input.
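The described semantics can be sketched in NumPy (assumed shapes, for illustration only): RMS normalization scales but does not shift (no beta), applied to the residual sum `input + skip + bias`, which is also returned as the residual output.

```python
import numpy as np

# Sketch of SkipSimplifiedLayerNorm semantics: RMS normalization
# (scale only, no beta/shift) of the residual sum, which is also emitted.
def skip_simplified_layer_norm(x, skip, gamma, bias=None, eps=1e-6):
    s = x + skip + (bias if bias is not None else 0.0)
    rms = np.sqrt((s * s).mean(-1, keepdims=True) + eps)
    return s / rms * gamma, s  # no beta term, unlike SkipLayerNorm

x = np.random.rand(2, 4, 8).astype(np.float32)
skip = np.random.rand(2, 4, 8).astype(np.float32)
gamma = np.ones(8, np.float32)
y, s = skip_simplified_layer_norm(x, skip, gamma)
assert y.shape == x.shape
```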
### Motivation and Context
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
### Description
Decouple the DML bucketized allocator from the individual block
allocation logic
### Motivation and Context
This is the first step toward using tiled/placed resources instead of
committed resources. Given the potential impact of changing the
allocation logic and the large number of edge cases, I decided to take a
step-by-step approach. It will also keep the PRs at a reasonable length
while making sure each PR has a single responsibility.
Decoupling the logic this way will make it easier in the future to plug
in different kinds of "suballocators" if we want to experiment with the
allocation logic. Currently, the only suballocator is a committed
resource, but placed resources are the next step and will come in a
future PR.
### Description
When a custom decoder ONNX model is passed in, the user can specify the
eos/pad token ids instead of populating them from the torch config.
### Motivation and Context
### Description
"Constant Folding" needs to be enhanced to support "function" in the ONNX
spec. If those nodes are inlined into a sub-graph and captured by an EP,
especially one that doesn't support them, an error occurs.
There are many test failures in ONNX 1.13 against NNAPI; they are listed
below:
```
prelu_broadcast_expanded
selu_example_expanded_ver18
layer_normalization_2d_axis0
shrink_hard_expanded_ver18
elu_expanded_ver18
softsign_example_expanded_ver18
leakyrelu_example_expanded
hardsigmoid_example_expanded_ver18
thresholdedrelu_default_expanded_ver18
split_variable_parts_2d_opset18
efault_expanded
prelu_example_expanded
thresholdedrelu_example_expanded_ver18
selu_default_expanded_ver18
elu_example_expanded_ver18
hardsigmoid_default_expanded_ver18
softsign_expanded_ver18
hardsigmoid_expanded_ver18
leakyrelu_expanded
scatter_with_axis
selu_expanded_ver18
shrink_soft_expanded_ver18
relu_expanded_ver18
thresholdedrelu_expanded_ver18
elu_default_expanded_ver18
```
Solution: prevent NNAPI from capturing these for now; we can revert this
once better constant folding is implemented.
### Motivation and Context
The latest torch exporter changed the LayerNorm export code to add two
more Cast nodes (to make the computation logically correct), but our
current LayerNormFusion doesn't support the new pattern. This PR adds
support for it.
### Description
Add today's date to the cache key.
### Motivation and Context
The Microsoft-hosted agent has only 10 GB for the build.
To limit cache size, the pipeline only uses caches generated today.
### Description
This PR is to address follow-up comments for the multi-stream pr
https://github.com/microsoft/onnxruntime/pull/13495
Changes include:
- Make StreamAwareArena transparent to minimal build
- Make DeviceStreamCollection transparent to minimal build
- Replace ORT_MUST_USE_RESULT with [[nodiscard]]
- Remove unnecessary shared_ptr
### Motivation and Context
This PR is to address follow-up comments for the multi-stream pr
https://github.com/microsoft/onnxruntime/pull/13495
Co-authored-by: Lei Cao <leca@microsoft.com>
### Description
1. Renames all references of on-device training to training apis. This is
to keep the naming general; nothing really prevents us from using the
same apis on servers/non-edge devices.
2. Updates the ENABLE_TRAINING option: with this PR, when this option is
enabled, training apis and torch interop are also enabled.
3. Refactors the onnxruntime_ENABLE_TRAINING_TORCH_INTEROP option:
- Removed the user-facing option
- Set onnxruntime_ENABLE_TRAINING_TORCH_INTEROP to ON when
onnxruntime_ENABLE_TRAINING is ON, as we always build with torch interop.
Once this PR is merged, selecting --enable_training will produce a
"FULL Build" for training (with all the training entry points and
features).
Training entry points include:
1. ORTModule
2. Training APIs
Features include:
1. ATen Fallback
2. All Training OPs includes communication and collectives
3. Strided Tensor Support
4. Python Op (torch interop)
5. ONNXBlock (front-end tools for training artifact prep when using
training apis)
### Motivation and Context
The intention is to simplify the options for building training-enabled
builds. This is part of the larger work item to create a dedicated build
for learning-on-the-edge scenarios with just the training apis enabled.
### Description
1. SkipLayerNormalization has a new output
(https://github.com/microsoft/onnxruntime/pull/13988) and the symbolic
shape inference script needs corresponding updates
2. The greedy sampling op
(https://github.com/microsoft/onnxruntime/pull/13426) shouldn't re-use
the logits buffer as its corresponding kernel doesn't seem to support it
yet.
### Motivation and Context
Fix some transformer issues
### Description
Force instance norm inputs to be 4D to better target metacommands
### Motivation and Context
This may improve performance on some hardware by allowing the driver to
return valid layouts to DML when querying for metacommand support.
### Description
Force layer norm inputs to be 4D to better target metacommands
### Motivation and Context
This may improve performance on some hardware by allowing the driver to
return valid layouts to DML when querying for metacommand support.
Implement CloudEP for hybrid inferencing.
The PR introduces no new APIs; customers can configure session and run
options to do inferencing with an Azure [triton
endpoint.](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-deploy-with-triton?tabs=azure-cli%2Cendpoint)
A sample configuration in Python looks like:
```python
sess_opt.add_session_config_entry('cloud.endpoint_type', 'triton')
sess_opt.add_session_config_entry('cloud.uri', 'https://cloud.com')
sess_opt.add_session_config_entry('cloud.model_name', 'detection2')
sess_opt.add_session_config_entry('cloud.model_version', '7')  # optional, default 1
sess_opt.add_session_config_entry('cloud.verbose', '1')  # optional, default '0', meaning no verbose
...
run_opt.add_run_config_entry('use_cloud', '1')  # 0 for local inferencing, 1 for cloud endpoint
run_opt.add_run_config_entry('cloud.auth_key', '...')
...
sess.run(None, {'input': input_}, run_opt)
```
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
### Description
It's from PR #14085.
When multiple msbuild processes are running, it throws the following
exception:
```
22-12-30T16:35:34.2423207Z ##[error]C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\MSBuild\Microsoft\VC\v160\Microsoft.CppCommon.targets(155,5): Error MSB3073: The command "setlocal
"C:\Program Files\CMake\bin\cmake.exe" -E copy D:/a/_work/1/b/RelWithDebInfo/dnnl/install/bin/dnnl.dll D:/a/_work/1/b/RelWithDebInfo/RelWithDebInfo
if %errorlevel% neq 0 goto :cmEnd
:cmEnd
endlocal & call :cmErrorLevel %errorlevel% & goto :cmDone
:cmErrorLevel
exit /b %1
:cmDone
if %errorlevel% neq 0 goto :VCEnd
:VCEnd" exited with code 1.
```
https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=847423&view=logs&j=249e9d58-0012-5814-27cf-6a201adbd9cf&t=182b9780-832e-5dcb-3957-d6aa3ece582f
The fix is to make sure that the onnxruntime_test_all project depends on
the dnnl project.