### Description
This fixes the macOS packaging build on the universal2 architecture.
### Motivation and Context
### Description
Re-enable the React Native e2e Android unit test in the React Native CI,
since the recent change to specify `default` instead of `google-apis` in
the Android emulator CI tests has given fairly stable results so far.
Upgrade the `targetSdkVersion` of the Gradle test project in
react-native/android to meet the minimum target API level requirement for
Google Play apps.
https://support.google.com/googleplay/android-developer/answer/11926878?hl=en
### Motivation and Context
React Native CI issue.
### Description
This PR speeds up Clip operations by replacing their sequential
implementation with a parallelized one. The parallelization is achieved
by dividing the input data into chunks of size N and using a thread pool
to process the chunks in parallel. The chunk size N is set to 16K based
on a performance evaluation on input tensors of 10^i elements for i in
[1 .. 6].
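
The chunking scheme described above can be sketched roughly as follows. This is only an illustration in Python, not the kernel from this PR: the names `clip_chunk`, `parallel_clip` and the `max_workers` parameter are assumptions made for the example, and because of the Python GIL it shows how the work is partitioned rather than the speedup itself.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 16 * 1024  # 16K elements per chunk, the value quoted above


def clip_chunk(data, out, start, end, lo, hi):
    """Clamp data[start:end] into [lo, hi] and write the result into out."""
    for i in range(start, end):
        out[i] = min(max(data[i], lo), hi)


def parallel_clip(data, lo, hi, chunk_size=CHUNK_SIZE, max_workers=4):
    """Clip every element of data to [lo, hi], one thread pool task per chunk."""
    out = [0] * len(data)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(clip_chunk, data, out, start,
                        min(start + chunk_size, len(data)), lo, hi)
            for start in range(0, len(data), chunk_size)
        ]
        for future in futures:
            future.result()  # re-raise any exception from a worker
    return out
```

Each task writes to a disjoint index range of `out`, so the chunks need no synchronization beyond waiting for all of them to finish.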
### Motivation and Context
The Clip operation is frequently executed in image processing models.
Its implementation can be easily parallelized and therefore sped up when
executed on a multi-core machine. On long inputs (>= 100K elements) this
PR achieves a speedup of over 2x. On shorter inputs, this PR does not
introduce any substantial performance change.
Add BiasSplitGelu/BiasAdd/GroupNorm/NhwcConv operators for ROCm EP.
1. BiasSplitGelu and BiasAdd operators can be automatically hipified
from CUDA EP.
2. GroupNorm was hipified from CUDA EP and modified to build.
3. NhwcConv is similar to NhwcConv in CUDA EP, but the MIOpen API and
the cuDNN API are different. `miopenConvolutionForwardBias` and
`miopenOpTensor` in MIOpen do not support the NHWC layout for now, so
BinaryElementwise is used in place of `miopenConvolutionForwardBias` for
the NHWC layout.
### Description
1. added a script for t5 encoder self attention and t5 decoder self/cross
attention fusions.
2. added simplified layernorm fusion for the --external_data_format
scenario (otherwise it relies on the ORT optimizer).
3. added rel_pos_bias shape inference code, modified the attention/mha
shape inference script.
4. reworked graph_topologic_sort() because the current implementation
is not functioning correctly. also added an option to topo-sort the
graph in a deterministic way to let tests pass.
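
As a rough illustration of what a deterministic topological sort can look like, the sketch below uses Kahn's algorithm with a min-heap, so that ties between ready nodes are always broken by the smallest node index. It is not the code of graph_topologic_sort(); the function name and the integer node/edge representation are assumptions made for the example.

```python
import heapq


def deterministic_topological_sort(num_nodes, edges):
    """Return a topological order of nodes 0..num_nodes-1.

    edges is an iterable of (src, dst) pairs. Among all nodes that are
    ready at the same time, the one with the smallest index goes first,
    so the result does not depend on the order in which edges are listed.
    """
    successors = [[] for _ in range(num_nodes)]
    in_degree = [0] * num_nodes
    for src, dst in edges:
        successors[src].append(dst)
        in_degree[dst] += 1

    # Min-heap of nodes whose inputs have all been emitted already.
    ready = [n for n in range(num_nodes) if in_degree[n] == 0]
    heapq.heapify(ready)

    order = []
    while ready:
        node = heapq.heappop(ready)
        order.append(node)
        for nxt in successors[node]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                heapq.heappush(ready, nxt)

    if len(order) != num_nodes:
        raise ValueError("graph contains a cycle")
    return order
```

Because the tie-break is fixed, two runs over the same graph yield the same node order, which is the property tests that compare serialized graphs tend to need.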
note:
1. the t5-beamsearch export code is slightly modified. specifically,
encoder_hidden_states (ehs) is no longer an input to the t5 decoder since
the ehs is not actually used in the graph execution.
2. recent PRs do not add optimizations to t5 on cpu.
3. the fp32 models (encoder and decoder) for t5-small, t5-base and
t5-large reach a parity on the order of 1e-5, and the corresponding beam
search models generate the same results as pytorch.
4. fp16 (mixed-precision) models, however, get a parity around 3e-2 and
some have a maximum diff a bit over 3e-2. But the beam search models
still generate the same results as pytorch (based on limited input data).
5. the mt-5 model has a parity issue at the moment, even before any
optimization. will investigate later.
### Motivation and Context
---------
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
### Description
While browsing the sources I found several typos here and there.
I collected them into a single PR and fixed them.
Namely, these typos are: operater, tranform, neccessary, trainig.
After the fix, none of them can be found anymore:
$ git grep "operater"
$ git grep "tranform"
$ git grep "neccessary"
$ git grep "trainig"
$
### Motivation and Context
Since some of the typos are in example notebooks and markdown files,
users can see them.
### Description
1. support optional bias in Attention op (used in T5 encoder)
2. support broadcasting rel_pos_bias in attention_softmax.h
3. add scale to the MHA op's attributes
4. support past_key/past_value and present_key/present_value in MHA
5. UT and parity tests are added
6. fix an issue: https://github.com/microsoft/onnxruntime/issues/14920
note: the fusions will be in another PR since mt5 needs to be tested and
an issue from github will be investigated.
Future work:
1. support shared buffer for past/present
2. enable trt kernels when possible and investigate (trt/cutlass) kernels
with rel_pos_bias
3. support KV/QKV packing with past/present
### Motivation and Context
---------
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
### Description
Enable the latest version of LeakyRelu, since it differs from the
previous version only in type support.
Refactor `fast_gelu_fusion` to enable the script, because our script is
unable to check if any of the optimizers are outdated and no longer in
effect.
### Motivation and Context
We do not want to lose performance.
The next step is to file improvement issues if any are required.
### Slice op upstream refactor
Refactoring work for https://github.com/microsoft/onnxruntime/pull/13672.
### Motivation and Context
There are similar optimization opportunities in upstreaming other
operators to reduce compute FLOPs, so this PR refactors the existing code
base to make it easier to support other ops.
The changes in this PR are mainly about renaming and moving.
- Move common logic (from compute_optimizer.h/cc) into
upstream_transformer_base.h/cc and shared_utils.h/cc.
- Common upstream logic is moved into
upstream_transformer_base.h/cc.
- Shared utilities are moved into shared_utils.h/cc.
- After the move, compute_optimizer.h/cc mainly contains the Gather
upstreaming implementation (inheriting from
upstream_transformer_base.h/cc). Ideally it should be renamed, but to
keep this review easier, I kept its name.
### Description
OpSchema::GetFunction() changed in ONNX to support
opset-version-dependent function-body. Update the call to GetFunction
appropriately.
### Motivation and Context
Motivated by https://github.com/microsoft/onnxruntime/issues/14810
---------
Signed-off-by: Ganesan Ramalingam <grama@microsoft.com>
### Description
BUG FIX: the if...else in telemetry-steps.yml does not really work. It
always says "Telemetry is disabled." even though the pipeline doesn't
have the pipeline variable.
### Motivation and Context
For example, I recently set up a new pipeline in
https://dev.azure.com/onnxruntime/onnxruntime/_build without setting the
ADO variable, but the PowerShell code still thinks that we have enabled
telemetry.
See:
https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=910107&view=results
The reason it didn't work is that when the pipeline variable
("TELEMETRYGUID") doesn't exist, the occurrence of "$(TELEMETRYGUID)" is
not replaced with anything. It remains as it is.
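
To make the failure mode concrete, here is a small sketch of a check that treats an unexpanded macro the same as a missing value. It is written in Python for illustration only; the actual fix lives in the PowerShell step of telemetry-steps.yml and is not reproduced here, and the function name `telemetry_guid_is_set` is a hypothetical helper.

```python
def telemetry_guid_is_set(raw_value, variable_name="TELEMETRYGUID"):
    """Return True only if the macro was expanded to a non-empty value.

    When the pipeline variable is not defined, the script receives the
    literal text "$(TELEMETRYGUID)", so a plain emptiness check is not
    enough: the unexpanded macro must be rejected as well.
    """
    unexpanded = "$(" + variable_name + ")"
    value = raw_value.strip()
    return bool(value) and value != unexpanded
```

A check that only tests for an empty string would accept the literal `$(TELEMETRYGUID)` text and take the wrong branch.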
### Description
QNN EP:
- Adds the
[InstanceNormalization](https://onnx.ai/onnx/operators/onnx__InstanceNormalization.html)
operator to QNN EP.
- Fixes a graph composition bug when a Transpose node is the last node in
a graph.
- Adds a check for input shape when GetCapability is called (before and
after layout transformation).
  - Should add similar checks for other layout-sensitive ops (conv, pool,
...) in a separate PR.
- Adds initial QNN op tests for QDQ conv and QDQ InstanceNormalization.
  - Should add tests for other ops in a separate PR.
Optimizer:
- Makes InstanceNormalization a layout sensitive operator.
- Adds a custom QDQ group selector for InstanceNormalization.
Quantization tool:
- Adds QDQ support for InstanceNormalization operator.
- Adds python unit test for InstanceNormalization quantization.
### Motivation and Context
Needed to support stable diffusion models with QNN.
---------
Co-authored-by: Hector Li <hecli@microsoft.com>
### Description
In transpose.cc:
Arithmetic overflow: Using operator '-' on a 4-byte value and then
casting the result to an 8-byte value. Cast the value to the wider type
before calling operator '-' to avoid overflow (io.2).
In cuda_provider_factory.h:
The type 'struct onnxruntime::CUDA_Provider' with a virtual function
needs either public virtual or protected non-virtual destructor (c.35).
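
The arithmetic behind the transpose.cc warning can be illustrated with a small simulation. Python integers do not overflow, so the sketch below models a 4-byte signed integer with an explicit wraparound helper; the helper names are assumptions made for this example, and in C++ signed overflow is undefined behavior, so the wraparound shown is only the typical outcome.

```python
def wrap_int32(value):
    """Reduce value to a signed 32-bit integer using two's-complement wraparound."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def subtract_then_widen(a, b):
    """Subtract in 4 bytes first (may wrap); the later widening cannot undo it."""
    return wrap_int32(wrap_int32(a) - wrap_int32(b))


def widen_then_subtract(a, b):
    """Widen both operands first, then subtract in the wider type (no wrap)."""
    return wrap_int32(a) - wrap_int32(b)


a, b = 2_000_000_000, -2_000_000_000
narrow_result = subtract_then_widen(a, b)  # wrapped value, not the true difference
wide_result = widen_then_subtract(a, b)    # true difference: 4_000_000_000
```

Both operands fit in 4 bytes, but their difference does not, which is why the cast has to happen before the subtraction rather than after it.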
### Description
Re-work handling of static objects in pybind.
Make sure we ref-count Environment from Sessions.
The following has been done:
- Make global objects function static. This ensures that the objects are
constructed on demand. The first object constructed is destructed last.
This is platform independent.
- Make global objects ownership shared as suggested by pybind since they
are not surfaced at Python level, and they cannot be referred to by
dependent python objects. Verified that all python objects are GCed
before globals are destroyed. This takes care of inference session
dependency on environment and its default logger and this is also
platform independent.
- Utilize the pybind atexit mechanism to clear execution providers and
unload CUDA libraries (as suggested by
https://github.com/microsoft/onnxruntime/pull/14903). Since this is
registered for module exit, it takes place before any other globals are
destroyed, and it clears shared objects' state or even unloads the
libraries. This should also work in a platform-independent way.
### Motivation and Context
- Global object destruction order is managed manually, and that becomes
a source of trouble. We want to make it deterministic and platform
independent.
- Frequent hangs in Python layer due to the static object's destruction
order. Some of the Python session objects are being garbage collected
after main exits and they require ORT environment to be alive. (Use
after free)
### Description
- Add QNN 2.8 SDK
- Make QNN SDK version a pipeline template parameter for QNN pipelines.
### Motivation and Context
Updates to the latest QNN SDK version and allows testing different QNN
SDK versions without modifying YAML files.