### Description
<!-- Describe your changes. -->
Re-enable the react native e2e android unit test for react native CI as
recent change of specifying `default` instead of `google-apis` in
android emulator CI tests gives pretty stable result for now.
Upgrade the targetSDKversion for gradle test project in
react-native/android to meet minimum target api level requirement for
Google Play apps.
https://support.google.com/googleplay/android-developer/answer/11926878?hl=en
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
React Native CI issue.
### Description
<!-- Describe your changes. -->
This PR speeds-up Clip operations by replacing their sequential
implementation with a parallelized one. The parallelization is achieved
by dividing the input data into chunks of size N and using a thread pool
to process the chunks in parallel. The chunk size N is set to 16K based
on performance evaluation on input tensors of 10^i elements for i in [1
.. 6].
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
The Clip operation is frequently executed in image processing models.
Its implementation can be easily parallelized and therefore sped up when
executed on a multi-core machine. On long inputs (>= 100K elements) this
PR achieves speedup of over 2x. On shorter inputs, this PR does not
introduce any substantial performance change.
Add BiasSplitGelu/BiasAdd/GroupNorm/NhwcConv operator for ROCm EP.
1. BiasSplitGelu and BiasAdd operators can be automatically hipified
from CUDA EP.
2. GroupNorm was hipified from CUDA EP and modified to build.
3. NhwcConv is similar to NhwcConv in CUDA EP, But the MIOpen API and
cuDnn API are different. `miopenConvolutionForwardbias` and
`miopenOpTensor` of MIOpen doesn't support NHWC layout now, use
BinaryElementwise to replace miopenConvolutionForwardbias(NHWC layout).
### Description
<!-- Describe your changes. -->
1. added script for t5 encoder self attention and t5 decoder self/cross
attention fusions.
2. added simplified layernorm fusion for --external_data_format senario.
(otherwise relying on ORT optimizer)
3. added rel_pos_bias shape inference code, modified attention/mha shape
inference script.
4. reworked graph_topologic_sort() because the currently implementation
is not functioning correctly. also added an option to topo-sort the
graph in a deterministic way to let tests pass.
note:
1. the t5-beamsearch export code is slightly modified. specifically,
encoder_hidden_states(ehs) is no longer an input to the t5 decoder since
the ehs is not actually used in the graph execution.
2. recent PRs do not add optimizations to t5 on cpu.
3. the fp32 model(encoder and decoder) for t5-small, t5-base and
t5-large can get a parity of e-5 and the corresponding beam search
models generate same results as pytorch.
4. fp16(mixed-precision) models, however, get a parity around 3e-2 and
some has maximum diff a bit over 3e-2. But the beam search models still
generate same results as pytorch (based on limited input data)
5. mt-5 model has a parity issue at the moment, even before any
optimization. will investigate later.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
### Description
While browsing the sources I found several typos here and there.
I collected them to a single PR and fixed them.
Namely these typos are: operater, tranform, neccessary, trainig.
After fixing none of them was found anymore:
$ git grep "operater"
$ git grep "tranform"
$ git grep "neccessary"
$ git grep "trainig"
$
### Motivation and Context
Since some of the typos are in example notebooks and markdown files,
users can see them.
### Description
<!-- Describe your changes. -->
1. support optional bias in Attention op (used in T5 encoder)
2. support broadcasting rel_pos_bias in attention_softmax.h
3. add scale in
MHA op's attributes
4. support past_key/past_value and present_key/present_value in MHA
5. UT and parity tests are added
6. fix an issue: https://github.com/microsoft/onnxruntime/issues/14920
note: the fusions will be in another PR since mt5 needs to be tested and
an issue from github will be investigated.
Future works:
1. support shared buffer for past/present
2. enable trt kernels when possible and investigate (trt/cutlass)kernels
with rel_pos_bias)
3. support KV/QKV packing with past/present
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
### Description
Enable LeakyRelu latest since the last version differs only in type
support.
Refactor `fast_gelu_fusion` to enable the script, because our script is
unable to
check if any of the optimizers are outdated and no longer in effect.
### Motivation and Context
We do not want to loose performance.
Next step is to file improvements issues if any are required.
### Slice op upstream refactor
A refactor work for https://github.com/microsoft/onnxruntime/pull/13672.
### Motivation and Context
There is a similar optimization opportunity for other operator
upstreaming, to reduce compute flops. So refactor the existing code base
for making it easier to support other ops.
The changes in this PR are mainly about renaming and moving.
- Move common logic (from compute_optimizer.h/cc) into
upstream_transformer_base.h/cc and shared_utils.h/cc.
- For upstream common logic, they are moved into
upstream_transformer_base.h/cc
- For shared utilities, they are moved to shared_utils.h/cc.
- After the move, compute_optimizer.h/cc mainly for upstreaming gather
implementation (inheriting upstream_transformer_base.h/cc). Ideally it
should be renamed, but for easier review this time, I keep its name.
### Description
OpSchema::GetFunction() changed in ONNX to support
opset-version-dependent function-body. Update the call to GetFunction
appropriately.
### Motivation and Context
Motivated by https://github.com/microsoft/onnxruntime/issues/14810
---------
Signed-off-by: Ganesan Ramalingam <grama@microsoft.com>
### Description
BUG FIX: the if...else in telemetry-steps.yml does not really work. It
always says "Telemetry is disabled." even through the pipeline doesn't
have the pipeline variable.
### Motivation and Context
For example, recently I setup a new pipeline in
https://dev.azure.com/onnxruntime/onnxruntime/_build without setting the
ADO variable, but the powershell code still thinks that we have enabled
telemetry.
See:
https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=910107&view=results
The reason it didn't work because when the pipeline
variable("TELEMETRYGUID") doesn't exist, the occurrence of
"$(TELEMETRYGUID)" would be not replace to anything. It will remain as
it is.
### Description
QNN EP:
- Adds the
[InstanceNormalization](https://onnx.ai/onnx/operators/onnx__InstanceNormalization.html)
operator to QNN EP.
- Fixes graph composition bug when Transpose node is the last node in a
graph.
- Adds check for input shape when GetCapability is called (before and
after layout transformation)
- Should add similar checks for other layout sensitive ops (conv, pool,
...) in a separate PR
- Adds initial QNN op tests for QDQ conv and QDQ InstanceNormalization
- Should add tests for other ops in a separate PR
Optimizer:
- Makes InstanceNormalization a layout sensitive operator.
- Adds a custom QDQ group selector for InstanceNormalization.
Quantization tool:
- Adds QDQ support for InstanceNormalization operator.
- Adds python unit test for InstanceNormalization quantization.
### Motivation and Context
Needed to support stable diffusion models with QNN.
---------
Co-authored-by: Hector Li <hecli@microsoft.com>
### Description
In transpose.cc:
Arithmetic overflow: Using operator '-' on a 4 byte value and then
casting the result to a 8 byte value. Cast the value to the wider type
before calling operator '-' to avoid overflow (io.2).
In cuda_provider_factory.h:
The type 'struct onnxruntime::CUDA_Provider' with a virtual function
needs either public virtual or protected non-virtual destructor (c.35).
### Description
Re-work handling of static objects in pybind.
Make sure we ref-count Environment from Sessions.
The following has been done:
- Make global objects function static. This ensures that the objects are
constructed on demand. The first object constructed is destructed last.
This is platform independent.
- Make global objects ownership shared as suggested by pybind since they
are not surfaced at Python level, and they cannot be referred to by
dependent python objects. Verified that all python objects are GCed
before globals are destroyed. This takes care of inference session
dependency on environment and its default logger and this is also
platform independent.
- Utilize pybind atexit mechanism to clear execution providers and
unload CUDA libraries (as suggested by
https://github.com/microsoft/onnxruntime/pull/14903) . Since this is
registered for module exit, it takes place before any other global are
destroyed and clears shared objects state or even unloads the libraries.
This should also work in a platform independent way.
### Motivation and Context
- Global object destruction order is managed manually and that becomes
source of trouble. We want to make it deterministic and platform
independent.
- Frequent hangs in Python layer due to the static object's destruction
order. Some of the Python session objects are being garbage collected
after main exits and they require ORT environment to be alive. (Use
after free)
### Description
- Add QNN 2.8 SDK
- Make QNN SDK version a pipeline template parameter for QNN pipelines.
### Motivation and Context
Updates to latest QNN SDK version, and allows testing different QNN SDK
versions without modifying yaml files.
- Use java/gradlew directly in .github/workflows/publish-java-apidocs.yml.
- Remove use of deleted step from tools/ci_build/github/azure-pipelines/android-arm64-v8a-QNN-crosscompile-ci-pipeline.yml.
- Remove Gradle installations and PATH updates from Dockerfiles and scripts. Now Gradle wrapper is used so a system Gradle installation is not needed.
### Description
This will enable a user to use a TensorRT timing cache based on #10297
to accelerate build times on a device with the same compute capability.
This will work across models as it simply store kernel runtimes for
specific configurations. Those files are usually very small (only a few
MB) which makes them very easy to ship with an application to accelerate
the build time on the user end.
### Motivation and Context
Especially for workstation use cases TRT build times can be a roadblock.
With a few model from ONNX model zoo i evaluated speedups when a timing
cache is present.
`./build/onnxruntime_perf_test -e tensorrt -I -t 5 -i
"trt_timing_cache_enable|true" <onnx_path>`
|Model | no Cache | with Cache|
| ------------- | ------------- | ------------- |
|efficientnet-lite4-11 | 34.6 s | 7.7 s|
|yolov4 | 108.62 s | 9.4 s|
To capture this is had to modify the onnxruntime_perf_test. The time is
sometimes not captured within "Session creation time cost:" which is why
i introduced "First inference time cost:".
---------
Co-authored-by: Chi Lo <Chi.Lo@microsoft.com>
Add intializers to model proto in sorted order.
### Motivation and Context
Onnxruntime OpenVino Execution Provider interacts with Openvino API by
passing onnx serialised model proto.
Current flow is that onnx serialised model proto will be passed into
Read_model() API of OpenVino that creates an OpenVino execution network
thats passed to compile_model() API.
As part of optimizations we have combined the API's (Read_model and
Compile_model) into single compile_model() API that directly accepts
serialized onnx model proto. A hash function will be computed on this
serialized input for internal Openvino optimizations. This requires the
model_proto to be deterministic during each inference requests.
With the current flow, the [initializers are added to
model_proto](c1ff4b468d/onnxruntime/core/graph/graph_proto_serializer.cc (L48))
from an [unordered_map data
structure](8ed3dfe063/onnxruntime/core/providers/shared_library/provider_interfaces.h (L93))
that brings in random ordering of these initializers for inference runs.
The proposed solution is to add these initializers by iterating through
a sorted[ vector consisting of the initializer
names](2c7146cef8/onnxruntime/core/graph/graph_proto_serializer.cc (L49)).
### Description
For rocBLAS extensions API:
* fix `alpha`/`beta` dtype mismatch in `rocblas_gemm_ex()`, which should
be the same as `compute_type`.
* add support for `BatchedGemm` and `StridedBatchedGemm` cases.
### Description
* add more configs for `threads_per_block` in SkipLayerNorm, also in
kernel explorer.
* loosen constraints for hidden_size, so that `SkipLayerNormSmallOp` can
be selected for larger hidden sizes.
* add flag for optional output in kernel_explorer
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Make patch manylinux one single step.
### Motivation and Context
If we want to use hash of docker-related files as the cache key, the
files should keep consistent before and after docker build.
And changes in generated build_scripts should trigger rebuilding the
image as well.
### Description
tensorboard depends on rsa>=3.1.4, while rsa 4.5 has vuln issue, so pin
it to higher version as suggested
Fixed
[AB#7352](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/7352)
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
This change delays the execution of checking whether bigint is available
in the context. This allows polyfill for
`BigInt64Array`/`BigUint64Array` (if there is any)
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
https://github.com/microsoft/onnxruntime/issues/14940
- Update Gradle version used in most places from 6.8.3 to 8.0.1. Update Android Gradle Plugin version where applicable.
Not updated in this change: React Native Android projects (under `js/react_native/`). That can be done later along with updating the React Native projects.
- Add Gradle wrapper in `java/` to make it easier to consistently use a specific Gradle version.
### Description
<!-- Describe your changes. -->
Support externally-managed output tensors (torch Tensors) for dort.
Add `preallocate_output` option to OrtBackend to rely on
externally-managed output tensors for dort.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
DORT currently allocates and returns output ortvalues and convert them
to torch Tensors. The conversion based on dlpack does not support torch
Tensors for custom Aten backends, and it is not yet possible to transfer
the ownership from ortvalue to external handle (torch Tensor).
To avoid this issue, the PR change provides an option
(`preallocate_output`) to allocate output tensors externally in pytorch,
which creates torch Tensor for an Aten backend, and let dort take
pointers from torch Tensors to construct output ortvalues instead of
allocating them inside InferenceSession.
### Description
Current pipeline refers to an old image which is causing test failures.
Updating the image to the latest one.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
Fixes pipeline failure:
https://dev.azure.com/onnxruntime/onnxruntime/_build?definitionId=198
- If it fixes an open issue, please link to the issue here. -->
### Fix simplified layer norm fusion for training
Co-author with @prathikr.
Fix bug identified by @prathikr.
https://github.com/microsoft/onnxruntime/issues/14822.
Running T5 model enabling deepspeed, we see simplified layer norm is not
fused because the device check did not pass
b7fde84341/onnxruntime/core/optimizer/layer_norm_fusion.cc (L568).
Since during pretraining optimization pass, there is no device
placement, so the device check not fulfilled is expected.
On the other hand, the device check is still valid to avoid simplified
layer norm fusion works correctly for CPU runs. As a mitigation, added a
flag to indicate whether the fusion is triggered by pre-training
optimization or not. There is a risk though, when we run ORTModule
training with CPU EP, but I feel the risk can be much reduced if we
check CUDA/ROCM is enabled for the build.
```
CUDA_VISIBLE_DEVICES=0 python examples/onnxruntime/training/summarization/run_summarization.py --model_name_or_path t5-small --do_train --dataset_name cnn_dailymail --dataset_config "3.0.0" --source_prefix "summarize: " --predict_with_generate --overwrite_output_dir --output_dir /bert_ort/pengwa/output --fp16 --max_steps 1 --logging_steps 1 --deepspeed aml_ds_config_zero_1.json
```
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Previous implementation did not support left or right node of a node to
have an index lower than the node itself. This condition would forbid
the tree to enter an infinite loop. Lightgbm does not follow that rule.
The changes do not change the algorithm but remove the test enforcing
that condition.
### Motivation and Context
It fixes a regression introduced by #14670.