- Update Gradle version used in most places from 6.8.3 to 8.0.1. Update Android Gradle Plugin version where applicable.
Not updated in this change: React Native Android projects (under `js/react_native/`). That can be done later along with updating the React Native projects.
- Add Gradle wrapper in `java/` to make it easier to consistently use a specific Gradle version.
### Description
Support externally-managed output tensors (torch Tensors) for dort.
Add `preallocate_output` option to OrtBackend to rely on
externally-managed output tensors for dort.
### Motivation and Context
DORT currently allocates and returns output ortvalues and converts them
to torch Tensors. The dlpack-based conversion does not support torch
Tensors for custom Aten backends, and it is not yet possible to transfer
ownership from an ortvalue to an external handle (torch Tensor).
To avoid this issue, this PR provides an option
(`preallocate_output`) to allocate output tensors externally in pytorch,
which creates torch Tensors for an Aten backend, and lets dort take
pointers from those torch Tensors to construct output ortvalues instead of
allocating them inside InferenceSession.
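The idea of externally-managed outputs can be illustrated with a minimal, library-free sketch. It does not use the torch or onnxruntime APIs; `run_into` and the doubling computation are hypothetical stand-ins for a backend that writes into caller-owned storage instead of allocating its own.

```python
from array import array


def run_into(inputs, out):
    """Hypothetical backend call: write results into the caller-owned buffer.

    The backend only borrows `out` (through a memoryview, standing in for a
    raw pointer); it never allocates or owns the output storage.
    """
    view = memoryview(out)
    for i, x in enumerate(inputs):
        view[i] = x * 2.0  # placeholder computation


# The caller (the framework side) allocates the output up front...
inputs = array("f", [1.0, 2.0, 3.0])
out = array("f", [0.0] * len(inputs))

# ...and the backend fills it in place, so no ownership transfer is needed.
run_into(inputs, out)
```

Because the caller keeps ownership of `out` throughout, no conversion step is needed after the call.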
### Description
Current pipeline refers to an old image which is causing test failures.
Updating the image to the latest one.
### Motivation and Context
Fixes pipeline failure:
https://dev.azure.com/onnxruntime/onnxruntime/_build?definitionId=198
### Fix simplified layer norm fusion for training
Co-authored with @prathikr.
Fix bug identified by @prathikr.
https://github.com/microsoft/onnxruntime/issues/14822.
When running the T5 model with deepspeed enabled, simplified layer norm is
not fused because the device check does not pass:
b7fde84341/onnxruntime/core/optimizer/layer_norm_fusion.cc (L568).
There is no device placement during the pre-training optimization pass,
so the failing device check is expected.
On the other hand, the device check is still needed so that simplified
layer norm fusion works correctly for CPU runs. As a mitigation, this
change adds a flag that indicates whether the fusion is triggered by the
pre-training optimization. A risk remains when ORTModule training runs
with the CPU EP, but I feel the risk can be much reduced if we
check that CUDA/ROCM is enabled for the build.
```
CUDA_VISIBLE_DEVICES=0 python examples/onnxruntime/training/summarization/run_summarization.py --model_name_or_path t5-small --do_train --dataset_name cnn_dailymail --dataset_config "3.0.0" --source_prefix "summarize: " --predict_with_generate --overwrite_output_dir --output_dir /bert_ort/pengwa/output --fp16 --max_steps 1 --logging_steps 1 --deepspeed aml_ds_config_zero_1.json
```
### Description
The previous implementation did not allow the left or right child of a
node to have an index lower than the node itself. That condition
prevented the tree traversal from entering an infinite loop, but LightGBM
does not follow that rule. The changes do not modify the algorithm; they
only remove the test enforcing that condition.
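A minimal sketch of why index ordering is not required for a correct traversal. This is an illustration only, not the TreeEnsemble kernel: the flat node layout, the `x <= threshold goes left` rule, and the step limit used as a loop guard are all assumptions made for the example.

```python
def predict(nodes, features, root=0):
    """Walk a tree stored as a flat list whose children are given by index."""
    index = root
    # A valid tree visits each node at most once on a root-to-leaf path,
    # so more than len(nodes) steps can only mean a cycle (assumed guard).
    for _ in range(len(nodes)):
        node = nodes[index]
        if "value" in node:  # leaf
            return node["value"]
        go_left = features[node["feature"]] <= node["threshold"]
        index = node["left"] if go_left else node["right"]
    raise ValueError("cycle detected in tree structure")


# Node 2 has a left child with a LOWER index (1), as LightGBM may produce.
nodes = [
    {"feature": 0, "threshold": 0.5, "left": 2, "right": 3},  # 0: root
    {"value": 10.0},                                          # 1: leaf
    {"feature": 1, "threshold": 1.5, "left": 1, "right": 4},  # 2: internal
    {"value": 30.0},                                          # 3: leaf
    {"value": 20.0},                                          # 4: leaf
]

low_index_leaf = predict(nodes, [0.2, 1.0])   # root -> 2 -> 1
high_index_leaf = predict(nodes, [0.2, 2.0])  # root -> 2 -> 4
```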
### Motivation and Context
It fixes a regression introduced by #14670.
### Description
TreeEnsemble* kernels fully copy all the parameters from the onnx
graph. Even if they are no longer needed or are unused (hitrates), they
remain in memory. For big models (>= 200 trees, max_depth > 10), the model
usually weighs more than 10 MB. This change gives a kernel the
possibility to remove all unneeded attributes after they have been used to
create the session. Attributes are deleted after the model has possibly
been saved, at the end of session creation.
The current design is open to debate:
* it stores the list of removable attributes in class
`onnxruntime::Node`,
* the node is marked as `const` every time this implementation needs to
register the name of a removable attribute or to remove them.
The current implementation is just a POC, as it needs to cast
`const onnxruntime::Node*` into `onnxruntime::Node*`.
Should we keep the list of removable attributes in `onnxruntime::Node`?
### Motivation and Context
Motivation is mostly to reduce memory consumption.
---------
Signed-off-by: xadupre <xadupre@microsoft.com>
### Description
Split up the ORT build step in the Linux QNN CI Pipeline.
### Motivation and Context
Build errors were not being immediately reported at the end of the build
step. The build step currently concatenates multiple shell commands, and
the return code for the last (mkdir) was being reported. This PR ensures
that the return code of the `python build.py ...` command is reported
for the build step.
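The reporting problem can be reproduced with a small sketch, assuming a POSIX `sh` is available. Here `false` is a stand-in for a failing `python build.py ...` and `true` is a stand-in for the trailing `mkdir`; neither is the actual pipeline script.

```python
import subprocess


def exit_code(script):
    """Run a shell snippet and return the exit code the step would report."""
    return subprocess.run(["sh", "-c", script]).returncode


# Commands joined with ';': the shell reports the status of the LAST command,
# so the earlier failure is hidden.
chained = exit_code("false; true")

# Running the failing command as its own step surfaces its exit code.
build_only = exit_code("false")
```

With the concatenated form the step reports success even though the first command failed, which is why splitting the build into its own step makes the failure visible.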
### Description
Reduce the CUDA package's size a little bit. 37 (compute capability 3.7)
is for the Tesla K80. Azure's NC-series uses it, but in most cases CUDA
can dynamically generate device code.
Extract QKV projection and attention computation into pipelines (composed of gemms and kernel launches).
This will allow us to introduce ck flash attention in the next PR.
### Description
1. Remove Python 3.7 from the python packaging pipeline. It is planned
for the next release and approved by the PMs. Also we will add 3.11, but
it will be addressed in another PR.
2. Stop generating python packages based on Ubuntu 18.04 which will
reach EOL next month. We will either replace them with Ubuntu 20.04 or a
CentOS 8 variant.
### Description
Consume ONNX 1.13.1 in ONNX Runtime. (ONNX 1.13.0 to ONNX 1.13.1)
### Motivation and Context
ONNX 1.13.1 patch was just released yesterday. This PR is making ORT's
ONNX submodule consistent with the latest released ONNX. Not sure
whether this PR is really needed, but let me make it ready. Previous PR
for testing ONNX 1.13.1rc2:
https://github.com/microsoft/onnxruntime/pull/14634.
Fixed
[AB#13174](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/13174).
### Description
Fix the function inliner logic for renaming variables. Typically, a
FunctionProto does not contain references to outer-scope names. However,
special cases, such as the function-expansion of SequenceMap, can
generate such FunctionProtos. Extend the renaming logic to ensure that
references to outer-scope names are not renamed.
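The renaming rule can be sketched as follows. This is a simplified illustration, not the ORT inliner: the tuple-based node representation, the `inline_rename` helper, and the `f0_` prefix are hypothetical, and nested subgraphs are ignored.

```python
def inline_rename(body, formal_in, formal_out, actual_in, actual_out, prefix):
    """Rename the names in a function body for inlining at a call site.

    body: list of (op_type, input_names, output_names) tuples.
    Only names the function itself defines (formal inputs/outputs and
    intermediates) are renamed; anything else is treated as an outer-scope
    reference and left untouched.
    """
    rename = dict(zip(formal_in, actual_in))
    rename.update(zip(formal_out, actual_out))
    # Intermediates are the outputs of body nodes that are not formal outputs.
    for _, _, outs in body:
        for name in outs:
            rename.setdefault(name, prefix + name)
    return [
        (op, [rename.get(n, n) for n in ins], [rename[n] for n in outs])
        for op, ins, outs in body
    ]


# "scale" is not defined inside the function: it is an outer-scope name.
body = [("Mul", ["x", "scale"], ["t"]), ("Add", ["t", "x"], ["y"])]
inlined = inline_rename(body, ["x"], ["y"], ["a"], ["b"], "f0_")
```

In this sketch `x` and `y` are bound to the call-site names, `t` gets a unique prefix, and `scale` keeps its original name.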
### Motivation and Context
Fixes https://github.com/onnx/onnx/issues/4892
Signed-off-by: Ganesan Ramalingam <grama@microsoft.com>
### Description
Call Qnn deviceCreate during backend setup and call deviceFree during
shutdown
### Motivation and Context
Align with the formal Qnn setup and shutdown procedure.
### Description
Do not create Barrier and triggerDownstream steps during execution plan
creation if the corresponding nodes are split by yield Op in training
scenario.
### Motivation and Context
In the training scenario, the forward and backward passes run two
different partial ranges of the graph's nodes. If two nodes sit in
different partial graphs and in separate streams, triggerDownstream/barrier
steps are still created between them. These behave quite differently from
the inference case, because one of the steps is never executed when it
falls outside the range being run. To make this work, there is a hacky way
to trigger the barrier step explicitly for training.
This PR adds a check and does not create the Barrier and
triggerDownstream steps if the corresponding nodes are split by the yield
Op in the training scenario, so the hack is no longer needed.
### Description
Changes to support standalone custom ops in a minimal build. Also
incorporates changes from #14492 (needed to test builds prior to that
being checked in).
We first need to save the schema info from the operators used by the
standalone op invoker in the ORT format model. Add mechanism for that.
Merge the kernel lookup logic so the same is used in full and minimal
build. NOTE: the version matching is now consistent with all other
kernel lookups, and the call to CreateOp MUST use the exact version for
the operator. Previously matching wasn't as strict, but this can lead to
the incorrect kernel being chosen.
Add tests.
NOTE: There is currently no way to detect the ops/types/opsets used
inside these custom ops as they don't exist until we create kernels,
which is after model loading completes (which is the point the ORT
format model is saved). Due to that they have to be manually added to
the configuration used to do the reduced ops build. That shouldn't be
too hard for the custom op author to add given the custom op
implementation is specifying the op, opset and type constraints (i.e.
they have the info and it's just a case of capturing/formatting it
correctly).
### Motivation and Context
Enable usage of the standalone op invoker by custom ops in a minimal
build.
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
A follow-up change for
https://github.com/microsoft/onnxruntime/pull/13616.
SoftmaxCrossEntropyLossInternal/SoftmaxCrossEntropyLossInternalGrad
now support different types for input and output.
Add support in SCELoss (SCELossGrad) for half (float) input with float (half) output.
### Test Note
#### Add tests for variant input and output types
To add such tests, the existing test code for sce loss and
scelossinternal gradient had to be refactored.
Originally:
FP32 input and output: the CPU kernels run first as the baseline,
CUDA/ROCM then runs with the same data, and CompareOpTester is used to
compare with the CPU run results.
FP16 input and output: the CPU kernels (which have no half kernels) run
as Cast_to_float->CPU kernel->cast_to_half for the baseline, CUDA/ROCM
then runs with the same data but using the Half implementation, and
CompareOpTester is used to compare with the CPU run results.
Now, we want to support running with different input and output types.
The proposed change is to always run the CPU kernels with float input and
output as the baseline (because the CPU only has float kernel
implementations); this step is the very first thing in every test.
Then, we run the CUDA/ROCM kernels using half_input_half_output,
float_input_float_output, half_input_float_output, and
float_input_half_output if a corresponding kernel is registered.
Afterwards, we compare the CUDA/ROCM run results with the CPU float baselines.
One thing deserves a special note:
CompareOpTester's result comparison can be looser than OpTester's.
Roughly speaking, the former tolerates diff <= atol +
rtol*expected_value, while the latter tolerates diff < atol && diff <
rtol*expected_value. When the expected value is very small, as in many
of our test cases, the former can pass while the latter
fails. So the refactoring also moves the check outside of OpTester and
explicitly checks the values the way CompareOpTester did (to align with
the previous behaviour).
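The difference between the two tolerance rules can be seen in a small sketch. The function names are hypothetical and the formulas follow the rough description above rather than the actual tester code; taking the absolute value of the expected value is an added assumption.

```python
def combined_check(actual, expected, atol, rtol):
    """CompareOpTester-style rule: one combined tolerance budget."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


def separate_check(actual, expected, atol, rtol):
    """OpTester-style rule: both tolerances must hold independently."""
    diff = abs(actual - expected)
    return diff < atol and diff < rtol * abs(expected)


# A very small expected value makes rtol * expected negligible.
expected, actual = 1e-6, 2e-6
atol, rtol = 1e-5, 1e-3

passes_combined = combined_check(actual, expected, atol, rtol)
passes_separate = separate_check(actual, expected, atol, rtol)
```

Here the difference (about 1e-6) fits within `atol`, so the combined rule passes, but it is far larger than `rtol * expected` (1e-9), so the separate rule fails.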
### Description
Addresses two separate SDL warnings, neither of which points to a cause
for concern:
1. `The expression '0<=_Param_(1)&&_Param_(1)<=3-1' is not true at this
call.
at
D:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\Operators\\DmlSTFT.h@443,33`.
In other words, the tool thinks one of the calls to
`barriers[barrierCount++]` will be an index-out-of-range issue, even
though this is not currently possible. Switching to a normal C array
avoids this complaint.
2. `'_Param_(1)' could be '0': this does not adhere to the specification
for the function 'CD3DX12_RESOURCE_BARRIER::UAV'`. The d3dx12 helper for
UAV barriers has the wrong SAL annotation and doesn't allow a null
resource (`_In_`), even though a null resource is legal and well
defined. Updated the annotation to `_In_opt_` and created a PR upstream.
### Motivation and Context
Pacify SDL tasks in CI pipelines.
### Description
1. Add a build flag for the rocblas tuning feature.
2. Fix a build bug that occurs when rocblas tuning is enabled.
### Motivation and Context
The rocblas tuning feature has no build flag to control it, only a
macro.
So I added a build flag and fixed a code bug that occurs when rocblas tuning is enabled.
### Description
Disable the multi-thread test on Node.js in the E2E test.
The multi-thread test on Node.js in the E2E test never worked; however,
the CI does not pick up the error every time, so it became a flaky test
case that sometimes causes a build break.
Disable this test for now and re-enable it once it is fixed.
### Description
Make the GPU job depend on all CPU jobs.
### Motivation and Context
GPU resources are very limited in the packaging pipeline,
and the GPU test job is very time consuming.
If even one CPU job fails, the workflow fails, so running the GPU job is
pointless.
To use GPU resources more efficiently, run the GPU job only after all
CPU jobs succeed.
### Test pipeline
https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=280905&view=results
### Description
Relax the half precision gemm test requirements.
### Motivation and Context
Most CPUs do not support mixed precision accumulation, only fused
multiply-add. As a result, different striding on the K dimension may lead
to different rounding errors, and the accumulation of these rounding
errors may be very significant. So setting an approximation ratio does NOT
always work. What's more, a relaxed test condition may hide real
implementation problems. So this is only a compromise fix.
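The effect of K striding on fp16 accumulation can be simulated with a small sketch. This is an illustration under simplifying assumptions, not the kernel under test: `dot_fp16` is a hypothetical helper that rounds every product and every partial sum to half precision using the standard library's `struct` half-float format.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def dot_fp16(a, b, k_stride):
    """Dot product accumulated in fp16, in blocks of k_stride along K."""
    acc = 0.0
    for start in range(0, len(a), k_stride):
        partial = 0.0
        for i in range(start, min(start + k_stride, len(a))):
            partial = to_fp16(partial + to_fp16(a[i] * b[i]))
        acc = to_fp16(acc + partial)
    return acc


K = 4096
a = [1.0] * K
b = [1.0] * K

stride_1 = dot_fp16(a, b, 1)    # stalls at 2048.0: 2048 + 1 rounds back to 2048
stride_16 = dot_fp16(a, b, 16)  # 4096.0: every partial sum is exactly representable
```

The same inputs give 2048.0 with a stride of 1 and the exact 4096.0 with a stride of 16, a relative difference of 50%, which is why a fixed approximation ratio cannot cover every striding choice.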
More rigorous tests require manual efforts:
1. Change the K stride of the kernel under test to be 16.
2. Force the K stride of the fp16 kernel to 16
3. Change the test oracle to be exact match.
4. Pass this test and then change it back :-(.
Co-authored-by: Chen Fu <fuchen@microsoft.com>