### Description
<!-- Describe your changes. -->
1. Build ROCm CI with Release config to save time.
2. Use 32 threads to build; the new CI machine has 256 threads.
3. Enable the ROCm kernel explorer test.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
Patch Protobuf and ONNX's cmake files and enforce BinSkim check.
This PR overlaps with #13523. I would prefer to get this one merged
first so that we can finish the BinSkim work, and I have tried to make this
PR as small as possible.
`aten::_to_copy` is not exportable to ONNX, so in DORT it's replaced in
`_replace_to_copy_with_to`. This replacement logic became incorrect with the latest PyTorch
commit, and this PR fixes it.
Basically, we examine more key-word attributes passed to
`aten::_to_copy` and if they lead to a type casting operator (i.e.,
mapped to ONNX's Cast), we replace that `aten::_to_copy` with
`aten::to`. Unsupported attributes are removed (with a low risk of
breaking FX graph's assumptions).
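
The attribute filtering described above can be sketched in plain Python, without importing PyTorch. This is only an illustration under stated assumptions: the helper name `rewrite_to_copy_kwargs` and the choice of `dtype` as the only attribute that maps to ONNX's Cast are hypothetical and are not taken from the DORT source.

```python
# Hypothetical sketch: decide how an `aten::_to_copy` call could be rewritten.
# Assumption: only `dtype` leads to a type cast (ONNX Cast); every other
# keyword attribute is treated as unsupported and dropped.
CAST_KWARGS = {"dtype"}


def rewrite_to_copy_kwargs(kwargs):
    """Return (new_target, new_kwargs) for a node calling `aten::_to_copy`.

    If a cast-related keyword is present, the node is retargeted to
    `aten::to` and only the cast-related keywords are kept. Otherwise the
    node is left unchanged.
    """
    kept = {k: v for k, v in kwargs.items() if k in CAST_KWARGS and v is not None}
    if kept:
        return "aten::to", kept
    return "aten::_to_copy", dict(kwargs)


target, new_kwargs = rewrite_to_copy_kwargs(
    {"dtype": "float16", "layout": None, "pin_memory": False}
)
print(target, new_kwargs)
```

In a real FX pass the same decision would be applied to each graph node's target and keyword arguments; the dictionary here only stands in for those arguments.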
### Description
After this change, GSL.natvis and wil.natvis files are
added to every onnxruntime_xxx project.

This is because in onnxruntime_common.cmake we have:
```cmake
if (MSVC)
  set(ABSEIL_NATVIS_FILE "abseil-cpp.natvis")
  target_sources(
      onnxruntime_common
      INTERFACE $<BUILD_INTERFACE:${PROJECT_SOURCE_DIR}/external/${ABSEIL_NATVIS_FILE}>)
endif()
```
It sets a property, INTERFACE_SOURCES, on the target
"onnxruntime_common".
Then if anyone else uses:
```cmake
target_link_libraries(mytarget PRIVATE onnxruntime_common)
```
The natvis file will be added to `mytarget`.
However, in this project we don't use `target_link_libraries` this way for targets that
are static libraries. For example, onnxruntime_graph is a static
library.
Instead, we use the `onnxruntime_add_include_to_target` function to
explicitly control what we want to propagate. The function was written
before we started to have natvis files, so it doesn't pass a source
file from one static library to another. Now we need that, probably
only on Windows.
### Motivation and Context
Add natvis files to every project.
### Description
### Motivation and Context
fix https://github.com/microsoft/onnxruntime/issues/13508
### Description
Update protobuf-java to version 3.21.7. This change only impacts tests.
### Motivation and Context
The current version is affected by CVE-2022-3509.
### Description
Add ROCm 5.3.2 to the Python package pipeline.
We build the rocm/dev-centos-7:x.x.x stage ourselves to avoid depending
on AMD's release.
### Motivation and Context
Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
### Description
SkipLayerNorm performance improvement when bias is present as input
### Motivation and Context
- For the SkipLayerNorm op, adding the bias tensor as a post-op on the add
primitive that sums the input and skip tensors causes a drastic performance
degradation.
- Hence the post-op is removed and, instead, two add primitives are used
in series: the first adds input and skip, and the second adds bias to the
result.
- This change has shown a significant performance gain for the
SkipLayerNorm operator.
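
The computation order described above can be illustrated with a small reference sketch in plain Python. It only models the arithmetic (two adds in series followed by layer normalization) and says nothing about the execution provider's primitives; the function names and the default `epsilon` are assumptions made for the example.

```python
import math


def add(a, b):
    """Element-wise add of two equally sized vectors (stands in for one add primitive)."""
    return [x + y for x, y in zip(a, b)]


def skip_layer_norm(inp, skip, bias, gamma, beta, epsilon=1e-5):
    """Reference SkipLayerNorm over a single hidden vector."""
    summed = add(inp, skip)     # first add: input + skip
    summed = add(summed, bias)  # second add: (input + skip) + bias
    mean = sum(summed) / len(summed)
    var = sum((x - mean) ** 2 for x in summed) / len(summed)
    inv_std = 1.0 / math.sqrt(var + epsilon)
    return [(x - mean) * inv_std * g + b for x, g, b in zip(summed, gamma, beta)]


print(skip_layer_norm([1.0, 2.0], [0.5, 0.5], [0.1, 0.1], [1.0, 1.0], [0.0, 0.0]))
```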
### Description
The default Python on Mac was upgraded to 3.11, but 3.11 isn't
supported yet, so use Python 3.8 instead.
### Motivation and Context
Fix MacOS CI in Zip-Nuget-Java-Nodejs Packaging Pipeline
### Test Run
https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=249020&view=logs&j=ded01483-6627-58ac-64dc-d4a232827e5d
### Description
Fix round 6
### Motivation and Context
### Description
- Add missing uint8 typedArray case
- Add createInputTensor_uint8 unit test in TensorHelperTest.java file
### Motivation and Context
Detected an `InferenceSession.run()` error when running a React Native app
with a Uint8Array input ORT tensor. This adds the missing support to fix it.
### Description
[ONNX
Squeeze-13](https://github.com/onnx/onnx/blob/main/docs/Operators.md#Squeeze)
treats empty `axes` as if all axes had been given. This works for
[earlier Squeeze
versions](https://github.com/microsoft/onnxruntime/pull/12649), but
Squeeze-13 takes `axes` as a dynamic input tensor, which means it
needs to be checked for existence before it is accessed.
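
The "check for existence before accessing" rule can be sketched as a shape-only model of Squeeze in Python. This is an illustration, not the DML EP code: `squeeze_shape` is a hypothetical helper, and `axes=None` stands in for an absent optional input.

```python
def squeeze_shape(shape, axes=None):
    """Return the output shape of Squeeze.

    `axes` models the optional input: None or an empty list means the input
    is absent or empty, so every dimension of size 1 is removed.
    """
    rank = len(shape)
    if not axes:  # check for existence before reading any axis value
        return [d for d in shape if d != 1]
    normalized = {a + rank if a < 0 else a for a in axes}
    for a in normalized:
        if shape[a] != 1:
            raise ValueError(f"cannot squeeze axis {a} with size {shape[a]}")
    return [d for i, d in enumerate(shape) if i not in normalized]


print(squeeze_shape([1, 3, 1, 5]))
print(squeeze_shape([1, 3, 1, 5], [0]))
```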
### Motivation and Context
- *Why is this change required? What problem does it solve?* Fixes a
customer model. Makes ORT DML EP consistent with spec.
### Description
Round 5 of the fixes. There are 192 to go.
### Motivation and Context
### Description
Improve the profile explorer by enabling shape sensitivity for GPU
kernels.
### Motivation and Context
Due to problems with the ROCM profiler, it was previously challenging to
retrieve the shapes corresponding to a GPU kernel event. [PR
13546](https://github.com/microsoft/onnxruntime/pull/13549) addresses
these problems, so it's now possible to retrieve shapes from the ORT
ROCM/CUDA profilers. This PR leverages [PR
13546](https://github.com/microsoft/onnxruntime/pull/13549) to enable
shape-sensitive GPU kernel ranking.
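
As a rough sketch of what shape-sensitive ranking means, the Python below groups kernel events by kernel name and input shapes and ranks the groups by total duration. The event layout (`name`, `shapes`, `dur_us`) is a hypothetical format chosen for the example; it is not the ORT profiler's actual output schema.

```python
from collections import defaultdict


def rank_kernels(events):
    """Rank GPU kernels by total duration, keyed by (kernel name, input shapes)."""
    totals = defaultdict(int)
    for event in events:
        # Shapes are converted to nested tuples so they can be used as dict keys.
        shapes = tuple(tuple(s) for s in event["shapes"])
        totals[(event["name"], shapes)] += event["dur_us"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


events = [
    {"name": "gemm", "shapes": [[64, 128], [128, 256]], "dur_us": 40},
    {"name": "gemm", "shapes": [[8, 128], [128, 256]], "dur_us": 5},
    {"name": "gemm", "shapes": [[64, 128], [128, 256]], "dur_us": 42},
]
for (name, shapes), total in rank_kernels(events):
    print(name, shapes, total)
```

Keying on the shapes as well as the name keeps the same kernel with different input sizes in separate rows, so a kernel that is slow only for one shape stands out.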
Co-authored-by: Abhishek Udupa <abhishek.udupa@microsoft.com>
### Description
Ignore the dirty state of the XNNPACK submodule.
### Motivation and Context
The ONNX Runtime WebAssembly build applies a patch to XNNPACK, so the
submodule is reported as 'dirty'. We want to ignore this when
checking the workspace using `git status`.
### Description
In some cases, we can't get a node's shape for pre-processing.
### Motivation and Context
### Description
This ensures that the graph is re-resolved after a free dimension shape
is overridden according to session options.
### Motivation and Context
This ensures that shape inference occurs, which is necessary to apply
the optimization and ensure the session is compatible with bound
shapes. This bug seems to have affected only a small fraction of models.
### Description
Update pylint config to include valid short names
Also disabled `too-many-arguments` and `too-many-locals`
### Motivation and Context
Refine config to reduce lint noise
### Description
Round 4. There are 436 more to go.
### Motivation and Context
Fix React Native CI build.
Recently the build started picking up a more recent version of React Native that was published to Maven Central.
More details here: https://github.com/facebook/react-native/issues/35210
### Description
The existing ROCM profiler has a few shortcomings, which this PR fixes.
### Motivation and Context
The existing ROCM profiler:
1. Is not thread-safe
2. Is not session-aware: i.e., if multiple inference sessions enable
profiling, then events (especially GPU events) get mixed up between the
sessions
3. Has some issues with respect to coding standards.
This PR addresses all of the above by cleanly re-implementing parts of
the ROCM profiler as required.
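
The two properties listed above, thread safety and session awareness, can be illustrated with a minimal Python sketch. It is not the ROCm profiler's implementation: `SessionAwareEventStore` and its methods are hypothetical names, and a single lock around a per-session dictionary is just one simple way to obtain both properties.

```python
import threading
from collections import defaultdict


class SessionAwareEventStore:
    """Hypothetical store that keeps profiling events separate per session."""

    def __init__(self):
        self._lock = threading.Lock()
        self._events = defaultdict(list)  # session id -> list of events

    def record(self, session_id, event):
        # The lock makes concurrent record() calls from several threads safe.
        with self._lock:
            self._events[session_id].append(event)

    def drain(self, session_id):
        # Return and remove only the events that belong to this session.
        with self._lock:
            return self._events.pop(session_id, [])


store = SessionAwareEventStore()
store.record("session_a", {"kernel": "gemm"})
store.record("session_b", {"kernel": "softmax"})
print(store.drain("session_a"))
```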
Attached are 4 profile outputs from a multi-session run of the
StableDiffusion model, as well as a quick-and-dirty script that checks
the profile outputs for the invariants claimed.
[sd_profile_outputs.tar.gz](https://github.com/microsoft/onnxruntime/files/9924608/sd_profile_outputs.tar.gz)
[check_profile_output_wellformedness.zip](https://github.com/microsoft/onnxruntime/files/9924614/check_profile_output_wellformedness.zip)
Co-authored-by: Abhishek Udupa <abhishek.udupa@microsoft.com>
I built a new test infra for the CUDA EP in #13016 but forgot to add the
test to onnxruntime_test_all. Here is the missing file. Now the
`TestAll` function is actually called in CI.
### Description
Revert DML's CPU fallback logic from
https://github.com/microsoft/onnxruntime/pull/13442.
### Motivation and Context
Although the logic works well for many models that have good DML
coverage, it makes performance worse for some models where many operators are
missing DML coverage (e.g. int64). Overall, the right fix seems to
be to implement the operator on DML instead, even though it almost always falls
back to the CPU, just for the sake of having a registration.
### Description
Fix round 4. Still have about 632 to go.
### Motivation and Context
### Description
Introduce Gemm weights pre-pack.
### Motivation and Context
A first-party (1P) customer requested a performance improvement for DeepGru, which
consumes the bulk of the CPU time in their model. This change provides a measurable
performance improvement.
Customer model numbers:
| Build | Mean | Percentile at 1 ms | 99th percentile |
|-------|------|--------------------|-----------------|
| gru (yuslepukhin/deep_gru_opt) | 356 us | 99.8 | 665 ms |
| main (where yuslepukhin/deep_gru_opt branched off main) | 375 us | 99.8 | 695 ms |
| 1.13.1 | 391 us | 99.6 | 744 ms |

On AMD Instinct MI200 GPUs, the FP16 and BF16 V_DOT2 and MFMA matrix
instructions flush input and output denormal values to zero. When
training using FP16 precision, some models may fail to converge with
FP16 denorms flushed to zero. The affected instructions are only used by
rocBLAS (GEMM) and MIOpen (convolution) kernels; all other onnxruntime
operations will not encounter this behavior. All other supported AMD
GPUs will not encounter this behavior.
rocBLAS and MIOpen provide alternate implementations for affected FP16
operations. Alternate implementations for BF16 operations are not
provided; BF16 numbers have a larger dynamic range than FP16 numbers and
are less likely to encounter denormal values. For the FP16 alternate
implementations, FP16 input values are cast to an intermediate BF16
value and then cast back to FP16 output after the FP32 accumulate
operations. In this way, the input and output types are unchanged.
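
The dynamic-range argument can be made concrete with a small Python sketch. It models only the exponent range of the two formats (mantissa rounding is ignored), and the helper names and the sample value are chosen for illustration.

```python
# Smallest positive normal values of the two 16-bit formats.
FP16_MIN_NORMAL = 2.0 ** -14   # IEEE 754 binary16: 5 exponent bits
BF16_MIN_NORMAL = 2.0 ** -126  # bfloat16: 8 exponent bits, same as FP32


def is_denormal(value, min_normal):
    """True if `value` is non-zero but smaller in magnitude than `min_normal`."""
    return value != 0.0 and abs(value) < min_normal


def flush_denormal(value, min_normal):
    """Model flush-to-zero: denormal values become 0.0, others pass through."""
    return 0.0 if is_denormal(value, min_normal) else value


gradient = 2.0 ** -20  # illustrative small value, below FP16's normal range
print(flush_denormal(gradient, FP16_MIN_NORMAL))  # treated as denormal in FP16
print(flush_denormal(gradient, BF16_MIN_NORMAL))  # within BF16's normal range
```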
Denormal values more frequently occur in the backward pass of training
during gradient calculation. Therefore, it is necessary to track when
the backward pass of training is executing. For the ROCm EP only, the
`__backwardpass` attribute is added to all Nodes after the YieldOp is
detected. This takes place in a level1 graph optimization pass. The
attribute is forwarded to any newly created FusedMatMul Nodes. In
addition, the scope-based helper class `BackwardPassGuard` is provided
to toggle state for rocBLAS. Using the alternate
implementations during the backward pass is made automatic with this PR.
This default behavior can be overridden using the environment variables
`ROCBLAS_INTERNAL_FP16_ALT_IMPL` and
`MIOPEN_DEBUG_CONVOLUTION_ATTRIB_FP16_ALT_IMPL`. The behavior of these
environment variables is as follows:
| | forward | backward |
|--------------|-----------|-----------|
| Env unset | original | alternate |
| Env set to 1 | alternate | alternate |
| Env set to 0 | original | original |
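
The table above can be read as a small decision function. The Python sketch below encodes exactly those six cells; `uses_alternate_impl` is a hypothetical name, and values other than unset, `"1"`, and `"0"` are rejected because the table does not cover them.

```python
def uses_alternate_impl(env_value, is_backward):
    """Return True if the alternate FP16 implementation is selected.

    `env_value` is the environment variable's value, or None if it is unset.
    """
    if env_value is None:
        return is_backward  # unset: alternate only in the backward pass
    if env_value == "1":
        return True         # forced on for both passes
    if env_value == "0":
        return False        # forced off for both passes
    raise ValueError(f"value not covered by the table: {env_value!r}")


for env in (None, "1", "0"):
    print(env, uses_alternate_impl(env, False), uses_alternate_impl(env, True))
```

In practice the first argument would come from something like `os.environ.get("ROCBLAS_INTERNAL_FP16_ALT_IMPL")`, which returns `None` when the variable is unset.
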
See also:
https://pytorch.org/docs/stable/notes/numerical_accuracy.html#reduced-precision-fp16-and-bf16-gemms-and-convolutions-on-amd-instinct-mi200-devices
### Description
Redo the round using `gsl::narrow` and `SafeInt`.
### Motivation and Context
### Description
Add a DML registration for Shape to avoid copying back to the CPU just
to get the shape of a GPU tensor.
### Motivation and Context
When using free dimensions, many Transformers models extensively use the
`Shape` operator. This causes hundreds of GPU->CPU copies that should be
completely avoidable. Note that this change also uses the same
heuristics as other providers (e.g. CUDA) to force some tensors on the
CPU in certain situations.
Co-authored-by: Patrice Vignola <pavignol@microsoft.com>
### Description
Properly cleans up all temporary resources created while running
benchmarks.
Details:
- Dump all temporary artifacts (TRT engines, TRT profiles, inference
profiles, fp16 models) into a temp directory in `/tmp/`. Each model/EP
combination has its own temp directory that is deleted after validation
and benchmarking.
- Allow running both validation and benchmarking in one invocation of
the benchmark.py script. This is necessary to allow the benchmarking
step to reuse artifacts (e.g., TRT engines) created during validation.
Before this PR, we ran validation on all model/EP combinations before
running benchmarks on all combinations again. This required us to keep
all temporary artifacts for all model/EP combinations throughout the
entire run (expensive).
- Create individual functions for validation and benchmarking (splitting up
the large function that did both).
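
The per-combination cleanup described in the list above can be sketched with Python's `tempfile` module. This is an assumption-laden illustration rather than the code in benchmark.py: `run_model_ep` and the `validate`/`benchmark` callables are hypothetical, and the model name is assumed to contain no path separators.

```python
import tempfile
from pathlib import Path


def run_model_ep(model, ep, validate, benchmark):
    """Run validation and benchmarking for one model/EP pair in a private temp dir."""
    with tempfile.TemporaryDirectory(prefix=f"{model}_{ep}_") as tmp:
        workdir = Path(tmp)
        validate(model, ep, workdir)            # may write engines, profiles, fp16 models
        result = benchmark(model, ep, workdir)  # can reuse artifacts from validation
    # The directory and everything in it is deleted when the block exits.
    return result, workdir
```

Running both steps inside one `with` block is what lets benchmarking reuse the artifacts written during validation while still bounding disk usage to a single model/EP combination at a time.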
### Motivation and Context
The EP Perf pipeline failed to run because the script generated too much
output and the VM ran out of disk space.