### Description
set node schema when apply NHWC transformer
### Motivation and Context
The implementation in `IExecutionProvider::GetCapability()` checks node
schema to determine the capability of the current EP. If NHWC graph
transformer created a new channel last `Conv` node to replace the
channel first `Conv` node, we need to assign the schema to the replaced
node.
### Description
Before this change, when the D3D12 device was getting removed, we were
returning a generic device removed error, which can be harder to
investigate.
### Motivation and Context
It makes it easier to debug and investigate device removal failures.
**Description**:
Adds support for creating and receiving sparse tensors in the ORT Java
API.
CSRC and COO tensors as inputs are tested, but there is no op which
accepts a block sparse tensor to test. COO tensors are tested as
outputs, but there is no op which emits a CSRC or block sparse tensor to
test.
**Motivation and Context**
- Why is this change required? What problem does it solve? Request to
expose ORT sparse tensor support in Java.
cc @yuslepukhin
Motivation:
PythonOp is saving input for backward, it's risky since ONNX Runtime
backend is not aware of this, the tensor buffer may be "released" by
ORT, then potentially modified by other operators before backward
function executes.
Fix:
This pr just clone all input of PythonOp before forward is invoked. This
may be high overhead, it's just a workaround before a better fix.
### Fix training convergence issues
#### Problem:
Huggingface Transformers: 4.22.0
PyTorch Lightning: 1.6.3
PyTorch: v1.12.1, cuda 11.6
ORT: main branch, cuda 11.6
Model: RobertaForSequenceClassification @
models/roberta/modeling_roberta.py
Mixed Precision training with `torch.autocast`:
a64e1dfd7d/pytorch_lightning/plugins/precision/native_amp.py (L99)
Under this amp autocast context, forward + loss computation run. Here is
a snippet of loss computation.
```
if labels is not None:
...
if self.config.problem_type == "regression":
loss_fct = MSELoss()
if self.num_labels == 1:
...
elif self.config.problem_type == "single_label_classification":
loss_fct = CrossEntropyLoss()
**loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))**
elif self.config.problem_type == "multi_label_classification":
...
return SequenceClassifierOutput(
loss=loss,
logits=logits,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
)
```
It is found after forward run, loss is 1.0850 in float16, looks good..
Then it did a scaling up here:
a64e1dfd7d/pytorch_lightning/plugins/precision/native_amp.py (L62),
the scaler is 65536. then we get a scaled loss 71104 in float type
(because float16 loss multiple fp32 scaler, type got promoted to fp32).
Then backward started with initial grads to be 1, then 1 (float32) *
65536 (float32) as the backward step, generating a float16 gradient,
then we got a `inf`. The problem occurs. With `inf`, the backward feed
the `inf` into crossentropygradient op, generating `nan`s. Then all
gradients got `nan` in back propagation.
So we see training with ORTModule (it almost always `overflow`, the loss
did not drop too much, as compared with PyTorch).
#### Analysis for the UT (when autocast enabled)
PyTorch trace graph looks like this :
```
graph(%0 : Float(16, 3, strides=[3, 1], requires_grad=0, device=cuda:0),
%target : Long(16, strides=[1], requires_grad=0, device=cuda:0),
%2 : Float(3, 3, strides=[3, 1], requires_grad=1, device=cuda:0)):
%9 : int = prim::Constant[value=5]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
%10 : bool = prim::Constant[value=0]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
%11 : bool = prim::Constant[value=0]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
%12 : NoneType = prim::Constant()
%13 : Half(3, 3, strides=[3, 1], requires_grad=0, device=cuda:0) = aten::to(%2, %9, %10, %11, %12) # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
%14 : int = prim::Constant[value=5]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
%15 : bool = prim::Constant[value=0]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
%16 : bool = prim::Constant[value=0]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
%17 : NoneType = prim::Constant()
%18 : Half(16, 3, strides=[3, 1], requires_grad=0, device=cuda:0) = aten::to(%0, %14, %15, %16, %17) # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
%19 : NoneType = prim::Constant()
%input : Half(16, 3, strides=[3, 1], requires_grad=0, device=cuda:0) = aten::linear(%18, %13, %19) # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
%21 : NoneType = prim::Constant()
%22 : int = prim::Constant[value=1]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3,014:0
%23 : int = prim::Constant[value=-100]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3,014:0
%24 : float = prim::Constant[value=0.]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3,014:0
%data : Float(requires_grad=0, device=cuda:0) = **aten::cross_entropy_loss(%input, %target, %21, %22, %23, %24) # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3,014:0**
%27 : Float(requires_grad=0, device=cuda:0) = ^_OutputIdentityOp()(%data) # /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_io.py:430:0
return (%27)
```
The most important lines
%target : Long(16, strides=[1], requires_grad=0, device=cuda:0),
%input : **_Half_**(16, 3, strides=[3, 1], requires_grad=0,
device=cuda:0) = aten::linear(%18, %13, %19) #
/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
**_Float_**(requires_grad=0, device=cuda:0) =
aten::cross_entropy_loss(**%_input_**, %target, %21, %22, %23, %24) #
/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3,014:0
`aten::cross_entropy_loss` takes Half input, and return Float output. As
said in doc:
https://pytorch.org/docs/stable/amp.html#cuda-ops-that-can-autocast-to-float32,
`cross_entropy` in autocast mode will run in fp32 mode, e.g. convert its
input to fp32 (if it is not), do the compute and return fp32 result. The
other hand, ORT's `SoftmaxCrossEntropyLossInternal` take same types of
input and output, and our code
31cb3cb254/orttraining/orttraining/python/training/ortmodule/_custom_op_symbolic_registry.py (L68)
when exporting `aten::cross_entropy_loss` assumed this, and set the
output to be fp16 either. So this is the reason we have the problem.
#### Possible Fixes
1. Enhance `SoftmaxCrossEntropyLossInternal` to support different types
of input and output.
2. Check the input and output when exporting, add the input case
explicitly if there is type promotion from input to output.
This PR used the 2nd approach. We can start 1st approach when needed
later.
TODO: revisit all other exporter functions, add the checks, etc.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Add '-DCMAKE_OSX_ARCHITECTURES=x86_64;arm64' when build protobuf from
source on MacOS. Because later on we will the built library with the
other parts of onnxruntime to generate libonnxruntime.dylib, and if the
target CPU ARCH of libonnxruntime.dylib is not x86_64, it will fail.
### Motivation and Context
To fix a packaging pipeline failure, which was introduced from #13694
### Description
<!-- Describe your changes. -->
1. Update the rules for GemmFastGelu fusion, MatMul input x should >=
two dimension, input weight should == two dimension.
2. Add GemmFastGelu fusion test.
3. Add GemmFastGelu TunableOp, only contains the original
implementation(Gemm + FastGelu).
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
### Description
<!-- Describe your changes. -->
1. Re-add staticSelectionOp for FastGelu.
2. Call TunableOp when enable tuning. Call StaticSelectionOp when
disable tuning.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
**Description**:
1. add pytorch_half_pixel interpolation mode in resize-packed.ts
Changes: add the following case in createPackedResizeProgramInfo
function:
```
case 'pytorch_half_pixel':
getSourceFracIndex = `
vec4 getSourceFracIndex(ivec4 coords) {
vec4 fcoords = vec4(coords);
return vec4(
${outputWidth}.0 > 1.0 ? (fcoords.x + 0.5) / scaleWHWH.x - 0.5 : 0.0,
${outputHeight}.0 > 1.0 ? (fcoords.y + 0.5) / scaleWHWH.y - 0.5 : 0.0,
${outputWidth}.0 > 1.0 ? (fcoords.z + 0.5) / scaleWHWH.z - 0.5 : 0.0,
${outputHeight}.0 > 1.0 ? (fcoords.w + 0.5) / scaleWHWH.w - 0.5 : 0.0
);
}
`;
break;
```
2. fix "unrecognized input '' for node: Resize_$num" error when inputs
like [input_tensor, None, scale_factor] (roiInput not given) are fed
into the resize layer.
Changes: change in input handling logic in upsample.ts & node scanning
logic in graph.ts
**Motivation and Context**
Before this fix, we aren't able to use webGL backend when the neural
network contains pytorch resize layers. This fix adds
'pytorch_half_pixel' interpolation mode support and makes it possible to
use webGL backend for more kind of computer vision networks.
This commit solves:
#10430
Co-authored-by: neo <neo@icode-lab.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
**Description**:
Adding a few scripts to enable user to build ORT Web in a simpler way.
**Instructions**:
Under ROOT\js folder you will have 2 scripts -
1. "Build_web.bat" - for Windows users
1. "Build_web.sh" - for Linux users
Default build configuration is "Release" to change the build configuration just add to the script call the flag "--config <Desired configuration>". As example:
```
build_web.bat --config Debug
```
Co-authored-by: shalvamist <shalva.mist@microsoft.com>
Sometime it is a bit risky to call the Op directly to check whether the
impl supports consuming the param. This gives the user a way to actually
implement `IsSupported` for checking in non-compact way.
### Description
When using the build flag "--cmake_extra_defines
onnxruntime_DEBUG_NODE_INPUTS_OUTPUTS=1" with WASM it results with a
build break. Since we are comparing a const vs. non-const T type, this
added casting resolves the issue.
### Description
<!-- Describe your changes. -->
1. Build ROCm CI with Release config to save time.
2. use 32 threads to build, we have 256 threads on new CI machine.
3. enable ROCm kernel explorer test.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
Patch Protobuf and ONNX's cmake files and enforce BinSkim check.
This PR has overlap with #13523 . I would prefer to get this one merged
first so that we can finished the BinSkim work, and I try to make this
PR as small as possible.
`aten::_to_copy` is not exportable to ONNX. In DORT, so it's replaced in
`_replace_to_copy_with_to`. This replacement logic becomes incorrect in latest PyTorch
commit, and this PR is a fix.
Basically, we examine more key-word attributes passed to
`aten::_to_copy` and if they lead to a type casting operator (i.e.,
mapped to ONNX's Cast), we replace that `aten::_to_copy` with
`aten::to`. Unsupported attributes are removed (with a low risk of
breaking FX graph's assumptions).
### Description
After this change, you will see GSL.natvis and wil.nativs files will be
added to every onnxruntime_xxx project.
Like this:

This is because in onnxruntime_common.cmake we have:
```cmake
if (MSVC)
set(ABSEIL_NATVIS_FILE "abseil-cpp.natvis")
target_sources(
onnxruntime_common
INTERFACE $<BUILD_INTERFACE:${PROJECT_SOURCE_DIR}/external/${ABSEIL_NATVIS_FILE}>)
endif()
```
It sets a property, INTERFACE_SOURCES, on the target
"onnxruntime_common".
Then if anyone else uses:
```
target_link_libraries(mytarget PRIVATE onnxruntime_common)
```
The nativis file will be added to `mytarget`.
However, in this project we don't use such things for the targets that
are static libraries. For example, onnxruntime_graph is a static
library.
Instead, we use the `onnxruntime_add_include_to_target ` function to
explicitly control what we want to propagate . The function was written
before we started to have nativis files. So it doesn't pass a source
file from one static library to another. Now we have the need. Probably
only for Windows.
### Motivation and Context
Add natvis files to every project.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
fix https://github.com/microsoft/onnxruntime/issues/13508
### Description
Update protobuf-java to version 3.21.7. This change only impact tests.
### Motivation and Context
The current version exhibits CVE-2022-3509
### Description
<!-- Describe your changes. -->
Add ROCm5.3.2 to python package pipeline
we build rocm/dev-centos-7:x.x.x stage by ourselves to avoid dependence
on AMD's release.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
### Description
SkipLayerNorm performance improvement when bias is present as input
### Motivation and Context
- For SkipLayerNorm op, adding bias tensor using post-op to the add
primitive adding input and skip tensors is causing drastic performance
degradation.
- Hence the post-op is removed and instead, two add primitives are used
in series, adding input and skip, and then adding bias to the result of
input and skip.
- This change has shown a significant amount of performance gain for
SkipLayerNorm operator.
### Description
<!-- Describe your changes. -->
The default python upgrades to 3.11 in Mac, but 3.11 hasn't been
supported yet.
So Use python3.8 instead.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fix MacOS CI in Zip-Nuget-Java-Nodejs Packaging Pipeline
### Test Run
https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=249020&view=logs&j=ded01483-6627-58ac-64dc-d4a232827e5d
### Description
Fix round 6
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
- Add missing uint8 typedArray case
- Add createInputTensor_uint8 unit test in TensorHelperTest.java file
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Detected inferencesession.run() call error when running react native app
with uint8array input ort tensor. Add missing support to fix.
### Description
**Description**: [ONNX
Squeeze-13](https://github.com/onnx/onnx/blob/main/docs/Operators.md#Squeeze)
treats empty `axes` as if all axes had been given. This works for
[earlier Squeeze
versions](https://github.com/microsoft/onnxruntime/pull/12649), but
Squeeze-13 checks for axes as a dynamic input tensor, which means it
needs to checked for existence before accessing.
### Motivation and Context
- *Why is this change required? What problem does it solve?* Fixes a
customer model. Makes ORT DML EP consistent with spec.
### Description
Round 5 of the fixes, there are 192 to go.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Improve the profile explorer by enabling shape sensitivity for GPU
kernels.
### Motivation and Context
Due to problems with the ROCM profiler, it was previously challenging to
retrieve the shapes corresponding to a GPU kernel event. [PR
13546](https://github.com/microsoft/onnxruntime/pull/13549) addresses
these problems, so it's now possible to retrieve shapes from the ORT
ROCM/CUDA profilers. This PR leverages [PR
13546](https://github.com/microsoft/onnxruntime/pull/13549) to enable
shape-sensitive GPU kernel ranking.
Co-authored-by: Abhishek Udupa <abhishek.udupa@microsoft.com>
### Description
ignore dirty state of submodule XNNPACK
### Motivation and Context
ONNX Runtime WebAssembly build will apply a patch to XNNPACK so it is
considered 'dirty' state in the submodule. We want to ignore this when
checking the workspace using `git status`.
### Description
In some case, we can't get node's shape to do pre-process.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
This ensures that the graph is re-resolved after a free dimension shape
is overridden according to session options.
### Motivation and Context
This ensures that shape inference occurs, which is necessary to apply
the optimation and ensure it the session is compatible with bound
shapes. This bug seems to only have affected a small fraction of models.
### Description
Update pylint config to include valid short names
Also disabled `too-many-arguments` and `too-many-locals`
### Motivation and Context
Refine config to reduce lint noise
### Description
round 4, There are 436 more togo.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->