### Description
Enable the GatherElements op to support the DeBERTa model.
### Motivation and Context
- This change is required to enable the DeBERTa model, which is relevant to
MSFT.
Co-authored-by: mayavijx <mayax.vijayan@intel.com>
This PR optimizes the Slice CUDA kernel in two ways:
- Coalesce dimensions so the kernel performs fewer divmod operations
- Split data load and write for better memory throughput
The table below shows perf results (cycle counts from Nsight Compute) on a
V100, using real cases from Huggingface's XLNet model:
Input shape and slice | Old | New
-- | -- | --
[8,12,2048,1024], axis=2, start=1, end=2048 | 1838687| 1539846
[8,12,1024,2047], axis=3, start=0, end=1024 | 951383| 722203
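
The dimension coalescing idea can be sketched in plain Python. This is only an illustration, not the CUDA kernel: the helper names `coalesce_for_slice` and `slice_flat` are hypothetical, and the sketch assumes a single sliced axis with step 1, as in the cases in the table.

```python
from math import prod


def coalesce_for_slice(shape, axis):
    """Collapse the dims before and after the sliced axis into one dim each.

    Hypothetical helper (not an ORT API). [8, 12, 2048, 1024] with axis=2
    becomes (96, 2048, 1024), so decoding an output offset needs two
    divmods instead of one per original dimension.
    """
    outer = prod(shape[:axis])
    inner = prod(shape[axis + 1:])
    return outer, shape[axis], inner


def slice_flat(data, shape, axis, start, end):
    """Slice a flat row-major buffer along `axis` using the coalesced shape."""
    outer, dim, inner = coalesce_for_slice(shape, axis)
    out_dim = end - start
    out = []
    for out_idx in range(outer * out_dim * inner):
        rest, k = divmod(out_idx, inner)   # position inside the inner block
        o, j = divmod(rest, out_dim)       # outer block and index on the sliced axis
        out.append(data[(o * dim + j + start) * inner + k])
    return out


shape = [2, 3, 4, 5]
data = list(range(prod(shape)))
result = slice_flat(data, shape, axis=2, start=1, end=4)
```

However many dimensions the input has, the coalesced view has at most three, which is where the divmod savings would come from.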
### Description
This is the first PR adding the remaining ops for the XNNPACK EP.
I plan to add:
- [x] ConvTranspose f32 qu8 qs8
- [x] ~~UnMaxpool f32 qu8 qs8~~
- [x] Resize f32 qu8 qs8
- [ ] GEMM see https://github.com/microsoft/onnxruntime/pull/13126
Support for the remaining operations will be split into another PR.
### Motivation and Context
### Description
Remove code that invokes the cpuinfo library on platforms where we do not set
affinity.
### Motivation and Context
`cpuinfo` library increases binary size.
Fix for https://github.com/microsoft/onnxruntime/issues/13383,
https://github.com/microsoft/onnxruntime/issues/13408
Currently ort-web doesn't catch exceptions because turning on exception
catching increases the binary size by 3MB (~30%).
But ORT can throw (e.g. ONNX errors or ORT_ENFORCE), and when it does there is no
usable error message.
Turning on exception catching only for the released top-level API file will
fix the error messages with a minimal increase in binary size.
### Description
There are some compile errors with
google::protobuf::internal::RepeatedIterator.
This PR replaces reinterpret_cast with &(*iter), where iter is of type
RepeatedIterator.
### Motivation and Context
My environment is:
- libprotoc 3.21.5
- g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
When I use the build command:
```
./build.sh --use_cuda --cudnn_home /usr --cuda_home /usr/local/cuda --config Debug --build_shared_lib --parallel
```
There are some compile errors like this:
- error 1
onnxruntime/test/util/test_utils.cc:186:105: error: no matching function
for call to ‘make_span(google::protobuf::RepeatedField<long
int>::const_iterator, google::protobuf::RepeatedField<long
int>::const_iterator)’
186 | ind_span = gsl::make_span(indices_proto.int64_data().cbegin(),
indices_proto.int64_data().cend());
- error 2
onnxruntime/test/onnx/tensorprotoutils.cc:101:56: error: invalid cast
from type ‘google::protobuf::internal::RepeatedIterator<const long
unsigned int>’ to type ‘const uint32_t*’ {aka ‘const unsigned int*’}
101 | *p_data++ = *reinterpret_cast<const T*>(data_iter);
Allows MIGraphX EP to run the following additional tests. Also adds support to get MIGraphX to run eval_squad.py
Reference to the ROCm EP changes: https://github.com/microsoft/onnxruntime/pull/13306
Co-authored-by: Joseph Groenenboom <joseph.groenenboom@amd.com>
Co-authored-by: Ted Themistokleous <tthemist@amd.com>
CUDA's Transpose3DImpl transposes [batch, m, n] to [batch, n, m].
Currently it requires both m and n to be divisible by 32 or 16. If that is
not the case, the computation falls back to the general implementation,
which is slow. This PR removes the limitation.
Profiling on a V100 with the tensor sizes below gave the following cycle counts
from Nsight Compute:
Tensor shape | Old | New
-- | -- | --
[3072,64,512] | 760793 | 727140
[3072,16,2048] | 854303 | 851146
[3072,2048,12] | 986924 | 737884
[3072,1024,24] | 1212427 | 495117
The results show that even though we added extra IF statements to the kernel
implementation, there is nearly no impact on the cases the old version already
handled (cases 1 and 2). Cases 3 and 4, which previously fell back to the
general implementation, are much faster.
The data above was collected using FP16 tensors; similar results were
observed for float tensors.
This PR improves the perf of ORT training of Huggingface's XLNet
model, which has [8,1024,1024,12].permute(0,3,1,2).
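
A rough Python sketch of a tiled transpose that tolerates sizes that are not multiples of the tile edge. It is an assumption-laden illustration rather than the actual kernel: `transpose3d_tiled` and `TILE` are hypothetical names, and clamping the loop bounds stands in for the extra IF checks mentioned above.

```python
TILE = 32  # hypothetical tile edge; the real kernel's tiling may differ


def transpose3d_tiled(data, batch, m, n, tile=TILE):
    """Transpose a flat row-major [batch, m, n] buffer to [batch, n, m].

    Edge tiles are only partially filled, so the inner loops are clamped
    to m and n instead of requiring both to be multiples of the tile size.
    """
    out = [None] * (batch * m * n)
    for b in range(batch):
        base = b * m * n
        for row0 in range(0, m, tile):
            for col0 in range(0, n, tile):
                # Clamp to the matrix bounds: this is the guard for partial tiles.
                for r in range(row0, min(row0 + tile, m)):
                    for c in range(col0, min(col0 + tile, n)):
                        out[base + c * m + r] = data[base + r * n + c]
    return out
```

For example, `transpose3d_tiled(list(range(30)), 2, 5, 3, tile=4)` exercises partial tiles in both dimensions.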
### Description
Skip the test with --filter in runtest.sh
### Motivation and Context
Recently, the Zip-Nuget-Java-Nodejs Packaging Pipeline has consistently failed in
Nuget_Test_Linux_GPU.
To unblock the packaging workflow, skip the test in Nuget_Test_Linux_GPU
temporarily.
The exception message is below.
```
[xUnit.net 00:07:26.28] TestCUDAProviderOptions [FAIL]
Failed TestCUDAProviderOptions [1 m 19 s]
Error Message:
Microsoft.ML.OnnxRuntime.OnnxRuntimeException : [ErrorCode:RuntimeException] Non-zero status code returned while running FusedConv node. Name:'' Status Message: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:342 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool) Available memory of 11416064 is smaller than requested bytes of 134217728
Stack Trace:
at Microsoft.ML.OnnxRuntime.NativeApiStatus.VerifySuccess(IntPtr nativeStatus)
at Microsoft.ML.OnnxRuntime.InferenceSession.RunImpl(RunOptions options, IntPtr[] inputNames, IntPtr[] inputValues, IntPtr[] outputNames, DisposableList`1 cleanupList)
at Microsoft.ML.OnnxRuntime.InferenceSession.Run(IReadOnlyCollection`1 inputs, IReadOnlyCollection`1 outputNames, RunOptions options)
at Microsoft.ML.OnnxRuntime.InferenceSession.Run(IReadOnlyCollection`1 inputs, IReadOnlyCollection`1 outputNames)
at Microsoft.ML.OnnxRuntime.InferenceSession.Run(IReadOnlyCollection`1 inputs)
at Microsoft.ML.OnnxRuntime.Tests.CUDATest.TestCUDAProviderOptions() in /mnt/vss/_work/1/s/csharp/test/Microsoft.ML.OnnxRuntime.Tests.NetCoreApp/InferenceTest.netcore.cs:line 93
Failed! - Failed: 1, Passed: 0, Skipped: 0, Total: 1, Duration: < 1 ms - /mnt/vss/_work/1/s/csharp/test/Microsoft.ML.OnnxRuntime.EndToEndTests/bin/Debug/netcoreapp3.1/Microsoft.ML.OnnxRuntime.EndToEndTests.dll (netcoreapp3.1)
Done executing task "Microsoft.TestPlatform.Build.Tasks.VSTestTask" -- FAILED.
1>Done building target "VSTest" in project "Microsoft.ML.OnnxRuntime.EndToEndTests.csproj" -- FAILED.
1>Done Building Project "/mnt/vss/_work/1/s/csharp/test/Microsoft.ML.OnnxRuntime.EndToEndTests/Microsoft.ML.OnnxRuntime.EndToEndTests.csproj" (VSTest target(s)) -- FAILED.
```
### Description
### Motivation and Context
Right now we fix the warnings in an ad-hoc way. We run static analysis
in nightly builds, then create work items for the findings. Our
CI build pipelines run the same scan but do not break the build. So
this PR fixes the remaining findings in the CPU EP (including the
training part) and enforces the check. Later on we can continue to expand
the scope.
We still have some warnings left in the JNI part. I will try to address
them in the next month.
Need this for benchmarks to function correctly with older containers
This fixes import errors when attempting to run eval_squad.py to
evaluate bert distilled models
Adds a change to the previously merged #12947 which fails when using
Python version < 3.8 to run this script.
Co-authored-by: Ted Themistokleous <tthemist@amd.com>
### Description
Remove the tuning options from transformerOptions; use IsTunableOpEnabled from
the provider in the future.
### Motivation and Context
Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
### Description
TopK in BeamSearch retrieves top 2*beam next tokens based on logit
score, specifically computing top [batch, 2*beam] tokens based on score
[batch, beam, vocab_size].
### Motivation and Context
The current implementation uses batch as the grid, and each thread block
computes the top 2*beam from [beam, vocab_size]. This is inefficient because:
1. batch size is usually small (<32) and cannot fully leverage the GPU's
SMs;
2. vocab_size is usually more than 50k, and it is inefficient to
compute 50k * beam in one thread block.
This PR splits the topk computation into multiple stages:
- for small beam size, split [batch, beam, vocab_size] to [batch, beam,
parts_of_vocab, vocab_size_per_part]
  - 1st stage, each thread block computes the top 2*beam from
  vocab_size_per_part and gets [batch, beam, parts_of_vocab, 2*beam]
  - 2nd stage, each thread block computes the top 2*beam from
  parts_of_vocab * (2*beam) and gets [batch, beam, 2*beam]
  - last stage, compute [batch, 2*beam] from [batch, beam, 2*beam]
- for large beam size, the 1st stage computes [batch, beam, 2*beam] from
[batch, beam, vocab_size] and the 2nd stage computes [batch, 2*beam] from
[batch, beam, 2*beam].
With this change, latency drops by ~100us out of 2ms
for batch:4, beam:4, vocab_size:~50k.
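
The staged reduction for the small-beam case can be sketched on the CPU as follows. This is a hedged illustration for a single batch entry, not the CUDA implementation: `staged_topk` and the `parts` argument are hypothetical, and ties are broken by the larger flat index, which may differ from the kernel.

```python
import heapq


def topk(pairs, k):
    """Return the k largest (score, flat_index) pairs, best first."""
    return heapq.nlargest(k, pairs)


def staged_topk(scores, beam, parts):
    """Top 2*beam over scores[beam][vocab_size] for one batch entry.

    flat_index = beam_idx * vocab_size + token_id.
    """
    k = 2 * beam
    vocab = len(scores[0])
    per_part = -(-vocab // parts)  # ceil(vocab / parts)
    per_beam = []
    for b, row in enumerate(scores):
        # 1st stage: top-k inside each vocab part (one "thread block" each).
        stage1 = []
        for p in range(parts):
            lo, hi = p * per_part, min((p + 1) * per_part, vocab)
            stage1.append(topk([(row[t], b * vocab + t) for t in range(lo, hi)], k))
        # 2nd stage: reduce parts_of_vocab * (2*beam) candidates to 2*beam.
        per_beam.append(topk([c for part in stage1 for c in part], k))
    # Last stage: reduce beam * (2*beam) candidates to 2*beam.
    return topk([c for cand in per_beam for c in cand], k)
```

Because each stage keeps the full top 2*beam of its slice, the final result matches a single top-k over all beam * vocab_size scores.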
Use sequences to create initial feeds for decoder subgraph instead of
beam_next_tokens
### Description
For TuLG models, the decoder is exported differently from the BART model.
Passing beam_next_tokens to the decoder during ORT inference produced
results that differed from PyTorch inference.
This change uses sequences as inputs for the first iteration as well.
### Motivation and Context
PyTorch and ORT inference results for TuLG models did not match. Treating the
PyTorch result as correct, we modified ORT to match it.
### Description
Set the node schema when applying the NHWC transformer.
### Motivation and Context
The implementation in `IExecutionProvider::GetCapability()` checks the node
schema to determine the capability of the current EP. If the NHWC graph
transformer creates a new channel-last `Conv` node to replace the
channel-first `Conv` node, we need to assign the schema to the new
node.
### Description
Before this change, when the D3D12 device was getting removed, we were
returning a generic device removed error, which can be harder to
investigate.
### Motivation and Context
It makes it easier to debug and investigate device removal failures.
**Description**:
Adds support for creating and receiving sparse tensors in the ORT Java
API.
CSRC and COO tensors as inputs are tested, but there is no op which
accepts a block sparse tensor to test. COO tensors are tested as
outputs, but there is no op which emits a CSRC or block sparse tensor to
test.
**Motivation and Context**
- Why is this change required? What problem does it solve? Request to
expose ORT sparse tensor support in Java.
cc @yuslepukhin
Motivation:
PythonOp saves its inputs for backward. This is risky because the ONNX Runtime
backend is not aware of it: the tensor buffer may be "released" by
ORT and then potentially modified by other operators before the backward
function executes.
Fix:
This PR simply clones all inputs of PythonOp before forward is invoked. This
may add high overhead; it is only a workaround until a better fix is available.
### Fix training convergence issues
#### Problem:
Huggingface Transformers: 4.22.0
PyTorch Lightning: 1.6.3
PyTorch: v1.12.1, cuda 11.6
ORT: main branch, cuda 11.6
Model: RobertaForSequenceClassification @
models/roberta/modeling_roberta.py
Mixed Precision training with `torch.autocast`:
a64e1dfd7d/pytorch_lightning/plugins/precision/native_amp.py (L99)
Under this amp autocast context, forward + loss computation run. Here is
a snippet of loss computation.
```
if labels is not None:
    ...
    if self.config.problem_type == "regression":
        loss_fct = MSELoss()
        if self.num_labels == 1:
            ...
    elif self.config.problem_type == "single_label_classification":
        loss_fct = CrossEntropyLoss()
        **loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))**
    elif self.config.problem_type == "multi_label_classification":
        ...
return SequenceClassifierOutput(
    loss=loss,
    logits=logits,
    hidden_states=outputs.hidden_states,
    attentions=outputs.attentions,
)
```
After the forward run, the loss is 1.0850 in float16, which looks good.
It is then scaled up here:
a64e1dfd7d/pytorch_lightning/plugins/precision/native_amp.py (L62),
where the scaler is 65536. This gives a scaled loss of 71104 in float32
(the float16 loss multiplied by the fp32 scaler is promoted to fp32).
Backward then starts with an initial gradient of 1. The backward step
computes 1 (float32) * 65536 (float32) and produces a float16 gradient,
which overflows to `inf`. Backward feeds that `inf` into the
crossentropygradient op, which generates `nan`s, and all
gradients then become `nan` during back propagation.
As a result, training with ORTModule almost always reports `overflow`, and the loss
does not drop as much as it does with PyTorch.
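
The arithmetic above can be reproduced with the standard library alone. The sketch below emulates a float16 cast with `struct`'s IEEE 754 half-precision format; the helper `to_fp16` is hypothetical, and mapping `OverflowError` to `inf` is an assumption that mimics what a GPU cast would produce.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest float16 value.

    struct's 'e' format is IEEE 754 binary16. It raises OverflowError for
    values too large to represent, which is mapped to +/-inf here to mimic
    a device-side cast (an assumption of this sketch).
    """
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")


loss_fp16 = to_fp16(1.0850)        # nearest float16 to the reported loss
scale = 65536.0                    # fp32 loss scale
scaled_loss = loss_fp16 * scale    # promoted to a wider type, no overflow
grad_fp16 = to_fp16(1.0 * scale)   # 65536 exceeds the float16 max of 65504
nan_example = grad_fp16 * 0.0      # one way an inf turns into nan downstream
```

Here `scaled_loss` evaluates to 71104.0, matching the value reported above, while `grad_fp16` is `inf` because 65536 does not fit in float16.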
#### Analysis for the UT (when autocast enabled)
PyTorch trace graph looks like this :
```
graph(%0 : Float(16, 3, strides=[3, 1], requires_grad=0, device=cuda:0),
%target : Long(16, strides=[1], requires_grad=0, device=cuda:0),
%2 : Float(3, 3, strides=[3, 1], requires_grad=1, device=cuda:0)):
%9 : int = prim::Constant[value=5]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
%10 : bool = prim::Constant[value=0]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
%11 : bool = prim::Constant[value=0]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
%12 : NoneType = prim::Constant()
%13 : Half(3, 3, strides=[3, 1], requires_grad=0, device=cuda:0) = aten::to(%2, %9, %10, %11, %12) # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
%14 : int = prim::Constant[value=5]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
%15 : bool = prim::Constant[value=0]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
%16 : bool = prim::Constant[value=0]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
%17 : NoneType = prim::Constant()
%18 : Half(16, 3, strides=[3, 1], requires_grad=0, device=cuda:0) = aten::to(%0, %14, %15, %16, %17) # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
%19 : NoneType = prim::Constant()
%input : Half(16, 3, strides=[3, 1], requires_grad=0, device=cuda:0) = aten::linear(%18, %13, %19) # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
%21 : NoneType = prim::Constant()
%22 : int = prim::Constant[value=1]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3014:0
%23 : int = prim::Constant[value=-100]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3014:0
%24 : float = prim::Constant[value=0.]() # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3014:0
%data : Float(requires_grad=0, device=cuda:0) = **aten::cross_entropy_loss(%input, %target, %21, %22, %23, %24) # /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3014:0**
%27 : Float(requires_grad=0, device=cuda:0) = ^_OutputIdentityOp()(%data) # /opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_io.py:430:0
return (%27)
```
The most important lines
%target : Long(16, strides=[1], requires_grad=0, device=cuda:0),
%input : **_Half_**(16, 3, strides=[3, 1], requires_grad=0,
device=cuda:0) = aten::linear(%18, %13, %19) #
/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/modules/linear.py:114:0
**_Float_**(requires_grad=0, device=cuda:0) =
aten::cross_entropy_loss(**%_input_**, %target, %21, %22, %23, %24) #
/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/nn/functional.py:3014:0
`aten::cross_entropy_loss` takes a Half input and returns a Float output. As
the doc says:
https://pytorch.org/docs/stable/amp.html#cuda-ops-that-can-autocast-to-float32,
`cross_entropy` in autocast mode runs in fp32, i.e. it converts its
input to fp32 (if it is not already), does the compute, and returns an fp32
result. On the other hand, ORT's `SoftmaxCrossEntropyLossInternal` uses the
same type for input and output, and our code
31cb3cb254/orttraining/orttraining/python/training/ortmodule/_custom_op_symbolic_registry.py (L68)
assumed this when exporting `aten::cross_entropy_loss` and set the
output to fp16 as well. This is the cause of the problem.
#### Possible Fixes
1. Enhance `SoftmaxCrossEntropyLossInternal` to support different types
for input and output.
2. Check the input and output when exporting, and add the input cast
explicitly if there is type promotion from input to output.
This PR uses the 2nd approach. We can pursue the 1st approach later
if needed.
TODO: revisit all other exporter functions, add the checks, etc.
### Motivation and Context
### Description
Add '-DCMAKE_OSX_ARCHITECTURES=x86_64;arm64' when building protobuf from
source on macOS, because later on we will link the built library with the
other parts of onnxruntime to generate libonnxruntime.dylib, and if the
target CPU ARCH of libonnxruntime.dylib is not x86_64, it will fail.
### Motivation and Context
To fix a packaging pipeline failure, which was introduced from #13694
### Description
1. Update the rules for GemmFastGelu fusion: MatMul input x must have at
least two dimensions, and the weight input must have exactly two dimensions.
2. Add GemmFastGelu fusion test.
3. Add GemmFastGelu TunableOp, which only contains the original
implementation (Gemm + FastGelu).
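
The rank rule in item 1 amounts to a small predicate. The sketch below is illustrative only: `can_fuse_gemm_fast_gelu` is a hypothetical name, not the fusion pass's actual function, and `fast_gelu` uses the commonly cited tanh approximation of GELU as a stand-in for the FastGelu activation.

```python
import math


def can_fuse_gemm_fast_gelu(x_shape, weight_shape):
    """Rank check from the fusion rule: x has >= 2 dims, weight has exactly 2."""
    return len(x_shape) >= 2 and len(weight_shape) == 2


def fast_gelu(x):
    """tanh approximation of GELU (assumed here as the FastGelu formula)."""
    return 0.5 * x * (1.0 + math.tanh(0.7978845608 * (x + 0.044715 * x ** 3)))


# A [batch, seq, hidden] activation with a [hidden, 4*hidden] weight qualifies.
ok = can_fuse_gemm_fast_gelu((8, 128, 768), (768, 3072))
```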
### Motivation and Context
Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
### Description
1. Re-add StaticSelectionOp for FastGelu.
2. Call TunableOp when tuning is enabled. Call StaticSelectionOp when
tuning is disabled.
### Motivation and Context
Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
**Description**:
1. add pytorch_half_pixel interpolation mode in resize-packed.ts
Changes: add the following case in createPackedResizeProgramInfo
function:
```
case 'pytorch_half_pixel':
getSourceFracIndex = `
vec4 getSourceFracIndex(ivec4 coords) {
vec4 fcoords = vec4(coords);
return vec4(
${outputWidth}.0 > 1.0 ? (fcoords.x + 0.5) / scaleWHWH.x - 0.5 : 0.0,
${outputHeight}.0 > 1.0 ? (fcoords.y + 0.5) / scaleWHWH.y - 0.5 : 0.0,
${outputWidth}.0 > 1.0 ? (fcoords.z + 0.5) / scaleWHWH.z - 0.5 : 0.0,
${outputHeight}.0 > 1.0 ? (fcoords.w + 0.5) / scaleWHWH.w - 0.5 : 0.0
);
}
`;
break;
```
2. fix "unrecognized input '' for node: Resize_$num" error when inputs
like [input_tensor, None, scale_factor] (roiInput not given) are fed
into the resize layer.
Changes: change in input handling logic in upsample.ts & node scanning
logic in graph.ts
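
For reference, the per-component coordinate mapping in the shader above can be written as a scalar Python function. This is a sketch for checking values by hand: the function and parameter names are hypothetical, and `scale` plays the role of the matching `scaleWHWH` component.

```python
def pytorch_half_pixel(out_coord, out_length, scale):
    """Map an output pixel index to a fractional source coordinate.

    Mirrors the shader logic: when the output length along the axis is 1,
    the source coordinate is pinned to 0; otherwise it is the half-pixel
    mapping (out + 0.5) / scale - 0.5.
    """
    if out_length > 1:
        return (out_coord + 0.5) / scale - 0.5
    return 0.0


# Upsampling a width-2 row to width 4 (scale = 2.0):
coords = [pytorch_half_pixel(x, 4, 2.0) for x in range(4)]
```

The only difference from plain `half_pixel` is the special case for an output length of 1.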
**Motivation and Context**
Before this fix, we were not able to use the WebGL backend when the neural
network contained PyTorch resize layers. This fix adds
'pytorch_half_pixel' interpolation mode support and makes it possible to
use the WebGL backend for more kinds of computer vision networks.
This commit solves:
#10430
Co-authored-by: neo <neo@icode-lab.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
**Description**:
Add a few scripts that let users build ORT Web more simply.
**Instructions**:
Under ROOT\js folder you will have 2 scripts -
1. "Build_web.bat" - for Windows users
1. "Build_web.sh" - for Linux users
The default build configuration is "Release". To change the build configuration, add the flag "--config <Desired configuration>" to the script call. For example:
```
build_web.bat --config Debug
```
Co-authored-by: shalvamist <shalva.mist@microsoft.com>
Sometimes it is a bit risky to call the Op directly to check whether the
impl supports consuming the param. This gives the user a way to actually
implement `IsSupported` for checking in a non-compact way.
### Description
When using the build flag "--cmake_extra_defines
onnxruntime_DEBUG_NODE_INPUTS_OUTPUTS=1" with WASM, the result is a
build break. Since we are comparing a const vs. non-const T type, the
added cast resolves the issue.
### Description
1. Build ROCm CI with Release config to save time.
2. Use 32 threads to build; we have 256 threads on the new CI machine.
3. Enable ROCm kernel explorer test.
### Motivation and Context
Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>
Patch Protobuf and ONNX's cmake files and enforce BinSkim check.
This PR overlaps with #13523. I would prefer to get this one merged
first so that we can finish the BinSkim work, and I have tried to make this
PR as small as possible.
`aten::_to_copy` is not exportable to ONNX, so in DORT it's replaced in
`_replace_to_copy_with_to`. This replacement logic became incorrect with the latest PyTorch
commit, and this PR fixes it.
Basically, we examine more key-word attributes passed to
`aten::_to_copy` and if they lead to a type casting operator (i.e.,
mapped to ONNX's Cast), we replace that `aten::_to_copy` with
`aten::to`. Unsupported attributes are removed (with a low risk of
breaking FX graph's assumptions).
### Description
After this change, GSL.natvis and wil.natvis files will be
added to every onnxruntime_xxx project.

This is because in onnxruntime_common.cmake we have:
```cmake
if (MSVC)
  set(ABSEIL_NATVIS_FILE "abseil-cpp.natvis")
  target_sources(
      onnxruntime_common
      INTERFACE $<BUILD_INTERFACE:${PROJECT_SOURCE_DIR}/external/${ABSEIL_NATVIS_FILE}>)
endif()
```
It sets a property, INTERFACE_SOURCES, on the target
"onnxruntime_common".
Then if anyone else uses:
```
target_link_libraries(mytarget PRIVATE onnxruntime_common)
```
The natvis file will be added to `mytarget`.
However, in this project we don't use such things for the targets that
are static libraries. For example, onnxruntime_graph is a static
library.
Instead, we use the `onnxruntime_add_include_to_target` function to
explicitly control what we want to propagate. The function was written
before we started to have natvis files, so it doesn't pass a source
file from one static library to another. Now we need that, probably
only for Windows.
### Motivation and Context
Add natvis files to every project.
### Description
### Motivation and Context
fix https://github.com/microsoft/onnxruntime/issues/13508