### Description
The Env argument does not need to be mutable to call the underlying C
API. Update the Ort::Session ctor to have a const Env.
All other changes come from running clang-format.
### Motivation and Context
Cleanup
Record more info from the React Native CI E2E test. In particular, log the view hierarchy when exiting the test and dump logs from Android emulator to the build output.
### Description
Currently, SliceIterator copies at most one innermost dimension's worth
of elements at a time.
However, many slices allow several inner dimensions to be copied at
once.
Furthermore, even if a dimension is sliced, it may use step 1 and
therefore still form a contiguous block of inner dimensions that can be
copied at once.
### Motivation and Context
For example, take `[N, C, H, W]` with slice `[:, :, i:, :]`, which
yields `[N, C, H-i, W]`. That is, we slice along a single axis with
step = 1. The current implementation does `C * (H-i)` memcpy calls with
`W` elements each. With this change we can do `C` memcpy calls with
`(H-i)*W` elements each.
The optimization produces ~11% savings on certain internal models.
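The counting argument above can be checked with a small sketch. This is not the SliceIterator code; `contiguous_copy_plan` and its parameters are hypothetical names, and the function only counts copies and block sizes for a row-major tensor.

```python
from math import prod


def contiguous_copy_plan(dims, starts, ends, steps):
    """Return (num_copies, elements_per_copy) for a slice of a row-major tensor.

    Hypothetical helper for illustration only.
    """
    rank = len(dims)
    # Output extent along every axis.
    out = [len(range(starts[i], ends[i], steps[i])) for i in range(rank)]

    block = 1
    axis = rank - 1
    # Trailing axes that are taken in full extend the contiguous block.
    while axis >= 0 and out[axis] == dims[axis] and steps[axis] == 1:
        block *= dims[axis]
        axis -= 1
    # The first partially sliced axis still contributes if it uses step 1.
    if axis >= 0 and steps[axis] == 1:
        block *= out[axis]
        axis -= 1
    # Every remaining outer index combination needs its own copy.
    num_copies = prod(out[: axis + 1])
    return num_copies, block


# [N, C, H, W] = [1, 3, 5, 4], slice [:, :, 2:, :]
copies, block = contiguous_copy_plan(
    dims=[1, 3, 5, 4], starts=[0, 0, 2, 0], ends=[1, 3, 5, 4], steps=[1, 1, 1, 1]
)
print(copies, block)
```

With `N = 1` this gives `C` copies of `(H-i)*W` elements, compared with `C * (H-i)` copies of `W` elements when only the innermost dimension is copied at once.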
### Description
Based on the ORT spec for ConvTranspose:
```
output_shape can also be explicitly specified in which case pads values are auto generated using these equations:
total_padding[i] = stride[i] * (input_size[i] - 1) + output_padding[i] + ((kernel_shape[i] - 1) * dilations[i] + 1) - output_shape[i]
If (auto_pads == SAME_UPPER): pads[start_i] = total_padding[i]/2; pads[end_i] = total_padding[i] - (total_padding[i]/2)
Else: pads[start_i] = total_padding[i] - (total_padding[i]/2); pads[end_i] = (total_padding[i]/2).
```
However, the CPU EP logic differs. Basically, unless SAME_UPPER is
specified, the default behavior (for VALID, NOTSET, SAME_LOWER) should
be SAME_LOWER.
I think this is the pragmatic fix, however it's perhaps still not
totally up to standard.
In the case tested, the operator is actually only valid if padding is
inserted. Perhaps it "should" throw some error then, if auto_pad is not
SAME_UPPER or SAME_LOWER, as the spec also mentions:
"VALID mean no padding." (For convtranspose-1 but this was removed in
convtranspose-11, making it less clear.)
"NOTSET, which means explicit padding is used" (should technically
require explicit padding then, and not generate it)
HOWEVER, changing it to throw errors could do more harm than good. For
now, probably just best to make it consistent.
### Motivation and Context
We noticed that there was a discrepancy in one of the DML tests between
CPU and DML.
auto_pad is not specified, and DML uses SAME_LOWER behavior by default,
whereas the CPU EP uses SAME_UPPER behavior.
```json
{
"graph_name": "ConvTranspose output_shape with even strides odd kernel autopad NOTSET",
"op_type": "ConvTranspose",
"dilations": [1,1],
"group": 1,
"strides": [2,2],
"kernel_shape": [3,3],
"output_shape": [1,1,4,4],
"X": {"dims": [1,1,2,2], "function": "iota"},
"W": {"dims": [1,1,3,3], "value": [1,2,3,4,5,6,7,8,9]},
"B": [1],
"Y": {"dims": [1,1,4,4], "value": [1,5,6,7,5,17,15,19,11,25,16,19,17,40,25,28]},
"T": "float32"
}
```
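As a worked example, the spec equations quoted in the description can be applied to one spatial axis of this test case. The sketch below is a direct transcription of those equations, not the CPU EP or DML code; `conv_transpose_pads` is a hypothetical helper name.

```python
def conv_transpose_pads(input_size, stride, kernel, output_size,
                        auto_pad="NOTSET", dilation=1, output_padding=0):
    """Return (pad_begin, pad_end) for one spatial axis, per the quoted spec text."""
    total_padding = (stride * (input_size - 1) + output_padding
                     + ((kernel - 1) * dilation + 1) - output_size)
    if auto_pad == "SAME_UPPER":
        # The extra padding goes at the end.
        pad_begin = total_padding // 2
        pad_end = total_padding - total_padding // 2
    else:
        # Everything else behaves like SAME_LOWER: extra padding at the start.
        pad_begin = total_padding - total_padding // 2
        pad_end = total_padding // 2
    return pad_begin, pad_end


# Test case above: input 2, stride 2, kernel 3, output 4 -> total_padding = 1
print(conv_transpose_pads(2, 2, 3, 4, auto_pad="NOTSET"))      # (1, 0)
print(conv_transpose_pads(2, 2, 3, 4, auto_pad="SAME_UPPER"))  # (0, 1)
```

Because the total padding is odd here, the two branches place the single padded element on opposite sides, which is why the CPU and DML outputs differed.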
The commit causes subtle perf regressions in image models (caught by
Anubis). Since we are close to the release, reverting this change for
now so that the regression cause analysis doesn't push the release
timeline.
Once the PR is merged, I will re-open the GH issues that the original PR
closed.
### Motivation and Context
Fix regression in ORT 1.13 RC
### Description
Fix bug where the zero point is incorrect under entropy calibration.
### Motivation and Context
### Description
Update the ptca image used in the nightly pipeline.
Co-authored-by: Adam Louly <adamlouly@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
**Description**: utils for federated learning.
**Motivation and Context**
- This PR includes utils that will be used in federated learning
scenarios.
- It exposes Python bindings for some utils and adds a util to calculate
the difference between two buffers.
Co-authored-by: Adam Louly <adamlouly@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
### Description
ROCm developers always need to build the onnxruntime `*.whl` with
`--enable_rocm_profiling`.
Add a ROCm dev Python package pipeline that produces a `*.whl` built
with `--enable_rocm_profiling`.
The dev `*.whl` is uploaded to Azure storage and can be downloaded from
https://download.onnxruntime.ai/onnxruntime_nightly_rocm53.profiling.html
### Motivation and Context
### Description
Detect and report thread creation failure on Windows.
Do not throw out of the constructor after the thread is created:
the thread handle would be lost and could not be joined, resulting in a
deadlock.
Make setting the thread priority on Linux consistent with Windows.
Set the thread priority in the thread itself. Log failure properly,
but do not exit the thread.
### Motivation and Context
Address issues https://github.com/microsoft/onnxruntime/issues/13291
and
https://github.com/microsoft/onnxruntime/issues/13285#issuecomment-1278063223
Avoid using vec_max/vec_min because their behavior is undefined if one
of the elements is NAN. The Power Vector Intrinsic Programming Reference
says:
"For floating-point types, if both source elements contain signed
zeros, or if either source element contains a NaN, it is
undefined which of the two source elements is copied into
the corresponding result element."
As the unittest Activation.ShortExecute expects NAN, this patch uses
vec_sel and vec_cmpgt to return NAN if one of the elements is NAN.
https://git.openpower.foundation/systemsoftware/Programming-Guides/src/branch/master/Intrinsics_Reference/ch_vec_reference.xml#L26808
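The compare-and-select idea can be illustrated on scalars. This is only a sketch of the semantics in Python, not the vec_sel/vec_cmpgt code from the patch; `select_max` is a hypothetical name.

```python
import math


def select_max(a, b):
    """Maximum built from a comparison plus a select, propagating NaN."""
    # Take a when it compares greater, or when a itself is NaN (a != a).
    # If b is NaN, both tests are false and b (NaN) is selected.
    take_a = (a > b) or (a != a)
    return a if take_a else b


print(select_max(1.0, 2.0))
print(math.isnan(select_max(math.nan, 1.0)))
print(math.isnan(select_max(1.0, math.nan)))
```

Every comparison against NaN is false, so the select falls through to the NaN operand whichever side it appears on.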
**Description**: Changes to the MIGraphX execution provider code to
allow for stream synchronization on the GPU side.
**Motivation and Context**
Performance boost by removing redundant host-to-device synchronizations.
The current implementation of the execution provider continuously calls
hipDeviceSynchronize() between computations which adds overhead and an
idle wait between the GPU's computations. This is noticeable during
device
This change leverages new functionality that's been added to MIGraphX to
allow for GPU side synchronization which avoids the need for
host->device waits.
To maintain backwards compatibility with older MIGraphX versions, the
compile-time define MIGRAPHX_STREAM_SYNC has been added to the API to
allow older versions to operate with newer builds of onnxruntime without
loss of functionality relative to the current feature set (as of
08/09/22).
Co-authored-by: Ted Themistokleous <tthemist@amd.com>
`python setup.py develop` no longer installs PyTorch as a normal package
in site-packages, so the user must stay in PyTorch's root directory to
call `import torch`. This breaks LORT tests because they contain
`import torch` and are run outside the PyTorch root directory. To make
PyTorch a normal package again, this PR builds PyTorch with
`python setup.py install`.
This PR adds support for using an env variable to set the provider
option cudnn_conv_algo_search so that users can choose a better conv
algo search method when running a model. This is a quick fix to unblock
testing of the MoE model. A follow-up PR will design and implement the
ORTModule config so that ORTModule can be configured through a Python
script or config file instead of env variables.
Models in [huggingface's diffusers
library](https://github.com/huggingface/diffusers) use
torch.nn.GroupNorm, which is exported to a sub-graph containing ONNX's
InstanceNormalization, which lacks a gradient. In addition, ORT's
InstanceNormalization implementation calls cuDNN's BatchNorm for part of
the computation, which is not efficient compared to PyTorch's
implementation. This PR uses ATen fallback to support this torch
module, including its forward and backward.
### Description
float and half initializers with the same value are merged into the same
initializer. This bug occurs because, when we create the pattern key,
the data type is always -1 (a naive mistake from a previous code
refactoring), and because float and half are both stored as float in the
constant store for easier data comparison.
Added test coverage.
`[ONNXRuntimeError] : 1 : FAIL : Type Error: Type parameter (T) of Optype (Mul) bound to different types (tensor(float) and tensor(float16) in node`
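A minimal sketch of why the data type has to be part of the key, assuming a simple dictionary-based deduplication. This is not ORT's constant-sharing code; `dedup_initializers` and the tuple layout are hypothetical.

```python
def dedup_initializers(initializers):
    """Map each initializer name to the canonical initializer it can share.

    `initializers` is a list of (name, dtype, values) tuples (illustrative layout).
    """
    seen = {}
    mapping = {}
    for name, dtype, values in initializers:
        # Values are compared as float, so the dtype must be in the key;
        # otherwise float and float16 tensors with equal values would collide.
        key = (dtype, tuple(float(v) for v in values))
        canonical = seen.setdefault(key, name)
        mapping[name] = canonical
    return mapping


inits = [
    ("a", "float", [1.0]),
    ("b", "float16", [1.0]),
    ("c", "float", [1.0]),
]
print(dedup_initializers(inits))
```

Here `c` is merged into `a`, while `b` stays separate because its dtype differs even though the stored value is the same.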
### Motivation and Context
### Description
Unit tests with ROCm5.3 are slower than with ROCm5.2.3. Revert to
ROCm5.2.3.
We will update to ROCm5.3 when the issue is resolved by AMD.
### Motivation and Context
### Description
Add ability to upgrade an ORT format model when loaded in a full build
by inserting the kernel constraint info and ignoring the kernel hashes.
This also allows upgrading the model to the latest format by saving the
model after loading.
### Motivation and Context
Provide an official path to upgrading an ORT format model directly (vs.
reconverting).
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
Supplement to #13248.
Add PR trigger:
https://learn.microsoft.com/en-us/azure/devops/pipelines/repos/github?view=azure-devops&tabs=yaml#pr-triggers
Fix: master -> main
Tested with #13289 and #13292.
NB:
the real pipeline is always triggered if the workflow yaml changed, even
if it's added in the path filter.
### Motivation and Context
Make sure the real pipeline does not run in the backend.
### Description
Check for null input
### Motivation and Context
This has been reported at least twice (once by the Windows team and once
by Speech team). Currently we just segfault.
The current graph builder for ORTModule applies the training graph
optimizations in both training and eval mode. Take BatchNorm as an
example: one of the training graph optimizations replaces the
BatchNormalization op with BatchNormInternal, which is for training
only. This PR fixes this: in eval mode, we no longer apply the training
graph optimizations. The inference graph optimizations are applied
during InferenceSession initialization.
### Description
Correct the file name in the comments of the generated yaml.
### Motivation and Context
### Description
1. Remove ROCm5.1.1 and ROCm5.2 from ROCm python package pipeline
2. Add ROCm5.3 to ROCm python package pipeline
pipeline:
https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=237172&view=results
### Motivation and Context
### Description
Implemented gradient of cos as per the function below.
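The function the description refers to is not present in this text. Assuming it is the standard identity d/dx cos(x) = -sin(x), so that `dX = -dY * sin(X)`, a scalar sketch with a finite-difference check looks like this. The names `cos_grad` and `numeric_grad` are hypothetical and are not part of ORT.

```python
import math


def cos_grad(dy, x):
    """Gradient of cos with respect to x, scaled by the upstream gradient dy."""
    return -dy * math.sin(x)


def numeric_grad(f, x, eps=1e-6):
    """Central finite-difference estimate of df/dx."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


x = 0.7
analytic = cos_grad(1.0, x)
numeric = numeric_grad(math.cos, x)
print(abs(analytic - numeric) < 1e-6)
```

This mirrors what a gradient checker does: compare the analytic gradient against a numerical estimate at sample points.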

### Motivation and Context
Cos gradient required for [huggingface's diffusers
library](https://github.com/huggingface/diffusers)
### Testing
built ORT from source: `./build.sh --config RelWithDebInfo
--enable_training --use_cuda --cuda_home /usr/local/cuda --cudnn_home
/usr/local/cuda --build_wheel --parallel --skip_tests`
tested CosGrad implementation: `cd build/Linux/RelWithDebInfo/ &&
./onnxruntime_test_all --gtest_filter=GradientCheckerTest.CosGrad`
Co-authored-by: Prathik Rao <prathikrao@microsoft.com>
### Description
Implemented gradient of sin as a function op.
### Motivation and Context
The Sin gradient is currently implemented as a CPU op, which could hurt
performance.
### Testing
built ORT from source: `./build.sh --config RelWithDebInfo
--enable_training --use_cuda --cuda_home /usr/local/cuda --cudnn_home
/usr/local/cuda --build_wheel --parallel --skip_tests`
tested SinGrad implementation: `cd build/Linux/RelWithDebInfo/ &&
./onnxruntime_test_all --gtest_filter=GradientCheckerTest.SinGrad`
Co-authored-by: Prathik Rao <prathikrao@microsoft.com>
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
### Description
To construct the test name, replace whitespace with underscores and
remove parentheses.
### Motivation and Context
gtest names only accept '_' and alphanumeric characters.
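The renaming rule can be sketched as follows. This is an illustration in Python rather than the actual test-harness code, and `to_gtest_name` is a hypothetical name.

```python
import re


def to_gtest_name(name):
    """Replace each whitespace character with '_' and drop parentheses."""
    name = re.sub(r"\s", "_", name)
    return name.replace("(", "").replace(")", "")


print(to_gtest_name("Conv (no bias) 3x3"))  # Conv_no_bias_3x3
```

Other characters that gtest rejects are left untouched in this sketch; only the two transformations named above are applied.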
**Description**: Add qkv_hidden_size support in CUDA Attention Layer
implementation.
Changes include:
- Modify UT to test GPU and CPU implementation
- Add overload for CUDA kernel `AddBiasTransposeQKV` to support scenario
where V_HIDDEN_SIZE != QK_HIDDEN_SIZE
- Update variable names from `head_size` to `qkv_head_sizes[0]` or
`qkv_head_sizes[2]`
- Modify function definitions to allow communication of
`qkv_hidden_sizes` or `qkv_head_sizes`
Note that this feature is not supported in the ROCm EP or quantized
attention right now.
**Motivation and Context**
- The current CUDA implementation of the attention layer doesn't support
the parameter qkv_hidden_size added in the CPU implementation in PR
[8039](https://github.com/microsoft/onnxruntime/pull/8039)
Co-authored-by: Peter Mcaughan <petermca@microsoft.com>
This PR has two fixes:
- https://github.com/pytorch/pytorch/pull/85636 changes the behavior of
register_custom_op_symbolic to only register the symbolic function at a
single version. For ORTModule we need to pass the opset version when
calling it.
- Since torch 1.13, the signature of einsum has a new argument, so we
need to change our custom op symbolic registry code accordingly.
Without the fixes, ORTModule will not work with the nightly torch or
with the new torch version once it is released.
clang-tidy says "Do not implicitly decay an array into a pointer; consider using gsl::array_view or an explicit cast instead"
It is a false positive scattered all over our codebase wherever helper
macros are used. This is because, for a function with a 4-char name, say
`main`, the type of `__FUNCTION__` and `__PRETTY_FUNCTION__` is
`char [5]`.