Update the usage of torch.onnx.OnnxRegistry, as it's officially
published in PyTorch: https://github.com/pytorch/pytorch/pull/106140.
---------
Co-authored-by: Wei-Sheng Chin <wechi@microsoft.com>
### Description
Enhanced SkipLayerNorm by implementing broadcasting for both CPU and
CUDA.
### Motivation and Context
The input and skip tensors no longer have to be the same size. The skip
tensor can have the same shape as the input, or a shape of
{1, sequence_length, hidden_size} or {sequence_length, hidden_size}.
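A minimal sketch of the accepted skip shapes, written as a plain-Python check. The function name `skip_shape_is_supported` is hypothetical and not part of onnxruntime; it only restates the three cases listed above, assuming the input is laid out as (batch_size, sequence_length, hidden_size).

```python
def skip_shape_is_supported(input_shape, skip_shape):
    """Return True if skip_shape is one of the three layouts described above.

    Assumption: input_shape is (batch_size, sequence_length, hidden_size).
    """
    batch_size, sequence_length, hidden_size = input_shape
    allowed = {
        (batch_size, sequence_length, hidden_size),  # same shape as the input
        (1, sequence_length, hidden_size),           # broadcast over the batch
        (sequence_length, hidden_size),              # 2-D, broadcast over the batch
    }
    return tuple(skip_shape) in allowed
```

For example, with an input of shape (4, 128, 768), a skip of shape (128, 768) passes the check while a skip of shape (2, 128, 768) does not.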
---------
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
### Fix a few bugs
1. Symbolic shape inference: there was no None check before getting the
length.
2. Rename PythonOp/PythonOpGrad's attribute `name` to `func_name`;
otherwise, when we use onnx.helper.make_node to create a node, `name`
conflicts with the node name.
3. Filter shape inference warnings for PythonOp for torch 2.0 or newer.
4. Close file descriptors for log suppression. Without the fix, two extra
fds are left open after the log suppression exits its context.
Before entering log suppression (left), before exiting log suppression (right)

With the fix, no fd is added after the context exits.
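The fd handling in item 4 can be sketched as follows. This is not the ORTModule implementation; `suppress_fd_output` and `next_free_fd` are hypothetical helpers that only illustrate the pattern of closing both the saved descriptor and the devnull descriptor on exit.

```python
import contextlib
import os


@contextlib.contextmanager
def suppress_fd_output(target_fd=2):
    """Redirect target_fd to os.devnull, then restore it without leaking fds."""
    saved_fd = os.dup(target_fd)                    # extra fd no. 1
    devnull_fd = os.open(os.devnull, os.O_WRONLY)   # extra fd no. 2
    try:
        os.dup2(devnull_fd, target_fd)
        yield
    finally:
        os.dup2(saved_fd, target_fd)                # restore the original target
        os.close(devnull_fd)                        # without these two closes,
        os.close(saved_fd)                          # two fds stay open on exit


def next_free_fd():
    """Return the lowest unused fd number; it grows if earlier fds were leaked."""
    fd = os.open(os.devnull, os.O_RDONLY)
    os.close(fd)
    return fd
```

Comparing `next_free_fd()` before and after the `with` block is one cheap way to confirm that no descriptor was added.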

If users set `trt_profile_min_shapes`, `trt_profile_max_shapes` and
`trt_profile_opt_shapes`, they need to provide shape profiles for all
dynamic-shape inputs.
When the main graph is partitioned into TRT/CUDA subgraphs and an input
of a subgraph is also dynamic-shape, users need to provide its shape
profiles as well. Users might not notice this, so the TRT EP now tells
them which input shape profiles need to be provided.
The new message is:
```
Traceback (most recent call last):
File "/home/azureuser/disk2/debug/optional_inputs.py", line 218, in <module>
test_optional_input_dynamic(trt_profile=True, optional=True)
File "/home/azureuser/disk2/debug/optional_inputs.py", line 195, in test_optional_input_dynamic
session = ort.InferenceSession(
File "/home/azureuser/anaconda3/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line
419, in __init__
self._create_inference_session(providers, provider_options, disabled_optimizers)
File "/home/azureuser/anaconda3/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line
471, in _create_inference_session
sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.EPFail: [ONNXRuntimeError] : 11 : EP_FAIL : User needs to provide all the
dynamic shape inputs with associated profiles if they want to explicitly set profiles through provider options.
Please note that main graph could be partitioned into TRT/CUDA/CPU subgraphs, in this case, user also needs to provide
shape profiles for the TRT subgraph's input if it's dynamic shape input.
Following input(s) has no associated shape profiles provided: x1
```
Please see this github issue:
https://github.com/microsoft/onnxruntime/issues/16600
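As an illustration of what "provide all the dynamic shape inputs" means, here is a small sketch that builds the three option strings and reports inputs without a complete profile. The helper names and the input names `x0`/`x1` are hypothetical, and the `name:d1xd2,...` string layout is an assumption about the provider option format that should be checked against the TensorRT EP documentation.

```python
def format_shape_profile(shapes):
    """Turn {"x0": [1, 8]} into "x0:1x8" (assumed option string layout)."""
    return ",".join(
        f"{name}:{'x'.join(str(dim) for dim in dims)}"
        for name, dims in shapes.items()
    )


def inputs_missing_profiles(dynamic_inputs, min_shapes, max_shapes, opt_shapes):
    """Return the dynamic inputs that lack a min, max, or opt profile."""
    return [
        name
        for name in dynamic_inputs
        if not (name in min_shapes and name in max_shapes and name in opt_shapes)
    ]


min_shapes = {"x0": [1, 8]}
max_shapes = {"x0": [4, 64]}
opt_shapes = {"x0": [2, 32]}

# "x1" is dynamic too but has no profile, so it is reported as missing.
missing = inputs_missing_profiles(["x0", "x1"], min_shapes, max_shapes, opt_shapes)

trt_options = {
    "trt_profile_min_shapes": format_shape_profile(min_shapes),
    "trt_profile_max_shapes": format_shape_profile(max_shapes),
    "trt_profile_opt_shapes": format_shape_profile(opt_shapes),
}
# The options would then be passed along with the provider name, e.g.
# providers=[("TensorrtExecutionProvider", trt_options)].
```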
### Description
Adjust natvis to display tagged strings.
### Motivation and Context
Hard to debug without seeing names.
[DML] Model corrupted during layernorm fusion and DmlNonZeroOperator
crashes
Two issues fixed in this PR:
1) Changes to layernorm fusion regressed DirectML. The fusion has been
disabled for DML to unblock models.
2) DmlNonZero needs to create an operator call that needs to know the
number of non-zero elements (size in bytes). It therefore needs to be
allocated during compute, but was being allocated during initialization.
This caused the output tensor size to mismatch the operator's
expectations.
---------
Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
### Description
1. Add valgrind to existing ep_perf CI MemTest and parse ORT-TRT memLeak
   details
   1. General Valgrind logs and logs related to ORT-TRT will be parsed in
      [CI
      artifacts](https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=334122&view=artifacts&pathAsName=false&type=publishedArtifacts)
   2. Logic:
      1. Run valgrind with `onnxruntime-perf-test -e tensorrt` and export
         log to `valgrind.log`
      2. Identify if any `definitely lost` memleak happened
         1. For log paragraphs which show `definitely lost`, parse if
            they have keyword `TensorrtExecutionProvider`.
         2. If so, extract these details to `ort_trt_memleak_detail.log`,
            and return `build failure` to EP Perf CI
2. Fix existing addressSanitizer and sync the squeezenet testcase with
   latest update from
   [ort-inference-example](https://github.com/microsoft/onnxruntime-inference-examples/blob/main/c_cxx/squeezenet/main.cpp)
   1. Updates in short: Upgrade main.cpp to be using
      OrtTensorRTProviderOptionsV2
3. Reorder the 7-min-MemTest to be ahead of 9-hr-model-tests, and enable
   MemTest by default
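
The log-parsing logic described above can be sketched roughly as follows. This is not the CI script itself; the function names are hypothetical, and the sketch assumes the usual valgrind layout in which every line carries a `==PID==` prefix and leak records are separated by prefix-only lines.

```python
def split_valgrind_paragraphs(log_text):
    """Group log lines into paragraphs separated by empty '==PID==' lines."""
    paragraphs, current = [], []
    for line in log_text.splitlines():
        # Drop the leading "==PID==" marker if present.
        body = line.split("==", 2)[-1].strip() if line.startswith("==") else line.strip()
        if body:
            current.append(body)
        elif current:
            paragraphs.append(current)
            current = []
    if current:
        paragraphs.append(current)
    return paragraphs


def find_ort_trt_leaks(log_text, keyword="TensorrtExecutionProvider"):
    """Return 'definitely lost' paragraphs that mention the keyword."""
    return [
        "\n".join(paragraph)
        for paragraph in split_valgrind_paragraphs(log_text)
        if any("definitely lost" in line for line in paragraph)
        and any(keyword in line for line in paragraph)
    ]
```

A CI step could write the returned paragraphs to `ort_trt_memleak_detail.log` and fail the build when the list is non-empty.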
### Description
This change lets the Web CI run some checks as its first step, so that if
there are errors it won't launch the heavy task of building WebAssembly.
Checks include:
- "npm ci" in /js, /js/common and /js/web. This implicitly includes:
  - typescript compiler in /js
  - typescript compiler in /js/common
  - webpack build in /js/common
  - typescript compiler in /js/web
- ESLint on typescripts
- clang-format formatter (.js, .ts, .cc, .h, .mm)
- Prettier formatter (.json, .jsonc, .md)
---------
Co-authored-by: Caroline Zhu <carolinezhu@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Add a generic `UpdateTensorRTProviderOptionsWithValue()` C API to update
TensorRT provider options whose data type is a pointer that can't be
represented by a string.
### Description
This PR makes sure that only storage buffers are reused. Previously, a
query buffer might also be taken from the freeBuffers list if there was a
buffer of matching size in it. But the two have different usages, which
results in errors.
Fixed ArgMin and ArgMax and refactored using functionality from Reduce
operator code.
### Description
Removed code/functionality duplication and fixed some issues.
### Description
- Improves how unit tests measure the accuracy of QDQ models on QNN EP.
- Adds tests for ops: Add, Mul, Abs<sup>1</sup>, And<sup>1</sup>,
Or<sup>1</sup>, Ceil<sup>1</sup>, Cos<sup>1</sup>
<sup>1</sup>: Not previously supported due to missing node unit
handling.
### Motivation and Context
The new approach for testing QDQ operator accuracy requires running 3
inferences:
1. float model on CPU EP (baseline)
2. qdq model on CPU EP
3. qdq model on QNN EP
The unit tests check that running the QDQ model on QNN EP (3) is at
least as accurate (+- small tolerance) as running the QDQ model on CPU
EP (2). We measure accuracy by comparing to the baseline (1).
This is essentially what we care about: is QNN EP as accurate as CPU EP?
If not, it is worth investigating as a potential bug.
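The comparison can be sketched in a few lines. The error metric (maximum absolute error), the default tolerance, and the function names are assumptions made for illustration; the actual unit tests may measure accuracy differently.

```python
def max_abs_error(expected, actual):
    """Largest element-wise absolute difference between two flat outputs."""
    return max(abs(e - a) for e, a in zip(expected, actual))


def qnn_is_at_least_as_accurate(float_cpu, qdq_cpu, qdq_qnn, tolerance=1e-4):
    """Compare both QDQ runs against the float CPU baseline (1).

    Passes when the QNN EP error (3) does not exceed the CPU EP error (2)
    by more than the tolerance.
    """
    cpu_error = max_abs_error(float_cpu, qdq_cpu)
    qnn_error = max_abs_error(float_cpu, qdq_qnn)
    return qnn_error <= cpu_error + tolerance
```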
### Motivation and Context
When we handle PyTorch models' inputs in different places (ORTModule or
others), it's common for us to flatten structured data into a 1-D
tensor list (required by libs such as torch.onnx.export,
torch.autograd.Function.forward or the ORT inference session), do the
subsequent work, then unflatten the results back to the original
hierarchy as return values.
The DeepSpeed stage 3 hook support work also needs such a lib to do
similar things, so I propose extracting this pair of APIs into
training/utils/, where they can be used more generally. A comprehensive
set of test data is also used for testing unflatten/flatten in unit tests.
Let me know if you have any other suggestions.
### Refactor schema extraction and output unflattening
Move `_extract_schema` and `unflatten_user_output` in
`orttraining/orttraining/python/training/ortmodule/_io.py` to
`extract_data_and_schema` and `unflatten_data_using_schema` in
`orttraining/orttraining/python/training/utils/torch_io_helper.py` as
shared libs, which can be used later by other features (DeepSpeed stage
3 hook rewrite).
There is still some duplicated logic that handles flattening for
different tasks by recursively looping over the data structure; it will
be changed step by step to avoid heavy review effort.
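To make the flatten/unflatten idea concrete, here is a toy version in plain Python. It is not the code in `torch_io_helper.py`: the names `flatten_with_schema`, `unflatten_with_schema`, and `LeafSlot` are hypothetical, and any non-container value stands in for a tensor.

```python
class LeafSlot:
    """Placeholder in the schema that records a leaf's position in the flat list."""

    def __init__(self, index):
        self.index = index


def flatten_with_schema(data):
    """Return (leaves, schema); the schema mirrors the nesting of data."""
    leaves = []

    def walk(node):
        if isinstance(node, dict):
            return {key: walk(value) for key, value in node.items()}
        if isinstance(node, (list, tuple)):
            return type(node)(walk(item) for item in node)
        leaves.append(node)
        return LeafSlot(len(leaves) - 1)

    return leaves, walk(data)


def unflatten_with_schema(leaves, schema):
    """Rebuild the original hierarchy from a flat list and its schema."""
    if isinstance(schema, LeafSlot):
        return leaves[schema.index]
    if isinstance(schema, dict):
        return {key: unflatten_with_schema(leaves, value) for key, value in schema.items()}
    return type(schema)(unflatten_with_schema(leaves, item) for item in schema)
```

The flat list can be processed (for example, passed through an inference session) and the results fed back through `unflatten_with_schema` to recover the original structure.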
### Description
Make the CacheHint mechanism, which is designed to avoid running the same
test multiple times by saving the result mapped against a key, work
correctly by adding the input dims to the key.
### Description
Add an option to generate different formats of attention_mask for
testing transformers models:
1 - 1D mask index, actual sequence length excluding padding
2 - 2D attention mask. Value 0 means padding, 1 otherwise.
3 - 1D, key lengths and cumulated sequence lengths of query and key
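A small sketch of the first two formats, assuming right-side padding. The function name `make_attention_mask` is hypothetical and this is not the test script's code; format 3 is left out because its exact layout is not spelled out here.

```python
def make_attention_mask(sequence_lengths, max_sequence_length, mask_format):
    """Build a mask for formats 1 and 2, assuming padding is on the right."""
    if mask_format == 1:
        # 1D mask index: actual sequence length of each batch entry.
        return list(sequence_lengths)
    if mask_format == 2:
        # 2D attention mask: 1 for real tokens, 0 for padding.
        return [
            [1 if position < length else 0 for position in range(max_sequence_length)]
            for length in sequence_lengths
        ]
    raise NotImplementedError("format 3 is not covered by this sketch")
```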
### Description
Update scripts for converting models with MultiHeadAttention to packing
mode.
- [x] Update symbolic shape inference for PackedMultiHeadAttention and
GatedRelativePositionBias
- [x] Update convert_to_packing_mode to handle models with
MultiHeadAttention
### Description
Update the op test schema.
This change fixes several problems with operator tests for web:
- `opsets` -> `opset`: an operator uses exactly one opset instead of
multiple
- `condition` -> `platformCondition`: makes it less confusing
- `inputShapeDefinitions`: allows testing ORT behavior when it gets
no/partial/full shape info.
Added a JSON schema file and also an example file.
### Description
Added a Gather op that works with both i32 and i64 indices, assuming that
the values fall within the i32 range. The assumption is safe because it's
not possible to allocate a buffer larger than 2 GB for inputs.
It treats all data from the input tensor as u32, copying 1 or 2 elements
per value (2 for i64, u64 and double).
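The "everything is u32" idea can be illustrated on the CPU with `struct`. This is only a sketch of the word-level copying, not the WebGPU shader; it assumes little-endian data, gathers along a flat axis 0, and all helper names are hypothetical.

```python
import struct


def to_u32_words(values, fmt):
    """Pack values little-endian with struct code fmt and view the bytes as u32 words."""
    raw = struct.pack(f"<{len(values)}{fmt}", *values)
    return list(struct.unpack(f"<{len(raw) // 4}I", raw))


def i64_indices_low_words(index_words):
    """Read each i64 index from its low u32 word, reinterpreted as a signed i32."""
    return [w - (1 << 32) if w >= (1 << 31) else w for w in index_words[0::2]]


def gather_words(data_words, words_per_element, indices):
    """Copy words_per_element u32 words for every gathered element."""
    out = []
    for index in indices:
        start = index * words_per_element
        out.extend(data_words[start:start + words_per_element])
    return out
```

An 8-byte element (i64, u64, double) occupies two words and a 4-byte element one, so the same copy loop serves every element type.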
---------
Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
### Description
If the Expand input has rank < 2, `inputIndicesHelper` and
`outputIndicesHelper` create indices as u32 instead of array<u32>, and
`calculateInputIndex` throws an error.
### Motivation and Context
I've encountered this error while making StableDiffusion work with JSEP
### Description
As title.
And manually validated it in the
https://github.com/fs-eire/ort-rn-hello-world test app with the
dev/updated version of onnxruntime-react-native package:
https://www.npmjs.com/package/onnxruntime-react-native/v/1.16.0-dev.20230712-a396a15fa6
### Motivation and Context
Resolve security warning issues. cc @skottmckay thanks author for the
changes.
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Three things are done in this PR.
- [2nd commit] More tensor element types are supported because in
distributed computation, we need to re-shard tensors of many different
types.
- [1st commit] We now specify the opset version in test models. Without
this change, those models will have opset=20 with the latest ONNX, which
results in test errors.
- [3rd commit] Tests are modified to test `AllGather` and `AllToAll` for
boolean tensors. Several graph patterns are tried for tests. We found
that `int64_tensor -> Cast -> bool_tensor -> AllToAll -> bool_tensor ->
Cast -> int64_tensor` always generates random results. My guess is that
`AllToAll` needs to synchronize all GPUs before calling `ncclSend` and
`ncclRecv`, since `AllGather` doesn't hit this problem. To reproduce
the error, search for `TODO` in this PR. Note that this PR doesn't fix
it.
### Description
Fixed the issue of finding nodes with empty name for vitis ai.
### Motivation and Context
It is required because we encountered this error when testing newly
created models.
### Description
Simplify Shrink.
Replace Eigen code with the one that does not require fp16 conversion in
Sign.
### Motivation and Context
argmax and argmin are similar to reduce. Eventually we need to add
optimized flavors of the shader.
softmax is optimized but only works on the last axis for now which
should be the common use case.
todo: enable more ut for argmax/argmin
### Description
Support more data types for vitis ai.
### Motivation and Context
It is required because the models we are testing now have the uint8 data
type. To solve this once and for all, we changed the code to support
generic data types.
### Description
Added Resize NHWC domain kernel registration.
### Motivation and Context
Last week I fixed error #16484, found when trying to build onnxruntime
with the icpx compiler. Another thing I found out is that icpx uses the
-ffast-math flag by default. You can check this by running the compiler
with the -v flag as follows:
```bash
# Setup the environment
. /opt/intel/oneapi/setvars.sh
# Compile any file to see all the implicit flags
icpx -v main.cpp
```
This leads to a bunch of warnings during the build like:
```bash
In file included from /mnt/f/wsl_home/onnxruntime/onnxruntime/test/providers/cpu/tensor/upsample_op_test.cc:5:
In file included from /mnt/f/wsl_home/onnxruntime/onnxruntime/test/providers/provider_test_utils.h:6:
In file included from /mnt/f/wsl_home/onnxruntime/onnxruntime/test/providers/checkers.h:10:
In file included from /mnt/f/wsl_home/onnxruntime/onnxruntime/core/util/math_cpuonly.h:68:
In file included from /mnt/f/wsl_home/onnxruntime/build/Linux/RelWithDebInfo/_deps/eigen-src/Eigen/Core:172:
/mnt/f/wsl_home/onnxruntime/build/Linux/RelWithDebInfo/_deps/eigen-src/Eigen/src/Core/MathFunctions.h:1019:12: warning: comparison with NaN always evaluates to false in fast floating point modes [-Wtautological-constant-compare]
return isnan EIGEN_NOT_A_MACRO (x);
^~~~~~~~~~~~~~~~~~~~~~~~~~~
```
And some tests are failing as well, usually with infinities involved. To
list a few:
```bash
# ...
1: [ FAILED ] IsInfTest.test_isinf_float
1: [ FAILED ] IsInfTest.test_isinf_double
1: [ FAILED ] IsInfTest.test_isinf_positive_float
1: [ FAILED ] IsInfTest.test_isinf_positive_double
1: [ FAILED ] IsInfTest.test_isinf_negative_float
1: [ FAILED ] IsInfTest.test_isinf_negative_double
1: [ FAILED ] IsNaNOpTest.IsNaNFloat
1: [ FAILED ] IsNaNOpTest.IsNaNDouble
# ...
```
This PR adds a quick global check for the IntelLLVM compiler (the name
CMake reports for it) and then, depending on the compiler driver, sets
either the MSVC-like or the GCC-like switch to disable fast-math.
A somewhat cleaner solution would probably be to use
```target_compile_options(${TARGET} PRIVATE MEOW)``` instead of a
global ```set(CMAKE_CXX_FLAGS MEOW)```, but then we would have to add it
to all the individual targets and execution providers, which would lead
to a lot of code duplication.
### Description
1. Rename OrtValue.FillStringTensorElement to StringTensorSetElementAt.
To the API user, I think we're conceptually setting the string at an
offset in the tensor, which is roughly equivalent to `List<string> list
... list[index] = "value"`.
2. While working on new inference examples, I noticed that I am still
inclined to use `DenseTensor` for N-D indexing. Added `GetStrides()` and
`GetIndex()` from strides for long dims, so the user can obtain strides
and translate N-D indices into a flat index to operate directly on the
native `OrtValue` buffers. Expose these functions to the user.
3. Make sure we generate docs for C# public static functions.
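The stride and index arithmetic behind `GetStrides()` and `GetIndex()` can be sketched as below for a row-major layout. This is a Python illustration of the math only; the function names and signatures here are not the C# API.

```python
def get_strides(dims):
    """Row-major strides: the last axis has stride 1."""
    strides = [1] * len(dims)
    for axis in range(len(dims) - 2, -1, -1):
        strides[axis] = strides[axis + 1] * dims[axis + 1]
    return strides


def get_index(strides, indices):
    """Translate N-D indices into a flat offset into the buffer."""
    return sum(stride * index for stride, index in zip(strides, indices))
```

For dims [2, 3, 4] the strides are [12, 4, 1], so the element at (1, 2, 3) sits at flat offset 12 + 8 + 3 = 23.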
In addition to `onnxruntime_test_all`, `onnxruntime_shared_lib_test`
and `onnxruntime_customopregistration_test` should
also add the "-Wno-deprecated-declarations" flag to ignore compiler warnings.
### Save optimized pre_grad graph once it's ready
`graph_builder.build()` does two things for training: 1. optimizes the
forward graph (the pre_grad graph optimization); 2. builds the gradient
graph.
Originally, the pre_grad graph was saved after `graph_builder.build()`
completed. If the pre_grad graph optimization completed but the gradient
graph build failed, we still could not get the pre_grad graph to
investigate.
This PR saves the pre_grad graph in the C++ backend as soon as it is
ready (if save_model is enabled).
BTW, reset the minimal supported opset to 1, because a minimal supported
opset of 7 ignores all ops whose latest since-version is less than 7,
e.g. GlobalLpPool, which only has two opset versions: 1 and 2.
A padding value in ONNX Pad can be negative, which indicates that pixels
should be removed. The WebNN EP cannot support such an operation, so it
needs to use slice to handle this case.
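One way to picture the decomposition: negative pads become a slice and the remaining non-negative pads stay a regular pad. The sketch below is an assumption about how such a split can be computed, not the WebNN EP code; the function name is hypothetical, and it slices the input first and pads afterwards.

```python
def split_negative_pads(shape, pads):
    """Split ONNX-style pads [begin_0..begin_n, end_0..end_n] into pad + slice.

    Returns (non_negative_pads, slice_starts, slice_ends); the slice bounds
    are expressed in the coordinates of the original input.
    """
    rank = len(shape)
    begins, ends = pads[:rank], pads[rank:]
    pad_begins = [max(p, 0) for p in begins]
    pad_ends = [max(p, 0) for p in ends]
    slice_starts = [max(-p, 0) for p in begins]
    slice_ends = [dim - max(-p, 0) for dim, p in zip(shape, ends)]
    return pad_begins + pad_ends, slice_starts, slice_ends
```

For a 1-D input of length 5 with pads [-1, 2], this yields a slice of [1, 5) followed by a pad of [0, 2], which removes the first element and appends two padded values.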
### Description
There are currently multiple failures blocking the CI pipelines, so
this PR contains all of the fixes needed to pass the CI.
Otherwise, a single fix would still fail the CI.
Includes:
#16960 #16958
Please help to make sure this PR gets merged once CI has passed.
@snnn @carzh @guschmue
Fixed:
[AB#18118](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/18118)
---------
Co-authored-by: Caroline Zhu <carolinezhu@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
### Description
Add lsb-release package to android custom build
### Motivation and Context
To fix a build issue:
/workspace/onnxruntime/tools/ci_build/github/linux/docker/inference/x64/python/cpu/scripts/install_protobuf.sh:
line 27: lsb_release: command not found
### Description
- enable unit test for js/common in CI
- add debug config in js/.vscode/launch.json
- enable source map for js/common/test for debugging purposes; add
source map files to ignore list
- ignore js/common/test folder for npm packaging