### Description
Check the bounds of node_get_inputs to avoid an out-of-bounds error.
### Motivation and Context
A model with a loop would encounter this error. Currently we do not
support custom ops for loops, so ideally it would throw an error and
fall back to CPU evaluation.
### Description
Slightly increases the allowable error tolerance for ReduceProd tests on
x64 Windows/Linux with the QNN CPU backend.
### Motivation and Context
A recent [PR](https://github.com/microsoft/onnxruntime/pull/16916)
updated the input range for ReduceProd tests, which uncovered an
inaccuracy for ReduceProd on x64 Windows/Linux with the QNN CPU backend.
This PR updates the allowable error tolerance and adds a TODO for
investigation.
This is needed to ensure the QNN_Nuget_Windows pipeline runs
successfully.
OpenVINO EP ORT 5.1 Branch
Changes for the new API to take in OpenVINO Provider Options
and compatibility with OV 2023.1
### Motivation and Context
The change is required for the new API to take in OpenVINO Provider
Options
and make it seamless.
---------
Signed-off-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: saurabhintel0 <saurabh1.kale@intel.com>
Co-authored-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Add a generic `UpdateCUDAProviderOptionsWithValue()` C API to update
CUDA EP provider options whose data type is a pointer and therefore
can't be represented by a string.
Note: Please see the comments on the similar
[PR](https://github.com/microsoft/onnxruntime/pull/16965) for the TRT EP.
### Use fully qualified name for PythonOp export
Originally, when there are identically named torch.autograd.Function
classes in different modules, for example:
`a.b.c.Gelu` vs. `d.e.func.<locals>.Gelu`
we throw an exception by default to let the user know that we cannot
distinguish the two Gelu classes, because we did not record the module
path during model export. The workaround was
`ORTMODULE_SKIPPED_AUTOGRAD_FUNCTIONS`, which ignores those
duplicate-named Gelu classes that are not used by the model run. This
has obvious limitations, for example when both Gelus are used in
training.
This PR finds a way to construct a fully qualified name.
`def _export_pt_1_10(g, n, *args, **kwargs):`
1. in exporter function, kwargs contains `name` and `module`, in the
above example:
`a.b.c.Gelu` --> name: `Gelu`, module: `a.b.c`
`d.e.func.<locals>.Gelu` --> name: `Gelu`, module: `d.e`
Using name and module is not enough to get a fully qualified name. In
the second case, `d.e` is the module path, which contains a function
called `func`, and inside that function there is a local
torch.autograd.Function named `Gelu` (many of our UTs look like this).
We can only get `d.e.Gelu`, which is not the correct fully qualified
name.
The reason: `kwargs[name]` or `n.name` only returns the class's name,
not the class's fully qualified name (note that `kwargs[module]` is
correct).
2. `n` is a torch.Node; we can access `pyobj` to get the
torch.autograd.Function's apply method instance, then use `._self` to
get the torch.autograd.Function class. From that we can get the
`module` and the class's fully qualified name; joined together, they
give the full name.
With the above change, we no longer need `kwargs[name]` and
`kwargs[module]`, and no longer need to check for naming conflicts or
the `ORTMODULE_SKIPPED_AUTOGRAD_FUNCTIONS` env var.
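The naming idea can be sketched in plain Python without torch: a class's `__module__` plus its `__qualname__` distinguishes a module-level class from one defined inside a function. The names `Gelu`, `func`, and `full_qualified_name` below are illustrative only, and the exporter's actual access path (`pyobj`, `._self`) is not reproduced here.

```python
class Gelu:
    """Stands in for a module-level torch.autograd.Function subclass."""


def func():
    class Gelu:
        """Stands in for a torch.autograd.Function defined inside a function."""

    return Gelu


def full_qualified_name(cls):
    # __name__ alone is "Gelu" for both classes; __qualname__ keeps the
    # enclosing scopes, e.g. "func.<locals>.Gelu" for the local class.
    return f"{cls.__module__}.{cls.__qualname__}"


local_gelu = func()
print(Gelu.__name__ == local_gelu.__name__)  # the short names collide
print(full_qualified_name(Gelu))
print(full_qualified_name(local_gelu))
```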
Fix an obvious bug:
(1) In packing mode, the input for SLN has two dimensions (introduced by
#15283): [token_count, hidden_size]. The current code, `element_count =
input_dims[0] * sequence_length * hidden_size`, yields element_count =
token_count * hidden_size * hidden_size, which causes an invalid memory
write in the CUDA kernel and crashes ORT.
and two minor issues:
(2) potential integer overflow in `static_cast<int>(element_count)`
(3) some dead code after `return LaunchSkipLayerNormKernel` that can
never run.
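A rough Python sketch of issues (1) and (2), not the actual CUDA EP code. It assumes the buggy path read `sequence_length` from `input_dims[1]`, which is what produces the `token_count * hidden_size * hidden_size` figure for a 2-D packed input; the function names are hypothetical.

```python
INT32_MAX = 2**31 - 1


def buggy_element_count(input_dims):
    # Assumed shape of the old logic: treats dims[1] as sequence_length
    # even when the input is the 2-D packed form [token_count, hidden_size].
    hidden_size = input_dims[-1]
    sequence_length = input_dims[1]
    return input_dims[0] * sequence_length * hidden_size


def element_count(input_dims):
    # Product of all dims works for both [batch, seq, hidden] and
    # [token_count, hidden].
    if len(input_dims) not in (2, 3):
        raise ValueError("expected a 2-D or 3-D input shape")
    count = 1
    for dim in input_dims:
        count *= dim
    # Guard the narrowing cast instead of silently overflowing.
    if count > INT32_MAX:
        raise OverflowError("element count does not fit in a 32-bit int")
    return count
```

For a packed input of shape `[10, 768]`, the sketch returns 7680 elements, while the assumed old logic would have reported 10 * 768 * 768.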
Maintaining one execution context per thread is suggested by the TRT
[doc](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#threading)
to avoid synchronization issues.
With the previous TRT EP, we did see synchronization issues when running
multithreaded on some models, for example, FasterRCNN.
This PR leverages the per-thread context implementation from the CUDA EP.
The modifications are as follows:
- Move CUDA graph and IExecutionContext objects to the per-thread context.
- Remove the lock_guard that was previously placed around the whole
compute_func() and put lock_guard in the blocks where multiple threads
may update kernel function state, access one builder,
create/serialize/save the engine, save the profile and serialize/save
the timing cache.
- On CentOS, don't unload the TRT EP shared library; leave it loaded so
that the destructor of thread-local data is still accessible when a
thread exits.
Note: Tested this PR with onnxruntime_perf_test and the overhead of
PerThreadContext is small.
### Description
Added two kernels, for layer norm and instance norm.
Also added maximum limits for `maxBufferSize` when requesting the GPU
device, since by default it is limited to 256 MB and allocating a 600 MB
buffer fails when running fp32 StableDiffusion weights.
### Motivation and Context
These two are used in StableDiffusion and many other networks.
### Description
Enhanced SkipLayerNorm by implementing broadcasting for both CPU and
CUDA.
### Motivation and Context
The input and skip tensors no longer have to be the same size: the skip
shape can match the input shape, be {1, sequence_length, hidden_size},
or be {sequence_length, hidden_size}.
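The accepted skip shapes can be written down as a small check. This is an illustrative sketch in plain Python with hypothetical helper names; it assumes a 3-D input of shape (batch, sequence_length, hidden_size) and does not mirror the kernel code.

```python
def is_valid_skip_shape(input_shape, skip_shape):
    batch, seq_len, hidden = input_shape
    allowed = {
        (batch, seq_len, hidden),  # same shape as the input
        (1, seq_len, hidden),      # broadcast over the batch dimension
        (seq_len, hidden),         # 2-D skip, also broadcast over batch
    }
    return tuple(skip_shape) in allowed


def skip_index(input_index, skip_shape):
    # Map an input element index (b, s, h) to the skip element it reads.
    b, s, h = input_index
    if len(skip_shape) == 2:
        return (s, h)
    if skip_shape[0] == 1:
        return (0, s, h)
    return (b, s, h)
```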
---------
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
### Fix a few bugs
1. Symbolic shape inference: add the missing None check before getting
the length.
2. Rename PythonOp/PythonOpGrad's attribute `name` to `func_name`;
otherwise, when we use onnx.helper.make_node to create a node, `name`
conflicts with the node name.
3. Filter shape inference warnings for PythonOp for torch 2.0 or newer.
4. Close file descriptors for log suppression. Without the fix, two extra
fds are left open after the log suppression context exits.
Before entering log suppression (left), before exiting log suppression (right)

With the fix, no fd is added after the context exits.
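For item 4, a minimal sketch of an fd-level suppression context that cleans up after itself. It is not the ORTModule implementation; `suppress_fd_output` is a hypothetical helper that redirects one descriptor to the null device and closes both temporary descriptors on exit.

```python
import os
from contextlib import contextmanager


@contextmanager
def suppress_fd_output(fd):
    saved_fd = os.dup(fd)                           # extra fd no. 1
    devnull_fd = os.open(os.devnull, os.O_WRONLY)   # extra fd no. 2
    try:
        os.dup2(devnull_fd, fd)  # writes to fd now go to the null device
        yield
    finally:
        os.dup2(saved_fd, fd)    # restore the original target
        # Closing both temporaries is what prevents the leak.
        os.close(saved_fd)
        os.close(devnull_fd)
```

Typical usage would be `with suppress_fd_output(2): ...` to silence stderr for the duration of the block.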

If users set `trt_profile_min_shapes`, `trt_profile_max_shapes` and
`trt_profile_opt_shapes`, they need to provide shape profiles for all
dynamic shape inputs.
When the main graph is partitioned into TRT/CUDA subgraphs and a
subgraph input also has a dynamic shape, users need to provide its shape
profiles as well. Users might not notice this, so the TRT EP now tells
them which input shape profiles need to be provided.
The new message is:
```
Traceback (most recent call last):
File "/home/azureuser/disk2/debug/optional_inputs.py", line 218, in <module>
test_optional_input_dynamic(trt_profile=True, optional=True)
File "/home/azureuser/disk2/debug/optional_inputs.py", line 195, in test_optional_input_dynamic
session = ort.InferenceSession(
File "/home/azureuser/anaconda3/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line
419, in __init__
self._create_inference_session(providers, provider_options, disabled_optimizers)
File "/home/azureuser/anaconda3/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line
471, in _create_inference_session
sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.EPFail: [ONNXRuntimeError] : 11 : EP_FAIL : User needs to provide all the
dynamic shape inputs with associated profiles if they want to explicitly set profiles through provider options.
Please note that main graph could be partitioned into TRT/CUDA/CPU subgraphs, in this case, user also needs to provide
shape profiles for the TRT subgraph's input if it's dynamic shape input.
Following input(s) has no associated shape profiles provided: x1
```
Please see this github issue:
https://github.com/microsoft/onnxruntime/issues/16600
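To make the requirement concrete, here is a hedged sketch of a pre-flight check a user could run on their own options. It assumes the profile options use the `name:dim1xdim2,...` string form; `parse_profile_shapes` and `missing_profile_inputs` are hypothetical helpers, not part of onnxruntime.

```python
PROFILE_KEYS = (
    "trt_profile_min_shapes",
    "trt_profile_max_shapes",
    "trt_profile_opt_shapes",
)


def parse_profile_shapes(option_value):
    # "x0:1x3x224x224,x1:1x8" -> {"x0": [1, 3, 224, 224], "x1": [1, 8]}
    shapes = {}
    for entry in option_value.split(","):
        if not entry:
            continue
        name, _, dims = entry.rpartition(":")
        shapes[name] = [int(d) for d in dims.split("x")]
    return shapes


def missing_profile_inputs(dynamic_inputs, provider_options):
    # An input is "missing" if any of the three options lacks a shape for it.
    missing = []
    for name in dynamic_inputs:
        for key in PROFILE_KEYS:
            if name not in parse_profile_shapes(provider_options.get(key, "")):
                missing.append(name)
                break
    return missing
```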
[DML] Model corrupted during layernorm fusion and DmlNonZeroOperator
crashes
Two issues are fixed in this PR:
1) Changes to layernorm fusion regressed DirectML. The fusion has been
disabled for DML to unblock models.
2) DmlNonZero needs to create an operator call that needs to know the
number of non-zero elements (size in bytes). Therefore this needs to be
allocated during compute, but it was being allocated during
initialization. This caused the output tensor size to mismatch the
operator's expectations.
---------
Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
### Description
1. Add valgrind to existing ep_perf CI MemTest and parse ORT-TRT memLeak
   details
   1. General Valgrind logs and logs related to ORT-TRT will be parsed in
      [CI artifacts](https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=334122&view=artifacts&pathAsName=false&type=publishedArtifacts)
2. Logic:
   1. Run valgrind with `onnxruntime-perf-test -e tensorrt` and export log
      to `valgrind.log`
   2. Identify if any `definitely lost` memleak happened
      1. For log paragraphs which show `definitely lost`, parse if they
         have keyword `TensorrtExecutionProvider`.
      2. If so, extract these details to `ort_trt_memleak_detail.log`, and
         return `build failure` to EP Perf CI
3. Fix existing addressSanitizer and sync the squeezenet testcase with
   latest update from
   [ort-inference-example](https://github.com/microsoft/onnxruntime-inference-examples/blob/main/c_cxx/squeezenet/main.cpp)
   1. Updates in short: Upgrade main.cpp to be using
      OrtTensorRTProviderOptionsV2
4. Reorder the 7-min-MemTest to be ahead of 9-hr-model-tests, and enable
   MemTest by default
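The parsing logic in step 2 could look roughly like the following. This is a sketch, not the CI script: it assumes valgrind's usual `==PID==` line prefix with blank prefixed lines between paragraphs, and `extract_ort_trt_leaks` is a hypothetical name.

```python
import re

# Matches the "==12345== " prefix valgrind puts on each log line.
PREFIX = re.compile(r"^==\d+==\s?")


def extract_ort_trt_leaks(log_text, keyword="TensorrtExecutionProvider"):
    paragraphs, current = [], []
    for raw in log_text.splitlines():
        line = PREFIX.sub("", raw).rstrip()
        if line:
            current.append(line)
        elif current:
            # A blank (prefix-only) line closes the current paragraph.
            paragraphs.append("\n".join(current))
            current = []
    if current:
        paragraphs.append("\n".join(current))
    # Keep only definite leaks whose stack mentions the TRT EP.
    return [p for p in paragraphs if "definitely lost" in p and keyword in p]
```

A caller would write the returned paragraphs to `ort_trt_memleak_detail.log` and fail the build when the list is non-empty.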
Add a generic `UpdateTensorRTProviderOptionsWithValue()` C API to update
TensorRT provider options whose data type is a pointer and therefore
can't be represented by a string.
Fixed ArgMin and ArgMax and refactored using functionality from Reduce
operator code.
### Description
Removed code/functionality duplication and fixed some issues.
### Description
- Improves how unit tests measure the accuracy of QDQ models on QNN EP.
- Adds tests for ops: Add, Mul, Abs<sup>1</sup>, And<sup>1</sup>,
Or<sup>1</sup>, Ceil<sup>1</sup>, Cos<sup>1</sup>
<sup>1</sup>: Not previously supported due to missing node unit
handling.
### Motivation and Context
The new approach for testing QDQ operator accuracy requires running 3
inferences:
1. float model on CPU EP (baseline)
2. qdq model on CPU EP
3. qdq model on QNN EP
The unit tests check that running the QDQ model on QNN EP (3) is at
least as accurate (+- small tolerance) as running the QDQ model on CPU
EP (2). We measure accuracy by comparing to the baseline (1).
This is essentially what we care about: is QNN EP as accurate as CPU EP?
If not, it is worth investigating as a potential bug.
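The comparison can be sketched as follows. The error metric (maximum absolute error) and the default tolerance are assumptions for illustration; the actual tests may measure accuracy differently, and both function names are hypothetical.

```python
def max_abs_error(expected, actual):
    return max(abs(e - a) for e, a in zip(expected, actual))


def qnn_at_least_as_accurate(float_cpu, qdq_cpu, qdq_qnn, tolerance=1e-4):
    # (1) float model on CPU EP is the baseline for both errors.
    cpu_err = max_abs_error(float_cpu, qdq_cpu)  # (2) QDQ model on CPU EP
    qnn_err = max_abs_error(float_cpu, qdq_qnn)  # (3) QDQ model on QNN EP
    return qnn_err <= cpu_err + tolerance
```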
### Description
Add an option to generate different formats of attention_mask for
testing transformers models:
1 - 1D mask index, actual sequence length excluding padding
2 - 2D attention mask. Value 0 means padding, 1 otherwise.
3 - 1D, key lengths and cumulated sequence lengths of query and key
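A sketch of how the three formats could be produced from per-sequence lengths. For format 3 it assumes self-attention (query and key lengths are equal) and a layout of key lengths followed by the cumulated query and key lengths; that layout and the function name are assumptions, not taken from the test script.

```python
from itertools import accumulate


def make_attention_mask(seq_lens, max_len, mask_format):
    if mask_format == 1:
        # 1D mask index: actual sequence length of each batch entry.
        return list(seq_lens)
    if mask_format == 2:
        # 2D mask: 1 for real tokens, 0 for padding.
        return [[1] * n + [0] * (max_len - n) for n in seq_lens]
    if mask_format == 3:
        # 1D: key lengths, then cumulated lengths of query and of key.
        cumulated = [0] + list(accumulate(seq_lens))
        return list(seq_lens) + cumulated + cumulated
    raise ValueError(f"unsupported mask format: {mask_format}")
```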
### Description
Update scripts for converting models with MultiHeadAttention to packing
mode.
- [x] Update symbolic shape inference for PackedMultiHeadAttention and
GatedRelativePositionBias
- [x] Update convert_to_packing_mode to handle models with
MultiHeadAttention
### Description
Added a Gather op that works with both i32 and i64 indices, assuming
that index values fall within the i32 limit. The assumption is safe
because it's not possible to allocate more than a 2 GB buffer for inputs.
It treats all data from the input tensor as u32, copying 1 or 2 elements
per value (2 for i64, u64 and double).
---------
Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
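The u32 view can be illustrated on the CPU with `struct`. This is a sketch of the idea only, not the shader: it handles a 1-D gather along axis 0, assumes little-endian data and non-negative indices, and `gather_as_u32` is a hypothetical name.

```python
import struct


def gather_as_u32(data_bytes, elem_size, index_bytes, index_is_i64):
    # 1 word per value for 32-bit types, 2 words for i64, u64 and double.
    words_per_elem = elem_size // 4
    data_words = struct.unpack(f"<{len(data_bytes) // 4}I", data_bytes)
    idx_words = struct.unpack(f"<{len(index_bytes) // 4}I", index_bytes)
    # For i64 indices keep only the low 32 bits of each pair of words.
    indices = idx_words[0::2] if index_is_i64 else idx_words
    out = []
    for i in indices:
        out.extend(data_words[i * words_per_elem:(i + 1) * words_per_elem])
    return struct.pack(f"<{len(out)}I", *out)
```

Because the data is only copied word by word, the same routine serves float, int32, int64 and double inputs without interpreting the values.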
Three things are done in this PR.
- [2nd commit] More tensor element types are supported because, in
distributed computation, we need to re-shard tensors of many different
types.
- [1st commit] We now specify the opset version in test models. Without
this change, those models will have opset=20 with the latest ONNX, which
results in test errors.
- [3rd commit] Tests are modified to test `AllGather` and `AllToAll` for
boolean tensors. Several graph patterns are tried in the tests. We found
that `int64_tensor -> Cast -> bool_tensor -> AllToAll -> bool_tensor ->
Cast -> int64_tensor` always generates random results. My guess is that
`AllToAll` needs to synchronize all GPUs before calling `ncclSend` and
`ncclRecv`, since `AllGather` doesn't hit this problem. To reproduce
the error, search for `TODO` in this PR. Note that this PR doesn't fix
it.
### Description
Fixed an issue with finding nodes that have an empty name for Vitis AI.
### Motivation and Context
It is required because we encountered this error when testing newly
created models.
### Description
Simplify Shrink.
In Sign, replace the Eigen code with an implementation that does not
require fp16 conversion.
### Motivation and Context
argmax and argmin are similar to reduce. Eventually we need to add
optimized flavors of the shader.
softmax is optimized but only works on the last axis for now which
should be the common use case.
todo: enable more ut for argmax/argmin
### Description
Support more data types for Vitis AI.
### Motivation and Context
It is required because the models we are testing now have the uint8 data
type. To solve this once and for all, we changed the code to support
generic data types.
### Description
Added Resize NHWC domain kernel registration.
### Motivation and Context
Also reset the minimum supported opset to 1, because a minimum supported
opset of 7 ignores every op whose latest since-version is below 7, e.g.
GlobalLpPool, which only has two opset versions: 1 and 2.
Pad values in ONNX Pad can be negative, which means removing pixels from
that edge. The WebNN EP cannot support such an operation directly, so it
uses slice to handle this case.
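One way to picture the split is shown below, as a sketch rather than the WebNN EP code. It assumes the ONNX `pads` layout of all begin values followed by all end values; `split_negative_pads` is a hypothetical helper that returns slice bounds for the negative entries and a non-negative pads list for the rest.

```python
def split_negative_pads(shape, pads):
    rank = len(shape)
    begins, ends = pads[:rank], pads[rank:]
    # Negative pads become a slice [start, stop) on each axis.
    starts = [max(-b, 0) for b in begins]
    stops = [dim - max(-e, 0) for dim, e in zip(shape, ends)]
    # What remains for the pad operation is clamped at zero.
    positive_pads = [max(p, 0) for p in pads]
    return starts, stops, positive_pads
```

For a `[4, 5]` input with `pads = [-1, 2, 0, -2]`, the sketch slices rows `1:4` and columns `0:3`, then pads with `[0, 2, 0, 0]`.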
Being able to leverage I/O binding for DML and registering `If` for the
DML EP allows us to avoid copying the past/present key/values back and
forth between the CPU and the GPU after every token.
This gives us a 25% performance increase for Dolly V2 with 128 tokens on
an RTX 4090.
### Description
This PR adds support for saving model optimizations after loading a
model that contains external data into an `InferenceSession`.
### Motivation and Context
This PR is a follow-up to a [previous
PR](https://github.com/microsoft/onnxruntime/pull/16716) for saving a
model optimized by an `InferenceSession`.
### Description
Implemented Resize operator support in JSEP
Improve graph transformer DoubleQDQPairsRemover
### Description
Improve DoubleQDQPairsRemover so that it does not reset the scale & zero
point if the existing values are the same on the target DQ & Q nodes.
### Motivation and Context
Fix a bug where DoubleQDQPairsRemover resets the scale value while
removing unnecessary DQ & Q nodes.
### Description
Added Gelu operator to JSEP
### Description
The Python Package Pipeline failed because an exception is raised in
test_smooth_quant (from #16288):
```
File "/home/cloudtest/.local/lib/python3.8/site-packages/onnxruntime/quantization/quantize.py", line 384, in quantize_static
importlib.import_module("neural_compressor.adaptor.ox_utils.smooth_quant")
File "/home/cloudtest/.local/lib/python3.8/site-packages/neural_compressor/__init__.py", line 24, in <module>
from .contrib import *
File "/home/cloudtest/.local/lib/python3.8/site-packages/neural_compressor/contrib/__init__.py", line 19, in <module>
from .strategy import *
File "/home/cloudtest/.local/lib/python3.8/site-packages/neural_compressor/contrib/strategy/__init__.py", line 26, in <module>
__import__(basename(f)[:-3], globals(), locals(), level=1)
File "/home/cloudtest/.local/lib/python3.8/site-packages/neural_compressor/contrib/strategy/sigopt.py", line 22, in <module>
from neural_compressor.strategy.strategy import strategy_registry, TuneStrategy
File "/home/cloudtest/.local/lib/python3.8/site-packages/neural_compressor/strategy/__init__.py", line 20, in <module>
from .strategy import STRATEGIES
File "/home/cloudtest/.local/lib/python3.8/site-packages/neural_compressor/strategy/strategy.py", line 41, in <module>
from ..algorithm import AlgorithmScheduler, ALGORITHMS
File "/home/cloudtest/.local/lib/python3.8/site-packages/neural_compressor/algorithm/__init__.py", line 20, in <module>
from .algorithm import ALGORITHMS, Algorithm, AlgorithmScheduler, algorithm_registry
File "/home/cloudtest/.local/lib/python3.8/site-packages/neural_compressor/algorithm/algorithm.py", line 21, in <module>
from neural_compressor.utils.create_obj_from_config import get_algorithm
File "/home/cloudtest/.local/lib/python3.8/site-packages/neural_compressor/utils/create_obj_from_config.py", line 20, in <module>
from neural_compressor.metric import METRICS
File "/home/cloudtest/.local/lib/python3.8/site-packages/neural_compressor/metric/__init__.py", line 30, in <module>
__import__(basename(f)[:-3], globals(), locals(), level=1)
File "/home/cloudtest/.local/lib/python3.8/site-packages/neural_compressor/metric/coco_tools.py", line 54, in <module>
from pycocotools import coco
File "/usr/local/lib/python3.8/dist-packages/pycocotools/coco.py", line 52, in <module>
from . import mask as maskUtils
File "/usr/local/lib/python3.8/dist-packages/pycocotools/mask.py", line 3, in <module>
import pycocotools._mask as _mask
File "pycocotools/_mask.pyx", line 1, in init pycocotools._mask
ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
```
The cause is that the pycocotools package uses "oldest-supported-numpy",
which might cause an older numpy version to be used when building
pycocotools:
9e9164f979/PythonAPI/pyproject.toml (L4)
Related issue: https://github.com/cocodataset/cocoapi/issues/248