### Description
This change upgrades emsdk to 3.1.44.
Because the backend is upgraded to LLVM 16, we need to fix a lot of
build failures caused by "-Wshorten-64-to-32".
Most of the build failures come from the generated `onnx.pb.h`; they can
be fixed by including "core/graph/onnx_protobuf.h", which detects
and ignores shorten-64-to-32 warnings.
### Description
1. Update model_tests.cc: avoid automatically adding new tests from new opsets.
2. Simplify the "ConcatPathComponent" function. It does not need to be a
template.
### Motivation and Context
All our Windows/Linux CI build machines are preloaded with some test
data. In model_tests.cc, we automatically add all of it to
onnxruntime_test_all.exe's unit tests. However, this causes problems
when we update the CI build machine images: new data could suddenly make
pipelines fail.
Therefore, instead of auto-discovering the test data and adding all of
it to tests, this PR changes the code to explicitly specify the opset
names.
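Conceptually, the switch looks something like this Python sketch (the
real change is in the C++ model_tests.cc; the names and opset list here
are hypothetical):
```python
from pathlib import Path

# Before: every "opset*" test-data directory found on the machine became a
# test, so a VM image update could silently introduce new (failing) tests.
def discover_opsets(test_data_root: str) -> list[str]:
    return sorted(p.name for p in Path(test_data_root).iterdir()
                  if p.is_dir() and p.name.startswith("opset"))

# After: an explicit allowlist; new opsets only run once we opt in.
ENABLED_OPSETS = ["opset7", "opset8", "opset9", "opset10"]  # hypothetical
```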
This change doesn't impact how the Web CI pipeline runs its tests.
Going forward, the workflow will be:
Step 1: Update the onnx version in deps.txt.
Step 2: Update js/scripts/prepare-onnx-node-tests.ts, as in #16943. It
is better to put steps 1 and 2 in the same PR.
Step 3: The onnxruntime-es team regenerates the VM images, tests them,
and deploys them.
Step 4: Enable the new opset test data for EPs.
[AB#18340](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/18340)
### Description
<!-- Describe your changes. -->
Return an empty ComputeCapability when a graph contains a nested subgraph.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
For now, our architecture does not support nested subgraphs, so we
return an empty ComputeCapability in this case.
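For illustration, a minimal Python sketch of what "nested subgraph"
means here, using the onnx API (the helper is hypothetical; the actual
check lives in the EP's C++ GetCapability):
```python
import onnx

def has_nested_subgraph(graph: onnx.GraphProto, depth: int = 0) -> bool:
    """Return True if a subgraph (e.g. the body of If/Loop/Scan) itself
    contains another subgraph."""
    for node in graph.node:
        for attr in node.attribute:
            if attr.type == onnx.AttributeProto.GRAPH:
                if depth >= 1 or has_nested_subgraph(attr.g, depth + 1):
                    return True
    return False
```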
### Description
<!-- Describe your changes. -->
A node arg can be matched multiple times.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Previously, we thought a node's name must be unique and could thus be
used as an identifier. However, we recently found that a node's name can
be empty, in which case we fail to identify which node is which. So we
use node args to differentiate nodes; to do so, we need to allow a node
arg to be matched more than once.
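A minimal Python sketch of the idea (the helper name is hypothetical;
the actual matcher is C++):
```python
import onnx

def index_nodes_by_output_arg(graph: onnx.GraphProto) -> dict:
    """Key nodes by their output node args instead of by node name.
    Output names are unique in a valid ONNX graph even when node names
    are empty, but a single node arg may now be matched more than once
    while walking a pattern, so the matcher must allow repeated hits."""
    by_output = {}
    for node in graph.node:
        for output_name in node.output:
            by_output[output_name] = node
    return by_output
```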
### Description
Fix some Resize failing tests.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
With the PaddingElimination optimizer, input1 of an element-wise op may
be flattened like:
```
input1 (shape:[batch_size, seq_len, ...]) input1 (shape:[valid_tokens, ...])
\ \
\ input2 \ input2
\ / -----> \ /
\ / \ /
Element-wise Op Element-wise Op
```
So the shape of input2 should be processed accordingly:
1. If input2.shape.dim_size <= input1.shape.dim_size - 2, i.e. input2
has no [batch_size, seq_len] at the beginning,
we don't need to process the shape of input2 because it is already
compatible with the flattened shape of input1 (shape: [valid_tokens, ...]).
2. If the shape of input2 has the same dim_size as the shape of input1
and has [batch_size, seq_len] at the beginning,
then to be compatible with the flattened shape of input1 we need to
insert the flatten pattern for input2 as well,
which flattens the shape of input2 from [batch_size, seq_len, ...] to
[valid_tokens, ...].
3. (Done in this PR.) In other cases, such as input2 shapes like [1,
seq_len, ...] or [batch_size, 1, ...], we first need to expand the shape
to [batch_size, seq_len, ...], which is then convenient to flatten, and
then insert the flatten pattern; see the sketch below.
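A Python sketch of the case analysis above (names and return values are
purely illustrative):
```python
def input2_handling(input1_shape, input2_shape):
    """Return which of the three cases above applies to input2 when input1
    is flattened from [batch_size, seq_len, ...] to [valid_tokens, ...]."""
    if len(input2_shape) <= len(input1_shape) - 2:
        return "case 1: leave input2 as-is"
    if (len(input2_shape) == len(input1_shape)
            and input2_shape[:2] == input1_shape[:2]):
        return "case 2: insert the same flatten pattern for input2"
    return "case 3 (this PR): expand to [batch_size, seq_len, ...], then flatten"

# e.g. input1 = [batch, seq, hidden], input2 = [1, seq, hidden] -> case 3
print(input2_handling(["batch", "seq", 768], [1, "seq", 768]))
```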
### Check type for building gradient graph
**Bug1**:
To fix the error when running the model with ORTModule + Stage 3:
```
Exception happens when running <bound method Function.apply of <class 'onnxruntime.training.utils.hooks._zero_offload_subscriber.ORTZeROOffloadPreForwardFunction'>>
Traceback (most recent call last):
File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_custom_autograd_function_runner.py", line 207, in call_python_forward_function
wrapped_arg.requires_grad = is_training_mode and grad_flag
RuntimeError: only Tensors of floating point and complex dtype can require gradients
```
This is because, when running PythonOpA, the 3rd input is int64; during
the check in the gradient builder we decide it requires a gradient, so
we set its requires_grad = True, but PyTorch considers that invalid and
throws the exception. So we need to understand why the ORT gradient
builder thinks the 3rd input needs a gradient.
`ReverseBFSWithStopGradient` does a reverse BFS from the graph outputs,
collecting all nodes that are needed to compute them. It defines a
queue, initially adds all nodes that generate graph outputs, then
iterates over the nodes one by one, checking each node's inputs: if an
input does not hit a stop edge and its node arg type is an allowed type
(float, etc.), the input's producer node is appended to the queue for
the next iteration of work.
PythonOpA is such a node that is needed to compute the graph outputs, so
IsReachable(PythonOpA) returns True.

In the above code snippet, when node is PythonOpB and next_node is
PythonOpA, we did not check the node arg type on the connection from
PythonOpA's 3rd input to PythonOpB's outputs. So we appended the
int64-typed node arg to the set of args that require gradients.
**Fix1**: add a node arg type check before appending it to the
requires-grad list.
After the fix, a unit test failed:
"orttraining_test_ortmodule_api.py::test_gradient_correctness_minmax[data_type0-True-0-min]
Fatal Python error: Segmentation fault". After investigation, this
turned out to be another bug.
**Bug2**:
Without Fix1, the execution graph looks like this:

As you can see, the int64 output has a gradient edge built even though
it is not used by any consumer, and execution runs well. But on second
thought, an int type should not have a gradient edge built at all.
With Fix1, the execution graph looks like this:

Now the int-typed node arg has no gradient edge built; **Fix1** fixes
this problem.
But another bug happens when the initial "y_node_arg_names" have mixed
types, e.g. in this case ATen's two outputs: the 1st is float, the 2nd
is int. When we check the y_node
(6e6f582e08/orttraining/orttraining/core/framework/gradient_graph_builder.cc (L60C16-L60C16)),
we did not check the data type before adding it into `y_node_args_`, the
list of graph output node args that require gradients. As a result,
`non_differentiable_y_node_arg_names_` did not contain the int-typed
graph output.
Then
6e6f582e08/orttraining/orttraining/core/framework/ortmodule_graph_builder.cc (L312C18-L312C18)
tries to put the grad node arg into `yield_output_node_args`, BUT with
**Fix1** the grad node arg is never built for the int-typed node arg. So
we insert a nullptr, and later when we use it, we get a segmentation
fault.
**Fix2**: again, we add the type check when handling y_node_args, and
also add a null check when getting the gradient node arg and appending
it into yield_output_node_args.
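A Python sketch of both fixes (the real code is C++ in
gradient_graph_builder.cc and ortmodule_graph_builder.cc; the types and
helpers here are illustrative):
```python
from dataclasses import dataclass, field

GRAD_ALLOWED_TYPES = {"float", "float16", "double", "bfloat16"}

@dataclass
class Node:
    name: str
    inputs: list = field(default_factory=list)  # node arg names

def reverse_bfs_with_stop_gradient(output_nodes, producer_of, dtype_of, stop_edges):
    """Collect node args needing gradients by walking backwards from the
    graph outputs, skipping non-floating-point args (Fix1)."""
    requires_grad, queue = set(), list(output_nodes)
    while queue:
        node = queue.pop()
        for arg in node.inputs:
            if (node.name, arg) in stop_edges or arg in requires_grad:
                continue
            if dtype_of(arg) not in GRAD_ALLOWED_TYPES:  # Fix1: type check
                continue
            requires_grad.add(arg)
            if arg in producer_of:
                queue.append(producer_of[arg])
    return requires_grad

def collect_yield_output_args(y_node_args, grad_arg_of):
    """Fix2: skip graph outputs (e.g. int-typed ones) whose gradient node
    arg was never built, instead of inserting a nullptr."""
    return [grad_arg_of(a) for a in y_node_args if grad_arg_of(a) is not None]
```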
### Description
<!-- Describe your changes. -->
Remove unused code.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Remove unused code.
This is in preparation for planned ROCm 6.0 changes that are not
backward compatible. However, the adjustments made by this PR to the
current onnxruntime cmake files will work with ROCm 5.x and 6.x.
The approach is the following:
1. Build partitions
2. Try compiling each partition into an `IDMLCompiledOperator`
3. If the compiled operator's persistent resource is bigger than 4GB,
tell the partitioner to split the partition in the middle and try again.
4. Once all partitions have been successfully compiled into an
`IDMLCompiledOperator`, fuse the partitions into an ORT operator and
register them all.
This change is relatively simple (essentially a retry mechanism), but
it required a lot of refactoring just to make sure that we don't modify
the graph until **all** partitions have been compiled successfully,
because partially modifying the graph before making sure that all
partitions can be compiled would break future retries.
This path is not expected to be hit often, and even then the loop will
rarely iterate more than twice. This is a very specific edge case for
large models that manage to merge a large number of nodes into a single
partition.
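A Python sketch of the retry loop, with hypothetical helpers standing in
for the DML partitioner and compiler:
```python
MAX_PERSISTENT_RESOURCE = 4 * 1024**3  # 4 GB

def compile_all_partitions(partitions, compile_partition, split_in_middle):
    """Retry loop: split any partition whose compiled persistent resource
    exceeds the limit; mutate the graph only after everything compiles."""
    pending, compiled = list(partitions), []
    while pending:
        part = pending.pop()
        op = compile_partition(part)  # -> object with .persistent_resource_size
        if op.persistent_resource_size > MAX_PERSISTENT_RESOURCE:
            pending.extend(split_in_middle(part))  # two halves, retried next
        else:
            compiled.append((part, op))
    # Only now is it safe to fuse the partitions into ORT operators.
    return compiled
```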
### Description
<!-- Describe your changes. -->
Check the bounds of node_get_inputs to avoid out-of-bounds errors.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Models with Loop would encounter this error. Currently we do not support
custom ops for Loop. So, ideally it should throw an error and fall back
to CPU evaluation.
### Description
onnxjs contains a `Resize` op input check which has been outdated since
opset 9. Currently `Resize` supports up to 4 inputs. This PR loosens the
input check.
### Motivation and Context
Fixes #15636
### Description
1. Add "--windows_sdk_version" argument to build.py
2. Fix Windows Static Analysis build pipeline. It is failing because it
picks up a different Windows SDK version after a build machine image
update. If we can explicitly specify Windows SDK version, we can avoid
such things happening again.
3. Remove --enable_training from Windows Static Analysis build pipeline
because PR #16993 makes it incompatible with "no_rtti".
AB#18315
### Description
Slightly increases the allowable error tolerance for ReduceProd tests on
x64 Windows/Linux with the QNN CPU backend.
### Motivation and Context
A recent [PR](https://github.com/microsoft/onnxruntime/pull/16916)
updated the input range for ReduceProd tests, which uncovered an
inaccuracy for ReduceProd on x64 Windows/Linux with the QNN CPU backend.
This PR updates the allowable error tolerance and adds a TODO for
investigation.
This is needed to ensure the QNN_Nuget_Windows pipeline runs
successfully.
OpenVINO EP ORT 5.1 branch.
Changes for the new API to take in OpenVINO provider options,
and for compatibility with OV 2023.1.
### Motivation and Context
The change is required for the new API to take in OpenVINO Provider
Options
and make it seamless.
---------
Signed-off-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: saurabhintel0 <saurabh1.kale@intel.com>
Co-authored-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
### Description
Reduces precision on the CoreML provider test as it returns slightly
different answers than the other tested providers. Checked on a 2020 13"
M1 MBP.
### Motivation and Context
Fixes Java CoreML test failure after #16763.
Add a generic `UpdateCUDAProviderOptionsWithValue()` C API to update
CUDA EP provider options whose data type is a pointer and therefore
can't be represented by a string.
Note: Please see the comments on the similar
[PR](https://github.com/microsoft/onnxruntime/pull/16965) for TRT EP.
### Use full qualified name for PythonOp export
Originally, when there were identically named torch.autograd.Functions
in different modules, for example
`a.b.c.Gelu` vs. `d.e.func.<locals>.Gelu`,
we by default threw an exception to make the user aware that we cannot
distinguish the two Gelus, because during model export we did not record
the module path. The workaround was introducing
`ORTMODULE_SKIPPED_AUTOGRAD_FUNCTIONS` to ignore a duplicate-named Gelu
that is not used by the model run. This obviously has limitations, for
example when both Gelus are used in training.
This PR finds a way to construct a fully qualified name.
`def _export_pt_1_10(g, n, *args, **kwargs):`
1. In the exporter function, `kwargs` contains `name` and `module`; in
the above example:
`a.b.c.Gelu` --> name: `Gelu`, module: `a.b.c`
`d.e.func.<locals>.Gelu` --> name: `Gelu`, module: `d.e`
Using name and module is not enough to get a fully qualified name. In
the second case, `d.e` is the module path, inside which there is a
function called `func`, and inside that function there is a local
torch.autograd.Function named `Gelu` (many of our UTs look like this).
We can only get `d.e.Gelu`, which is not the correct fully qualified
name. The reason: `kwargs["name"]` (or `n.name`) only returns the
class's name, not the class's fully qualified name (note that
`kwargs["module"]` is correct).
2. `n` is a torch.Node; we can access `pyobj` to get the
torch.autograd.Function's apply method instance, then use `__self__` to
get the torch.autograd.Function class. From it we can get the module and
the class's fully qualified name; joined together, they give the full
qualified name.
With the above change, we no longer need `kwargs["name"]` and
`kwargs["module"]`, and no longer need the naming-conflict check or the
`ORTMODULE_SKIPPED_AUTOGRAD_FUNCTIONS` env var.
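A minimal sketch of step 2, assuming the `pyobj`/`__self__` access
described above:
```python
def get_fully_qualified_name(n) -> str:
    """n is the torch.Node being exported. n.pyobj() is the bound
    torch.autograd.Function.apply; __self__ recovers the Function class."""
    func_class = n.pyobj().__self__
    # __qualname__ preserves local scopes, e.g. "func.<locals>.Gelu",
    # so module + qualname distinguishes the two Gelus above.
    return f"{func_class.__module__}.{func_class.__qualname__}"
```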
Fix an obvious bug:
(1) In packing mode, the input for SkipLayerNorm has two dimensions
(introduced by #15283): [token_count, hidden_size]. The current code
`element_count = input_dims[0] * sequence_length * hidden_size` will use
element_count = token_count * hidden_size * hidden_size, which causes an
invalid memory write in the CUDA kernel and an ORT crash.
And two minor issues:
(2) potential integer overflow in `static_cast<int>(element_count)`;
(3) some dead code after `return LaunchSkipLayerNormKernel` that will
never have a chance to run.
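A quick back-of-the-envelope in Python, with hypothetical sizes, to show
the scale of the overrun in (1):
```python
# Hypothetical packing-mode sizes, just to show the scale of the overrun.
token_count, hidden_size = 1024, 768
correct_count = token_count * hidden_size              # elements actually allocated
buggy_count = token_count * hidden_size * hidden_size  # what the old code computed
print(buggy_count // correct_count)  # 768x too many elements -> out-of-bounds writes
```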
Maintaining one execution context per thread is suggested by the TRT
[doc](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#threading)
to avoid synchronization issues.
With the previous TRT EP, we did see synchronization issues when running
multithreaded inference on some models, for example FasterRCNN.
This PR leverages the per-thread-context implementation from the CUDA
EP. The modifications are the following:
- Move the CUDA graph and IExecutionContext objects to the per-thread
context.
- Remove the lock_guard that previously covered the whole compute_func()
and put lock_guards in the blocks where multiple threads may update
kernel function state, access one builder, create/serialize/save the
engine, save the profile, and serialize/save the timing cache.
- On CentOS, don't unload the TRT EP shared library but leave it around,
so that the destructor of thread-local data is still accessible when
threads exit.
Note: Tested this PR with onnxruntime_perf_test; the overhead of
PerThreadContext is small.
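The per-thread-context pattern, sketched generically in Python with
threading.local (the real change is C++ in the TRT EP;
`create_execution_context` is a stand-in):
```python
import threading

_tls = threading.local()

def get_execution_context(engine, create_execution_context):
    """Lazily create one execution context per thread and reuse it, so the
    compute path itself needs no lock around enqueue/execute."""
    if not hasattr(_tls, "context"):
        _tls.context = create_execution_context(engine)
    return _tls.context
```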
### Description
Enable WebGPU in browser unit tests.
The CI pipeline uses Edge v113+, which enables WebGPU.
===
**UPDATE on 08/07/2023:**
- Add flags to the Edge browser launch command line so that Edge on CI
agents can initialize WebGPU correctly.
- ONLY enable WebGPU in the web release build. Other pipelines use the
flag `-b=wasm,webgl,xnnpack` to specify the other 3 backends explicitly.
- Disable the failing "Resize"-related tests. Once they are fixed, the
tests can be re-enabled.
---------
Co-authored-by: Satya Jandhyala <satya.k.jandhyala@gmail.com>
### Description
Added two kernels, for LayerNorm and InstanceNorm.
Also raised the `maxBufferSize` limit when requesting the GPU device:
by default it is limited to 256 MB, which fails to allocate the 600 MB
buffer needed when running fp32 StableDiffusion weights.
### Motivation and Context
These two are used in StableDiffusion and many other networks.
Add a script to get iOS simulator device info so we don't need to use
hardcoded specifiers which may or may not refer to a valid simulator
device.
Add the use-xcode-version step to a packaging pipeline so it uses a
consistent version of Xcode.
Update the usage of torch.onnx.OnnxRegistry, as it's officially
published in PyTorch: https://github.com/pytorch/pytorch/pull/106140.
---------
Co-authored-by: Wei-Sheng Chin <wechi@microsoft.com>
### Description
Enhanced SkipLayerNorm by implementing broadcasting for both CPU and
CUDA
### Motivation and Context
The input and skip tensors no longer have to be the same size. The skip
shape can now match the input shape, be {1, sequence_length,
hidden_size}, or be {sequence_length, hidden_size}.
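For example, the accepted skip shapes broadcast against the input like
this (NumPy sketch with illustrative sizes):
```python
import numpy as np

batch, seq_len, hidden = 2, 8, 4
x = np.random.rand(batch, seq_len, hidden).astype(np.float32)

for skip_shape in [(batch, seq_len, hidden), (1, seq_len, hidden), (seq_len, hidden)]:
    skip = np.random.rand(*skip_shape).astype(np.float32)
    assert (x + skip).shape == x.shape  # skip broadcasts across the batch
```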
---------
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
### Fix a few bugs
1. Symbolic shape inference: there is no None check before getting the
length.
2. Rename PythonOp/PythonOpGrad's attribute `name` to `func_name`;
otherwise, when we use onnx.helper.make_node to create a node, `name`
conflicts with the node name.
3. Filter shape inference warnings for PythonOp for torch 2.0 or newer.
4. Close file descriptors for log suppression. Without the fix, two
extra fds are left open after the log suppression exits its context.
Before entering log suppression (left), before exiting log suppression
(right):

With the fix, no fds are left open after the context exits:

If users set `trt_profile_min_shapes`, `trt_profile_max_shapes` and
`trt_profile_opt_shapes`, they need to provide all the dynamic-shape
inputs with associated shape profiles.
When the main graph is partitioned into TRT/CUDA subgraphs, if an input
of a TRT subgraph also has a dynamic shape, users need to provide its
shape profiles as well. Users might not notice this, so TRT EP now tells
them which input shape profiles need to be provided.
The new error message is:
```
Traceback (most recent call last):
File "/home/azureuser/disk2/debug/optional_inputs.py", line 218, in <module>
test_optional_input_dynamic(trt_profile=True, optional=True)
File "/home/azureuser/disk2/debug/optional_inputs.py", line 195, in test_optional_input_dynamic
session = ort.InferenceSession(
File "/home/azureuser/anaconda3/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line
419, in __init__
self._create_inference_session(providers, provider_options, disabled_optimizers)
File "/home/azureuser/anaconda3/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line
471, in _create_inference_session
sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.EPFail: [ONNXRuntimeError] : 11 : EP_FAIL : User needs to provide all the
dynamic shape inputs with associated profiles if they want to explicitly set profiles through provider options.
Please note that main graph could be partitioned into TRT/CUDA/CPU subgraphs, in this case, user also needs to provide
shape profiles for the TRT subgraph's input if it's dynamic shape input.
Following input(s) has no associated shape profiles provided: x1
```
Please see this github issue:
https://github.com/microsoft/onnxruntime/issues/16600
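For reference, the profiles are supplied through TRT EP provider options
like the following (Python; the input name and dims are just examples):
```python
import onnxruntime as ort

trt_options = {
    # Every dynamic-shape input needs an entry, including inputs of TRT
    # subgraphs created by partitioning (e.g. "x1" from the error above).
    "trt_profile_min_shapes": "x1:1x3x224x224",
    "trt_profile_max_shapes": "x1:32x3x224x224",
    "trt_profile_opt_shapes": "x1:8x3x224x224",
}
session = ort.InferenceSession(
    "model.onnx",
    providers=[("TensorrtExecutionProvider", trt_options), "CUDAExecutionProvider"],
)
```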
### Description
<!-- Describe your changes. -->
Adjust the natvis file to display tagged strings.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Hard to debug without seeing names.
[DML] Model corruption during LayerNorm fusion, and DmlNonZeroOperator
crashes
Two issues are fixed in this PR:
1) Changes to LayerNorm fusion regressed DirectML. The fusion has been
disabled for DML to unblock models.
2) DmlNonZero needs to create an operator call that needs to know the
number of non-zero elements (size in bytes). Therefore this needs to be
allocated during compute, but it was being allocated during
initialization. This causes the output tensor size to mismatch the
operator's expectations.
---------
Co-authored-by: Sheil Kumar <sheilk@microsoft.com>