### Description
1. `onnxruntime_fetchcontent_makeavailable` works around unconditional
install commands so that can be used instead of `FetchContent_Populate`
2. This dependency is Windows specific, mark it as such.
### Motivation and Context
1. This simplifies `cmake/external/wil.cmake` not to do anything
specific wether WIL was fetched or found
2. Given it's specific to Windows, it might not be available on other OS
in specific air-gapped environment such as
[conan-center-index](https://github.com/conan-io/conan-center-index).
This allows downstream builds not to require specific patches for
something not required by the build in the first place.
The notebooks are not up to update.
(1) Update BERT and GPT-2 optimization notebooks for CPU EP with latest
PyTorch and ONNX Runtime.
(2) Add links to quantization example
### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/16515
### PythonOp Enhancement: Bool and Tuple[Bool] Constants, Materialize
Grads, Empty Inputs, Save In Context
1. Support `bool` or `Tuple[bool]` constant type in inputs.
2. Support `ctx.set_materialize_grads(True|False)`
3. Backward op can accept empty input (that don't require grad)
4. Special handling for ORT tensors are saved in context
**Scenario**: a tensor is generated by ORT, then it might be saved for
backward by `ctx.save_for_backward(tensor)`, while `tensor`'s reference
count is not increased in ORT's allocation plan, so it is possible ORT
release the tensor data, before backward usage.
**Currently**: we copy every tensor before running
autograd.Function.forward(), this might be a problem for cases there are
many PythonOp (for example zero stage 3).
**Proposal**: To avoid those unnecessary copies for tensors that are not
saved in context, this change introduced a `_GlobalOpKernelInfoMap`.
During the kernel first run, we will anyway copy all tensors generated
from ORT, and give it to torch.autograd.Function for run, then we check
whether the inputs needs to be saved in context, and save the input
index that needs saving in `_GlobalOpKernelInfoMap`. Then for later
iterations, we just copy what is needed.
### Description
- Disables Resize tests that use nearest mode on QNN CPU.
- Fixes indentation problems on yaml for win x64 qnn pipeline.
### Motivation and Context
The QNN windows Nuget pipeline does not run due to failing unit tests on
Windows x64. These tests should not be enabled until we determine the
rounding behavior of QNN's ResizeNearestNeighbor operator.
On Windows, clang-format has a bug when AlignTrailingComments.Kind is
set to `Leave`
(https://clang.llvm.org/docs/ClangFormatStyleOptions.html#aligntrailingcomments),
where it will keep adding indentation to comments after each formatting
runs.
This PR changes to always align comments so we do not hit the bug.
As a consequence of the options change we need to reformat some of the
files. Note that this option is aligned with the rest of the repository.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
ROCm python package pipeline failed because this
PR(https://github.com/microsoft/onnxruntime/pull/16325) changed onnx
version to a commit and we need to build onnx from source. Low protobuf
version will cause build errors.
This PR remove `cmake ` and `protobuf ` from Dockerfile, these two will
install by `install_os_deps.sh`.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Remove the onnxruntime-extensions submodule since it now was used via
cmake FetchContent
### Motivation and Context
The submodule relies on an outdated version of the extensions, and the
build instructions should be updated to eliminate any confusion.
### Description
The correct name should be onnxruntimecpubuildpython
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
### Allow defining customized PythonOp shape inferer
For `torch.autograd.Function`, we converted it to PythonOp in MSDomain,
there are two places to do shape inferencing for it:
1. in SymbolicShapeInfer, there is one.
2. in PythonOp op definition.
For common PythonOp, since we don't know the relation ship between
inputs and outputs, so we only infer the rank from output ranks, and
generate symbolic dimensions for each dim. While this will introduce
many meaningless symbolic dimensions, sometimes blocking our graph
transformers to do op fusion.
This PR provide a way to define custom shape inferencing for
`torch.autograd.Function` we defined, to propagate the original
dimensions across the PythonOp at the best efforts.
But the 2rd one is not covered yet, we could refine that later. Fixing
1st one is enough for ORTModule training/evaluation.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
test case 'test_batchnorm_epsilon_training_mode' on webgpu is failing.
the issue need time to investigate so comment this off and re-enable it
when the root cause is fixed.
### Description
Adding int4 quantization code in python
### Motivation and Context
Python quantization tool no-longer needs to invoke shell to call a
native exe
### Description
This PR fixes build break for WebAssembly introduced in
6986981482
(435ad2b1d8).
This change updates onnx.patch in onnxruntime repo. the corresponding PR
in onnx repo is: https://github.com/onnx/onnx/pull/5495.
It may takes a while for the next onnx version bump.
### Description
This PR introduces the new incides helper.
IndicesHelper is a helper class for generating WGSL code for
manipulating indices and data for a shader's input or output.
This class is designed to offer a unified way to generate WGSL code for
manipulating indices and data for a shader's input or output. The
following is a list of terminologies used in this class:
- `offset`: a uint32 value representing the offset of an element in the
data buffer.
- `indices`: an abstraction of a multi-dimensional array's indices
representing the data's index on each dimension.
- `value`: a value of a data element.
Users are expected to create an instance of this class for each shader's
input or output, and use the instance to generate WGSL code for
manipulating indices and data. The following 2 exported functions are
for users to call to create an instance of an indices helper:
- `inputVariable()`: create an indices helper instance for an input.
- `outputVariable()`: create an indices helper instance for an output.
An indices helper instance contains helper functions for the following
operations:
- access readonly basic information, including: `name`(the name of the
input or output), `usage`(whether it's an input or an output) and
`shape`(the passed in shape).
- `type`: access readonly type information, including: `indices`(the
type of indices), `value`(the type of value at runtime), `storage`(the
type of value at storage) and `tensor`(the tensor type as represented in
TensorView).
- generate WGSL code for getting indices from offset. Use
`offsetToIndices()` for WGSL code snippet to calculate incides from
offset, and use `indicesToOffset()` for WGSL code snippet to calculate
offset from indices.
- to manipulate an instance of indices, use `setIndices()` and
`getIndices()` to set and get the indices on an indices variable.
- to manipulate data, use `set()`/`get()` to access data at the given
indices from parameter list, use `setByIndices()`/`getByIndices()` to
access data at the given indices from an indices variable, and use
`setByOffset()`/`getByOffset()` to access data at the given offset.
- `impl`: get WGSL code of function implementation for the util
functions mentioned above.
This change applies the usage of new IndicesHelper through the code, but
not necessary for all code.
### Description
Some pipelines are failing. It is because PR #16325 set ONNX version to
`rel-1.14.1` . It is a branch name, not a commit or tag name. It means
whenever the branch got a new commit, we will auto pick it and use it.
### Description
updates the build.bat file in root folder to try best efforts to locate
patch.exe.
patch.exe is needed to apply patches to some of the dependencies. for
example, #17104. However, patch.exe is not available on every windows
developer's search path, and if cannot find patch.exe, the build will
continue silently. ( as a result, patch is not applied and for patches
like #17104, this will cause a build break )
This change adds folder `C:\Program Files\Git\usr\bin` to the PATH,
which is the default git installation directory. This may resolve the
patch not found issue for most (hopefully) users.
This addresses a DML performance regression from the following PR
resulting in allocations not being rounded and pooled in the DML
execution provider.
https://github.com/microsoft/onnxruntime/pull/15833
This also fixes a pre-existing limitation that allocations during
session initialization (primarily large weights and persistent
resources) only bypassed rounding and pooling while using the Winml API.
The allocator now also respects a caller's rounding mode parameter when
provided.
### Description
This change upgrade emsdk to 3.1.44.
Because backend is upgraded to LLVM 16, so need to fix a lot of build
failures caused by "-Wshorten-64-to-32".
most of the build failures comes from generated `onnx.pb.h`, and this
can be fixed by including "core/graph/onnx_protobuf.h", which detects
and ignore shorten-64-to-32 warnings.
### Description
1. Update model_tests.cc: avoid auto adding new tests from new opsets.
2. Simplify the "ConcatPathComponent" function. It does not need to be a
template.
### Motivation and Context
All our Windows/Linux CI build machines are preloaded with some test
data. In model_tests.cc, we auto add all of them to
onnxruntime_test_all.exe's unit tests. However, it causes problems when
we update the CI build machine images: new data could cause pipelines
suddenly failing.
Therefore, instead of auto discovering test data and adding all of them
to tests, this PR changes it to explicitly specify the opset names.
This change doesn't impact how Web CI pipeline runs its tests.
Going forward, the workflow would be like:
Step 1: update the onnx version in deps.txt
Step 2: Update js/scripts/prepare-onnx-node-tests.ts. Like #16943 .
Better to put step 1 and step 2 in the same PR.
Step 3: onnxruntime-es team regenerates VM images, test them and deploy
them.
Step 4: Enable the new opset test data for EPs.
[AB#18340](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/18340)
### Description
<!-- Describe your changes. -->
return empty ComputeCapability when a graph contains nested subgraph.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
For now, our architecture does not support nested subgraph. So, we
return empty ComputeCapability for this case.
### Description
<!-- Describe your changes. -->
a node arg can be matched multiple times.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Previous, we thought the node name must be unique and thus can be used
as identifier. However, we recently found that a node's name can be
empty thus failed to identify which node is which. So, we use node arg
to differentiate the node. To do so, we need to match node arg more than
once.
### Description
Fix some Resize failing tests.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
With PaddingElimination optimizer, input1 of element-wise op may be
flattened like:
```
input1 (shape:[batch_size, seq_len, ...]) input1 (shape:[valid_tokens, ...])
\ \
\ input2 \ input2
\ / -----> \ /
\ / \ /
Element-wise Op Element-wise Op
```
So, the shape of input2 should be processed accordingly:
1. If input2.shape.dim_size <= input1.shape.dim_size-2, i.e. input2 has
no [batch_size, seq_len] at begining,
we needn't to process the shape of input2 because it's compatible with
the flattened shape of input1 (shape:[valid_tokens, ...]).
2. If the shape of input2 has the same dim_size with shape of input1 and
has [batch_size, seqlen] at begening,
to be compatible with flattened shape of input1, we need to insert
flatten pattern for input2 also,
which flatten the shape of input2 from [batch_size, seq_len, ...] to
[valida_tokens, ...].
3. (which done in this pr) In other case for shape of input2, like [1,
seq_len, ...] or [batch_size, 1, ...], we firstly need to expand it
to [batch_size, seq_len, ...] which is convenient to flatten. And then
insert flatten pattern.
### Check type for building gradient graph
**Bug1**:
To fix the error when running the model with ORTModule + Stage 3:
```
Exception happens when running <bound method Function.apply of <class 'onnxruntime.training.utils.hooks._zero_offload_subscriber.ORTZeROOffloadPreForwardFunction'>>
Traceback (most recent call last):
File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_custom_autograd_function_runner.py", line 207, in call_python_forward_function
wrapped_arg.requires_grad = is_training_mode and grad_flag
RuntimeError: only Tensors of floating point and complex dtype can require gradients
```
This is because when running PythonA, the 3rd input is int64, we find it
requires gradient during the check in gradient builder, so we set its
requires_grad = True, but PyTorch thinks it is incorrect, throwing the
exception. So we need understand why ORT gradient builder think the 3rd
input need gradients.
During `ReverseBFSWithStopGradient`, which do reverse BFS from graph
outputs, it collects all nodes that are needed for computing the graph
outputs. `ReverseBFSWithStopGradient` define a queue, initially add all
nodes that generate graph outputs, then iterate the nodes one by one,
checking each node's input, if the input did not hit stop edge and its
node arg type is allowed type (float, etc), then the input node is
append into the queue, do the next iteration of work.
PythonOpA is such a node that is needed to compute graph outputs, then
IsReachable(PythonOpA) will return True.

In the above code snippet, when node is PythonOpB, and next_node being
PythonOpA, we did not check node_arg type between node and next_node on
the connection of PythonOpA's 3rd input to PythonOpB's outputs. So we
append the int64 typed node args to sets that require gradient.
**Fix1**: add the node arg type check before appending it into require
grad lists.
After the fixing, A unit test failed
"orttraining_test_ortmodule_api.py::test_gradient_correctness_minmax[data_type0-True-0-min]
Fatal Python error: Segmentation fault". After investigation, it is
another bug.
**Bug2**:
Without the above fix1, the execution graph looks like this

As you can see, int64 type has a gradient edge built, while it is not
used for any consumers. And the execution runs well. While think twice,
int type should not have grad edge built.
With the Fix1, the execution graph looks like this;

So the int type node arg did not has gradient edge built. **Fix1** is
fixing this problem.
But another bug happens if the inital "y_node_arg_names" e.g. in this
case Aten's two outputs, 1st one in float, 2nd one in int. When we check
the y_node
(6e6f582e08/orttraining/orttraining/core/framework/gradient_graph_builder.cc (L60C16-L60C16)),
we did not check the data type, then add it into `y_node_args_` which is
the list of graph output node args that requires gradient. Then
`non_differentiable_y_node_arg_names_` did not has the int type graph
output.
Then
6e6f582e08/orttraining/orttraining/core/framework/ortmodule_graph_builder.cc (L312C18-L312C18)
will try to get the grad node arg into `yield_output_node_args`, BUT the
grad node arg is not built for int type node arg (with the **Fix1**). So
we insert a nullptr, later when we using it, we get segment fault.
**Fix2**
Again, we add the type check when handle y_node_args, also add null
check when getting gradient node arg and append into
yield_output_node_args
### Description
<!-- Describe your changes. -->
remove unused code
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
remove unused code
This is in preparation for planned ROCm 6.0 changes that are not
backward compatible. However, the adjustments made by this PR to the
current onnxruntime cmake files will work with ROCm 5.x and 6.x.
The approach is the following:
1. Build partitions
2. Try compiling each partition into a `IDMLCompiledOperator`
3. If the compiled operator's persistent resource is bigger than 4GB,
tell the partitioner to split the partition in the middle and try again.
4. Once all partitions have been successfully compiled into an
`IDMLCompiledOperator`, fuse the partitions into an ORT operator and
register them all.
This change is relatively simple (basically a basic retry mechanism),
but it required a lot of refactoring just to make sure that we don't
modify the graph until **all** partitions have been compiled
successfully. This is because partly modifying the graph before making
sure that all partitions can be compiled will break future retries.
This path is not expected to be used a lot, and even then the loop is
not expected to loop more than twice very often. This is a very specific
edge case for large models that were able to merge a large number of
nodes into a single partition.
### Description
<!-- Describe your changes. -->
Check the bound of the node_get_inputs for out of bound error.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Model with loop would encounter this error. Currrent we do not support
custom op for loop. So, ideally it would throw an error and fall back to
CPU evalution.
### Description
onnxjs contains a `Resize` op input check which is outdated since opset
9. Currently `Resize` supports up to 4 inputs. This PR looses the input
check.
### Motivation and Context
Fixes#15636
### Description
1. Add "--windows_sdk_version" argument to build.py
2. Fix Windows Static Analysis build pipeline. It is failing because it
picks up a different Windows SDK version after a build machine image
update. If we can explicitly specify Windows SDK version, we can avoid
such things happening again.
3. Remove --enable_training from Windows Static Analysis build pipeline
because PR #16993 makes it incompatible with "no_rtti".
AB#18315