…th-ort-leads-to-invalid-node-input-names
### Description
Fix an issue where quantizing DistilBERT models after optimizing with ORT
leads to invalid node input names.
### Motivation and Context
### Description
Add infrastructure so it's easy for a user to add the ORT extensions
nuget package and register the custom ops for C# apps.
### Motivation and Context
Need to be able to use extensions on mobile platforms with Xamarin/MAUI
### Description
Fix download failure due to buffer change.
The WebAssembly buffer may change (growth triggered by memory allocation)
during an async function call.
### Description
Add registration for DML reduce functions in opset 18.
### Motivation and Context
---------
Co-authored-by: Linnea May <linneamay@microsoft.com>
### Description
All our Windows build pipelines already use cmake 3.26 except one
pipeline: QNN ARM64.
This PR does the same for Linux build pipelines.
### Motivation and Context
This change is related to #15704 .
### Description
This PR changes an EmbedLayerNormalization node's mask index output to
be an optional output if a mask input is not provided.
### Motivation and Context
The documentation for EmbedLayerNormalization states
```
The last input mask is optional. If mask is provided, mask index (that is position of first 0 in mask, or number of words) will be calculated.
```
However, if the mask input is not provided, the mask index output is
still calculated and required.
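To illustrate what the mask index represents, here is a minimal NumPy sketch. The function name `mask_index` is hypothetical and this is not the operator's kernel code. It assumes the mask is 1 for real tokens and 0 for padding, with all padding at the end of each row.

```python
import numpy as np

def mask_index(mask: np.ndarray) -> np.ndarray:
    """Return, per row, the position of the first 0 in a right-padded mask.

    Assumes `mask` has shape (batch, sequence) with 1s followed by 0s,
    so the position of the first 0 equals the number of 1s (words).
    """
    return mask.astype(np.int32).sum(axis=1).astype(np.int32)

mask = np.array([[1, 1, 1, 0, 0],
                 [1, 1, 1, 1, 1]])
print(mask_index(mask))  # [3 5]
```

When no mask is supplied there is nothing to count, which is why it is reasonable for the output to be optional in that case.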
### Description
Extend the AllGather op to support performing allgather on different axes,
and provide the implementation in the NCCL kernels.
### Motivation and Context
We hit some scenarios in distributed inference where we need to support
gathering on a non-first axis.
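As a rough single-process illustration of gathering on a non-first axis, the sketch below simulates the ranks with a list of NumPy arrays. The helper `all_gather` is hypothetical; it is not the NCCL kernel and does no communication.

```python
import numpy as np

def all_gather(shards, axis=0):
    """Simulate AllGather: every rank ends up with all shards joined on `axis`."""
    gathered = np.concatenate(shards, axis=axis)
    # One copy per simulated rank.
    return [gathered.copy() for _ in shards]

# Two ranks, each holding a (2, 3) shard of a tensor split along axis 1.
full = np.arange(12).reshape(2, 6)
shards = np.split(full, 2, axis=1)
result = all_gather(shards, axis=1)
print(np.array_equal(result[0], full))  # True
```

Gathering the same shards on axis 0 would instead produce a (4, 3) tensor, which is not the original layout when the tensor was split along axis 1.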
---------
Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Wei-Sheng Chin <wschin@outlook.com>
### Description
- Update to QNN SDK 2.9.0 for QNN pipelines
- Temporarily disable warnings as errors for QNN Windows x64 pipeline
- Note that this pipeline did not previously run to completion. It also
currently does not run for pull requests.
### Motivation and Context
Need to update and test the latest available version of the QNN SDK.
This PR allows custom ops with different input types to share the same op
name in a graph.
The idea is to go over all ops of the same name and merge their input/output
types into a single type-inference function.
With this enhancement, custom op nodes inside a graph can have the same
op type, provided that their input/output types are different.
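The merging idea can be sketched in plain Python. All names here (`merge_signatures`, `infer_output_types`, the tuple layout) are hypothetical and do not correspond to the ORT custom op API; rejecting duplicate signatures is an assumption based on the requirement that the types differ.

```python
from collections import defaultdict

def merge_signatures(custom_ops):
    """Group custom ops by name and collect the type signatures each name accepts.

    `custom_ops` is a list of (name, input_types, output_types) tuples.
    """
    merged = defaultdict(dict)
    for name, input_types, output_types in custom_ops:
        key = tuple(input_types)
        if key in merged[name]:
            # Two ops with the same name must differ in their input types.
            raise ValueError(f"duplicate signature for {name}: {key}")
        merged[name][key] = tuple(output_types)
    return dict(merged)

def infer_output_types(merged, name, input_types):
    """Type inference for a node: pick the outputs matching the node's input types."""
    return merged[name][tuple(input_types)]

ops = [
    ("MyOp", ["float"], ["float"]),
    ("MyOp", ["int64"], ["int64"]),
]
merged = merge_signatures(ops)
print(infer_output_types(merged, "MyOp", ["int64"]))  # ('int64',)
```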
---------
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
Rename the onnxruntime-Linux-CPU-2019 machine pool to
"onnxruntime-Ubuntu2004-AMD-CPU". The old pool hit an internal error and
has been stuck for more than a week, and I cannot make any changes to it.
So I created a new pool with the same settings under a different name.
Also, move some android pipelines to
"onnxruntime-Linux-CPU-For-Android-CI" which uses a standard image from
https://github.com/actions/runner-images
### Description
In #8953 I changed our onnxruntime_mlas.cmake so that it
enables the "ASM_MASM" cmake language for all Windows builds.
```cmake
enable_language(ASM_MASM)
```
Before the change, it was only enabled when onnxruntime_target_platform
was equal to x64.
However, cmake 3.26 added a new language: ASM_MARMASM.
According to cmake's manual:
- ASM_MASM is for Microsoft Assembler.
- ASM_MARMASM is for Microsoft ARM Assembler. This one is new in cmake
3.26.

We should choose the right one according to
${onnxruntime_target_platform}.
### Description
* Update TensorRT 8.6 lib dependencies in dockerfile of TRT EP Perf
pipeline
* Avoid using `--allow_running_as_root` and build ORT with non-root user
### Motivation and Context
To fix the build issue on EP perf pipeline
Fixed
[AB#14615]
The following code shows that ROCm EP FusedConv produces incorrect results:
```py
import numpy as np
import onnx
import onnxruntime as ort
X = onnx.helper.make_tensor_value_info("input", onnx.TensorProto.FLOAT, [1, 64, 55, 55])
a = onnx.helper.make_tensor_value_info("tmp", onnx.TensorProto.FLOAT, [1, 64, 55, 55])
Y = onnx.helper.make_tensor_value_info("output", onnx.TensorProto.FLOAT, [1, 64, 55, 55])
weight_data = np.random.random([64, 64, 1, 1]).astype(np.float32)
weight1 = onnx.helper.make_tensor("weight1", onnx.TensorProto.FLOAT, [64, 64, 1, 1], weight_data)
bias_data = np.random.random(64).astype(np.float32)
bias1 = onnx.helper.make_tensor("bias1", onnx.TensorProto.FLOAT, [64], bias_data)
weight_data = np.random.random([64, 64, 1, 1]).astype(np.float32) # <------ comment out
weight2 = onnx.helper.make_tensor("weight2", onnx.TensorProto.FLOAT, [64, 64, 1, 1], weight_data)
bias_data = np.random.random(64).astype(np.float32) # <------ comment out
bias2 = onnx.helper.make_tensor("bias2", onnx.TensorProto.FLOAT, [64], bias_data)
node1 = onnx.helper.make_node("FusedConv", inputs=[X.name, weight1.name, bias1.name], outputs=[a.name], domain="com.microsoft", kernel_shape = [1,1], activation="Relu")
node2 = onnx.helper.make_node("FusedConv", inputs=[a.name, weight2.name, bias2.name], outputs=[Y.name], domain="com.microsoft", kernel_shape = [1,1], activation="Relu")
graph = onnx.helper.make_graph([node1, node2], "Graph", [X], [Y], initializer=[weight1, bias1, weight2, bias2])
model = onnx.helper.make_model(graph, producer_name="tmp", opset_imports=[
onnx.helper.make_opsetid('com.microsoft', 1),
onnx.helper.make_opsetid('ai.onnx.ml', 1),
onnx.helper.make_opsetid('', 14),
])
sess0 = ort.InferenceSession(model.SerializeToString(), providers=["CPUExecutionProvider"])
sess1 = ort.InferenceSession(model.SerializeToString(), providers=["ROCMExecutionProvider"])
ref = sess0.run(["output"], {"input" : 0.05 * np.ones([1, 64, 55, 55], dtype=np.float32)})[0]
our = sess1.run(["output"], {"input" : 0.05 * np.ones([1, 64, 55, 55], dtype=np.float32)})[0]
print(ref - our)
```
The root cause is that the fusion args are cached together with the fusion plan.
It seems that, internal to MIOpen, the `miopenOperatorArgs_t` handle is
copied directly to the execution engine, instead of the content of the
`miopenOperatorArgs_t`. If two ORT `OpKernel`s have the same conv kernel
spatial dimensions, strides, etc., we get the same hash for the
fusion plan, and thus the same fusion args handle. The
second `FusedConv` node may then modify the fusion args on the fly while the
first `FusedConv` node is still pending execution inside
MIOpen. This PR moves the fusion args out of the fusion plan cache to avoid
the problem.
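The aliasing described above can be modelled in a few lines of Python. This is only an analogy under stated assumptions: `FusionArgs`, `plan_cache` and both helper functions are hypothetical stand-ins, not MIOpen or ORT code.

```python
class FusionArgs:
    """Stand-in for a mutable argument handle (hypothetical, not the MIOpen type)."""
    def __init__(self):
        self.weight = None

plan_cache = {}

def get_plan_shared_args(plan_hash):
    """Problematic pattern: the args object is cached alongside the plan."""
    if plan_hash not in plan_cache:
        plan_cache[plan_hash] = {"plan": f"plan-{plan_hash}", "args": FusionArgs()}
    entry = plan_cache[plan_hash]
    return entry["plan"], entry["args"]

def get_plan_own_args(plan_hash):
    """Fixed pattern: only the plan is shared; each caller gets its own args."""
    plan, _ = get_plan_shared_args(plan_hash)
    return plan, FusionArgs()

# Two kernels with an identical conv configuration hash to the same plan.
_, args1 = get_plan_shared_args(42)
_, args2 = get_plan_shared_args(42)
args1.weight = "weight1"
args2.weight = "weight2"   # overwrites what kernel 1 set
print(args1.weight)        # weight2

_, own1 = get_plan_own_args(42)
_, own2 = get_plan_own_args(42)
own1.weight = "weight1"
own2.weight = "weight2"
print(own1.weight)         # weight1
```

Keeping the plan in the cache preserves the benefit of reusing an expensive compiled object, while the cheap per-kernel arguments are no longer shared.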
### Fold shape-related operations on a best-effort basis
This is a follow-up to PR
https://github.com/microsoft/onnxruntime/pull/12561.
Create a specialized shape_optimizer to constant fold shape-related
operations.
ShapeOptimizer makes a best effort to constant fold the dim values that
are known from shape inference. This simplifies the graph,
which in turn helps other graph transformers do more.
It is a transformer that traverses the graph top-down and performs shape
optimizations.
It makes a best effort to constant fold the shapes derived from Shape node
outputs:
1. Shape generates a 1D tensor [12, 128, 512] (all dimensions have
concrete dim values), which can be constant folded
to an initializer holding the 1D tensor values [12, 128, 512]. (Some logic
in ConstantFolding also does the same thing.)
2. Shape generates a 1D tensor [batch_size, 128, 512] ->
Slice(start=1,end=3): we can constant fold the Shape->Slice to
an initializer holding the 1D tensor values [128, 512].
3. Shape generates a 1D tensor [batch_size, 128, 512] -> Gather(axes=[0],
index=[2]): we can constant fold the
Shape->Gather to an initializer holding the 1D tensor values [512].
4. Shape 15 takes an input of shape [batch_size, 128, 512], slicing from 1
to 2 (exclusive): we can constant fold
Shape15(start=1,end=2) to an initializer holding the 1D tensor values
[128].
This helps clean up the graph; combined with ConstantFolding, the
graph becomes much simpler.
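The folding rule in the cases above can be sketched as follows. This is a simplified model, assuming a shape is a Python list that mixes ints (concrete dims) and strings (symbolic dims); the function `fold_shape_slice` is hypothetical and not the ShapeOptimizer implementation.

```python
def fold_shape_slice(dims, start=0, end=None):
    """Return the sliced dims as a list of ints if all are concrete, else None.

    `dims` mixes ints (concrete) and strings (symbolic),
    e.g. ["batch_size", 128, 512].
    """
    sliced = dims[start:end]
    if all(isinstance(d, int) for d in sliced):
        return sliced  # can become a constant initializer
    return None        # a symbolic dim remains, so no folding

shape = ["batch_size", 128, 512]
print(fold_shape_slice(shape))           # None, batch_size is symbolic
print(fold_shape_slice(shape, 1, 3))     # [128, 512]
print(fold_shape_slice(shape, 2, 3))     # [512]
print(fold_shape_slice([12, 128, 512]))  # [12, 128, 512]
```

The point is that folding only needs the dims that are actually consumed to be concrete, so a symbolic batch dimension does not block folding of the remaining dims.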
### Motivation and Context
One direct motivation to have this is, we have a model subgraph like
this:

The subgraph in the green rectangle is trying to get the value `30522`.
With the changes in this PR, the subgraph will be constant folded. In addition, the
ConstantFolding optimizer will further optimize out the subsequent
`Squeeze`/`Unsqueeze`/`ConcatTraining`, leaving a very
clean Reshape node whose shape input is a constant `[-1, 30522]`.
With this simplified graph, our other compute optimizers can
further optimize the graph by re-ordering gather/reshape nodes.
### Description
Add parameters so that some stages can use another run's intermediate
output.
### Motivation and Context
The nuget workflow has 38 stages in 4 layers.
We had to run the whole workflow from the beginning to test one stage.
Being able to run only one stage for testing makes life easier.
### N.B.
In this PR, Nuget_Test_Linux_CPU, Nuget_Test_LinuxGPU and
Jar_Packaging_GPU are enabled as the first step.
So I can start to move tests from Linux host to container
Adds a skip for the Packed KV and QKV tests in Multi Head attention
for MIGraphX EP builds, as they are not supported and cause CI failures
when building and testing EPs.
---------
Co-authored-by: Ted Themistokleous <tthemist@amd.com>
### Description
The BufferUniquePtrs in the old code have no knowledge of the
allocator that the memory was allocated from, so they cannot free the
memory.
### Description
Reland the previously reverted changes for loading a model from a buffer (Android).
### Motivation and Context
#13903
---------
Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
### Description
### Motivation and Context
DML's MVN metacommand needs all axes except for batch and channel to be
reduced. By adding trailing dimensions of 1's and their corresponding
axes, the operation stays the same but we are now able to call
metacommands.
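The equivalence can be checked with a small NumPy sketch. The `mvn` helper below is a plain mean-variance normalization written for this illustration (its `eps` handling is an assumption); it is not the DML kernel.

```python
import numpy as np

def mvn(x, axes, eps=1e-9):
    """Mean-variance normalization over the given axes."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.default_rng(0).standard_normal((2, 3, 5)).astype(np.float32)

# Original: reduce only axis 2 of an (N, C, L) tensor.
ref = mvn(x, axes=(2,))

# Padded: view as (N, C, L, 1) and reduce every axis except batch and channel.
padded = mvn(x.reshape(2, 3, 5, 1), axes=(2, 3)).reshape(2, 3, 5)

print(np.allclose(ref, padded))  # True
```

A trailing dimension of size 1 contributes no extra elements to the mean or variance, so reducing over it changes nothing numerically.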
### Description
In 2021 we restricted the onnx node test CI execution for ORT-TRT to opsets
14-15, which were the latest opsets that the TRT EP could support.
Update this range to opsets 14-17 to improve the ORT-TRT unit test
coverage, as [Nvidia announced that TRT 8.6 supports
opset 17](https://github.com/onnx/onnx-tensorrt/blob/main/docs/operators.md).
### Description
This PR adds the Whisper custom export scripts to the wheel.
### Motivation and Context
This enables access to the custom export scripts in the wheel.
### Description
* Reverting default TensorRT version to 8.5 as temporary fix
* Apart from that, this PR temporarily leaves this CI as a place to
validate user behavior that uses TRT 8.5 with latest ORT
### Context
* This CI pool is equipped with 2xTesla M60 GPUs, which are no longer supported by
TensorRT 8.6.
* Currently, other CIs are using a single-T4 VM, but there's no VM with
2xT4 or another suitable dual-GPU setup in the range.
* Once we decide which VM instance to migrate this CI to, TRT 8.6 can
be enabled on this CI.
* According to
[Nvidia](https://docs.nvidia.com/deeplearning/tensorrt/release-notes/index.html):
* TensorRT 8.5.3 was the last release supporting NVIDIA Kepler (SM 3.x)
and NVIDIA Maxwell (SM 5.x) devices. *These devices are no longer
supported in TensorRT 8.6*. NVIDIA Pascal (SM 6.x) devices are
deprecated in TensorRT 8.6.
### Description
This PR resolves some of the non-critical code review
comments in #14579.
- use `USE_JSEP` instead of `USE_JS` in build definition to make it less
ambiguous
- remove unused util functions from util.ts
- fix transpose.h
- other misc fixes
### Description
This PR:
- Adds VPU support to the OpenVINO Execution Provider.
- Fixes bugs for GPU and CPU.
- Changes the OpenVINO backend in the Serialized Model API for faster First
Inference Latency.
- Deprecates HDDL-VADM and MYRIAD and removes the related code.
- Supports OpenVINO 2023.0.
- Adds Dynamic Shapes support for iGPU.
### Motivation and Context
- VPU is an upcoming hardware that can provide AI Acceleration for
Client Systems through OpenVINO
---------
Signed-off-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
Co-authored-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
### Description
Update CUDA 11.6 to 11.8 for Windows pipelines.
This PR is just for Windows CUDA pipelines. It does not include any changes
for Linux pipelines or TensorRT pipelines.
### Motivation and Context
It is a planned feature for the upcoming ONNX Runtime release.
### Description
Support the latest DeepSpeed 0.9.1 for the next release.
### Motivation and Context
This avoids the warning message `Skip modifying optimizer because of
unsupported DeepSpeed version`.
---------
Co-authored-by: ruiren <ruiren@microsoft.com>
Some math ops have very bad numerical stability and essential randomness
(e.g., exp/log with reduction on large elements). To maintain the same
test coverage with a lower CI failure rate, we can gradually replace flaky
tests' RTG with the ones implemented in this PR. Try Discrete first;
if still unstable, use Circular.
Overall recommended strategy for handling a flaky test:
- Find if it uses `Uniform` in
`onnxruntime/test/common/tensor_op_test_utils.h`. If yes, replace
`Uniform` with `Discrete` implemented in this PR. For
`candidate_values`, we can try `[-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5,
2]`, `[-2, -1, 0, 1, 2]`, `[-1, 0, 1]`, and `[0, 1]` and choose the most
difficult one among those passing 100 runs.
- If `Discrete` fails to meet the stability requirement, switch to
`Circular` and repeat the `candidate_values` selection process.
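
To make the difference between the two strategies concrete, here is a Python sketch. It rests on an assumption about their behavior that the description does not spell out: that `Discrete` draws randomly from `candidate_values` and `Circular` cycles through them in a fixed order. The functions below are illustrative and are not the C++ generators in `tensor_op_test_utils.h`.

```python
import itertools
import random

def discrete(count, candidate_values, seed=0):
    """Draw `count` values at random from a small fixed set (assumed behavior)."""
    rng = random.Random(seed)
    return [rng.choice(candidate_values) for _ in range(count)]

def circular(count, candidate_values):
    """Fill `count` values by cycling through the set in order (assumed behavior)."""
    return list(itertools.islice(itertools.cycle(candidate_values), count))

values = [-1, 0, 1]
print(circular(7, values))  # [-1, 0, 1, -1, 0, 1, -1]
sample = discrete(7, values)
print(all(v in values for v in sample))  # True
```

Under this reading, both restrict inputs to values that are exactly representable and well behaved, and the cyclic variant additionally removes run-to-run variation.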
Let's keep an eye on the two bugs mentioned in
https://github.com/microsoft/onnxruntime/pull/15515. If the related unit
tests fail again, we can replace the underlying
`RandomValueGenerator::Uniform` with
`FixedPatternValueGenerator::Discrete` or
`FixedPatternValueGenerator::Circular` implemented in this PR.
### Description
### Error
```
RuntimeError: There was an error while exporting the PyTorch model to ONNX:-
Traceback (most recent call last):
File "/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_utils.py", line 254, in get_exception_as_string
raise exception
File "/opt/conda/envs/ptca/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_graph_execution_manager.py", line 385, in _get_exported_model
torch.onnx.export(self._flattened_module,
File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/onnx/__init__.py", line 305, in export
return utils.export(model, args, f, export_params, verbose, training,
File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/onnx/utils.py", line 118, in export
_export(model, args, f, export_params, verbose, training, input_names, output_names,
File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/onnx/utils.py", line 743, in _export
proto, export_map, val_use_external_data_format = graph._export_onnx(
RuntimeError: ONNX export failed: Couldn't export Python operator XDropout
```
The error leads to an out-of-memory issue, because the log.txt file is **26
GB**.
### Motivation and Context
The root cause is that each `_forward` call runs the following code:
```py
if log_level <= _logger.LogLevel.WARNING and not self._raised_ORTModuleONNXModelException:
    warnings.warn(
        (
            f"Fallback to PyTorch due to exception {type(self._exception)} was triggered. "
            "Report this issue with a minimal repro at https://www.github.com/microsoft/onnxruntime. "
            f"See details below:\n\n{_utils.get_exception_as_string(self._exception)}"
        ),
        UserWarning,
    )
```
This logs the `exception` through `get_exception_as_string` on every call.
In my training case, this leads to 40k `Traceback` printouts on stdout
and 110 million lines of `onnx graph` output, and runs into OOM.
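One common way to keep such a warning from repeating is a per-instance flag. The sketch below shows that pattern only; the description does not show the actual fix, and `FallbackManager` and `_warned` are hypothetical names.

```python
import warnings

class FallbackManager:
    """Hypothetical stand-in for the object that owns `_forward`."""

    def __init__(self):
        self._exception = RuntimeError("ONNX export failed")
        self._warned = False  # assumed flag name, not taken from the PR

    def _forward(self):
        # Emit the verbose fallback warning only on the first call.
        if not self._warned:
            warnings.warn(
                f"Fallback to PyTorch due to exception {type(self._exception)}",
                UserWarning,
            )
            self._warned = True
        return "pytorch-result"

manager = FallbackManager()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    for _ in range(1000):
        manager._forward()
print(len(caught))  # 1
```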
### Validation
After the above fixes, the log.txt file is only **2.4 MB**.
---------
Co-authored-by: ruiren <ruiren@microsoft.com>
### Description
Make `RunFunction` return `void`.
The return value is meaningless in the OpResolveRule context. This allows any
JavaScript error to be caught and a non-zero value to be returned from
`computeKernel()`.
### Description
Adding the fp16 onnx operator implementations:
maxpool, averagepool, global average pool, relu, leaky relu
### Motivation and Context
Continue with support for fp16. Standard onnx operator implementations are needed as a basis for the graph optimizers to work.
### Description
Fix iconv link issue. The library is used in string_normalizer.cc.
### Motivation and Context
Though iconv is part of POSIX standard, some systems may have additional iconv providers, for example GNU iconv, that is not in the standard c runtime library. In these cases we may need to link to additional libraries.
However, this change has two caveats:
1. It may silently pull in GNU libraries into libonnxruntime.so, and make the shared library not distributable.
2. The detection of iconv library runs before we add additional include folders to ORT. So the detection may be inaccurate.
### Description
Previously it used create_attention_node() from the base class in
fusion_attention.py. Changes in that file may sometimes silently
lead to generating a bad model.
### Motivation and Context
---------
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
The TRT EP test for the timing cache has wrong logic: it enables the timing
cache for both sessions when comparing the TRT engine build time, which is why
CI had some intermittent failures.
This PR disables the timing cache test that compares the engine build
time with and without the timing cache until we find a model that
can benefit from the timing cache.