### Description
Enabled the use of per channel Bias and Mean normalization when converting an image <--> tensor.
Added a few bug fixes and updates to the relevant E2E tests.
---------
Co-authored-by: shalvamist <shalva.mist@microsoft.com>
Symbol visibility from DllImport is inconsistent across platforms resulting in the symbol not necessarily being visible to ORT native code that tries to look it up by name.
Best solution is to use DllImport to load the library and to call the registration function directly. That requires the native SessionOptions handle and OrtApiBase struct. We could either make those public, or provide a helper where the user passes in a delegate from their DllImport. Can add when needed.
Implement a set of new APIs for lightweight custom ops registration, to
save efforts on schema-composing.
A few highlights:
1. Support build-time type inference;
2. Support function-as-op for "stateless" ops;
3. Support structure-as-op for "stateful" ops;
4. Support varied input/output forms such as span, scalar, and tensors,
either optional or non-optional.
---------
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
### Description
Augment nhwc graph optimizer to accommodate fp16 operators.
### Motivation and Context
With new fp16 conv operator added. This operator prefers NHWC data
layout. We need to augment existing graph optimizers to better utilize
the new operator.
### Description
Originally VitisAI EP only works with old version of VitisAI release.
### Motivation and Context
Update VitisAI EP so that it works together with the current VitisiAI
3.5 and further version of VitisAI. We try our best to make it forward
compatible.
---------
Co-authored-by: Wang Chunye <chunywan@xilinx.com>
Co-authored-by: mingyue <mingyue@amd.com>
Co-authored-by: mingyueliuh <131847423+mingyueliuh@users.noreply.github.com>
Co-authored-by: liumingyue <mingyue@xilinx.com>
Co-authored-by: moore-ch <129165652+moore-ch@users.noreply.github.com>
Co-authored-by: shoucair <shoucai.ren@amd.com>
Co-authored-by: zz002 <zhenze.wang@amd.com>
Co-authored-by: BoarQing <yuz75@Pitt.edu>
Co-authored-by: Yueqing Zhang <yueqingz@amd.com>
Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>
This addresses a performance regression in some INT8 models with the
DirectML EP by defaulting OrtSessionOptionsDisableQuantQDQ to 1 when the
EP is registered.
This regression occured due to the introduction of the QDQ propagation
transformer, which is based on this session option. That transformer
maximizes the number of nodes which are executed as quantized by
logically propagating quantize operators upstream and dequantize
operators downstream. However, it does this simply by inserting QDQ
pairs, with an expectation that something will recognize sequences of
DQ->Op->Q. This logic and related L2 transformers are not currently
enabled for the DirectML EP.
This change also removes a noisy warning when the session option for
memory pattern is overriden as the DirectML EP is registered.
### Description
Walkaround of https://github.com/microsoft/onnxruntime/issues/15521.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Previous behavior of TRT EP to set TRT optimization profiles for dynamic
shape input is based on input tensor values. Users can't explicitly
specify the profiles.
This PR makes users capable of specifying min/max/opt profiles through
newly added three provider options:
`trt_profile_min_shapes`, `trt_profile_max_shapes` and
`trt_profile_opt_shapes`
with the format of "input1:dim1xdim2...,input2:dim3xdim4...".
(Note: It's similar to --minShapes, --maxShapes and --optShapes of
trtexec command-line
[flags](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#trtexec-flags))
For example, if you are using onnxruntime_perf_test, you can try this:
`./onnxruntime_perf_test -e tensorrt -r 1 -i
"trt_profile_min_shapes|imgs:1x3x384x288
trt_profile_max_shapes|imgs:32x3x384x288
trt_profile_opt_shapes|imgs:16x3x384x288" your_model_path`
If the engine cache is enabled, you still need to provide these three
explicit provider options in order to use this feature. ORT TRT will
compare the min/max/opt profile shape with the ones saved in .profile
file to decide whether to rebuild the engine.
Constraints to use these provider options: (1) Need to specify
min/max/opt profile shapes for all the dynamic shape input
This feature is also requested by other users:
https://github.com/microsoft/onnxruntime/issues/13851
### Description
<!-- Describe your changes. -->
Disable new test that is failing on linux. Not required for this
release. Will fix in the next week.
Marshal.Prelink can be used on Windows to make the symbol available but
Linux appears to work differently.
Also need to update the pre-checkin tests so this is tested early as
it's only failing in the E2E tests run in the packaging pipeline.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fix packaging pipeline error.
### Description
This PR updates the documentation for using the Whisper custom export
scripts via the wheel.
### Motivation and Context
The path should say
`onnxruntime.transformers.models.whisper.convert_to_onnx` instead of
`onnxruntime.transformers.models.convert_to_onnx`.
Dynamic shapes was not working with serialized model so we are switching
to compile model method
### Motivation and Context
Dynamic shapes was not working with serialized model
- If it fixes an open issue, please link to the issue here. -->
Signed-off-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: MaajidKhan <n.maajid.khan@intel.com>
### Description
Adds support for the LRN operator to QNN EP.
### Motivation and Context
Enables basic models like googlenet and alexnet to run entirely on QNN
EP.
### Description
<!-- Describe your changes. -->
js/react_native package dependency change to manage ort-extensions for
react-native app.
Enable optional inclusion of ort-ext aar/ ort-ext pods for react-native
extensions apps when specifiy `ortExtensionsEnabled` in user's
package.json file
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
### Description
<!-- Describe your changes. -->
-Add support for loading model from buffer on iOS
-Update OnnxruntimeModuleTest to use updated loadModelFromBuffer
Based on #12676
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Issue: #12500
---------
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
### Description
This PR adds `onnxruntime.transformers.models.whisper` to the wheel.
### Usage
There is a README.md document that shows sample commands. The following
command will show how to use the custom Whisper export script in more
detail.
```
$ python3 -m onnxruntime.transformers.models.whisper.convert_to_onnx --help
```
### Motivation and Context
This fixes an issue with adding the Whisper custom export scripts to the
wheel. The Whisper folder now appears in the wheel.

### Description
Cast optimizer may convert a fp16 node to fp32. This used to be safe as
all fp16 kernels has fp32 implementation. As this assumption is no
longer true, we need to check the validity of the operation
### Motivation and Context
Main work here is to introduce an API to check whether a kernel is
registered. Currently we don't have a way to do that without an operator
node. This needs to be augmented. We need to query whether a kernel is
registered by its property only, so that we can judge whether it is safe
to construct a node long before we actually do so.
…th-ort-leads-to-invalid-node-input-names
### Description
Fix issue where Quantizing DistilBERT models after optimizing with ORT
leads to invalid node input names
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
Add infrastructure so it's easy for a user to add the ORT extensions
nuget package and register the custom ops for C# apps.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Need to be able to use extensions on mobile platforms with Xamarin/MAUI
### Description
fix download failure due to buffer change.
WebAssembly buffer may change (growth triggered by memory allocation)
during an async function call.
### Description
<!-- Describe your changes. -->
Add registration for DML reduce functions in opset 18.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Linnea May <linneamay@microsoft.com>
### Description
All our Windows build pipelines already uses cmake 3.26 except one
pipeline: QNN ARM64.
This PR does the same for Linux build pipelines.
### Motivation and Context
This change is related to #15704 .
### Description
This PR changes an EmbedLayerNormalization node's mask index output to
be an optional output if a mask input is not provided.
### Motivation and Context
The documentation for EmbedLayerNormalization states
```
The last input mask is optional. If mask is provided, mask index (that is position of first 0 in mask, or number of words) will be calculated.
```
However, if the mask input is not provided, the mask index output is
still calculated and required.
### Description
Extend the AllGather op to support perform allgather on different axis.
provide the implementation in nccl kernels.
### Motivation and Context
We hit some scenario in distributed inference that we need to support
gather on non-first axis.
---------
Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: Wei-Sheng Chin <wschin@outlook.com>
### Description
- Update to QNN SDK 2.9.0 for QNN pipelines
- Temporarily disable warnings as errors for QNN Windows x64 pipeline
- Note that this pipeline did not previously run to completion. It also
currently does not run for pull requests.
### Motivation and Context
Need to update and test the latest available version of the QNN SDK.
The PR is to allow custom op of different input types to have same op
name in a graph.
The idea to go over all ops of same name and merge their input/output
types into a type-inference function.
With the enhancement, custom op node inside a graph can have same
op-type given that the input/output types are different.
---------
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
Rename onnxruntime-Linux-CPU-2019 machine pool to
"onnxruntime-Ubuntu2004-AMD-CPU". The old one has an internal error and
stuck there. I cannot make any change to it. It has been like this for
more than 1 week. So I created a new pool with the same setting except
the name is different.
Also, move some android pipelines to
"onnxruntime-Linux-CPU-For-Android-CI" which uses a standard image from
https://github.com/actions/runner-images
### Description
In #8953 I introduced a change in our onnxruntime_mlas.cmake that it
enables "ASM_MASM" cmake language for all Windows build.
```cmake
enable_language(ASM_MASM)
```
Before the change, it is only enabled when onnxruntime_target_platform
equals to x64.
However, cmake 3.26 added a new language: ASM_MARMASM.
According to cmake's manual,
ASM_MASM is for Microsoft Assembler
ASM_MARMASM is for Microsoft ARM Assembler. This one is new in cmake
3.26.
We should choose the right one according to
${onnxruntime_target_platform}.
### Description
* Update TensorRT 8.6 lib dependencies in dockerfile of TRT EP Perf
pipeline
* Avoid using `--allow_running_as_root` and build ORT with non-root user
### Motivation and Context
To fix the build issue on EP perf pipeline
Fixed
[AB#14615]
The follow code shows ROCm EP FusedConv produce incorrect results:
```py
import numpy as np
import onnx
import onnxruntime as ort
X = onnx.helper.make_tensor_value_info("input", onnx.TensorProto.FLOAT, [1, 64, 55, 55])
a = onnx.helper.make_tensor_value_info("tmp", onnx.TensorProto.FLOAT, [1, 64, 55, 55])
Y = onnx.helper.make_tensor_value_info("output", onnx.TensorProto.FLOAT, [1, 64, 55, 55])
weight_data = np.random.random([64, 64, 1, 1]).astype(np.float32)
weight1 = onnx.helper.make_tensor("weight1", onnx.TensorProto.FLOAT, [64, 64, 1, 1], weight_data)
bias_data = np.random.random(64).astype(np.float32)
bias1 = onnx.helper.make_tensor("bias1", onnx.TensorProto.FLOAT, [64], bias_data)
weight_data = np.random.random([64, 64, 1, 1]).astype(np.float32) # <------ comment out
weight2 = onnx.helper.make_tensor("weight2", onnx.TensorProto.FLOAT, [64, 64, 1, 1], weight_data)
bias_data = np.random.random(64).astype(np.float32) # <------ comment out
bias2 = onnx.helper.make_tensor("bias2", onnx.TensorProto.FLOAT, [64], bias_data)
node1 = onnx.helper.make_node("FusedConv", inputs=[X.name, weight1.name, bias1.name], outputs=[a.name], domain="com.microsoft", kernel_shape = [1,1], activation="Relu")
node2 = onnx.helper.make_node("FusedConv", inputs=[a.name, weight2.name, bias2.name], outputs=[Y.name], domain="com.microsoft", kernel_shape = [1,1], activation="Relu")
graph = onnx.helper.make_graph([node1, node2], "Graph", [X], [Y], initializer=[weight1, bias1, weight2, bias2])
model = onnx.helper.make_model(graph, producer_name="tmp", opset_imports=[
onnx.helper.make_opsetid('com.microsoft', 1),
onnx.helper.make_opsetid('ai.onnx.ml', 1),
onnx.helper.make_opsetid('', 14),
])
sess0 = ort.InferenceSession(model.SerializeToString(), providers=["CPUExecutionProvider"])
sess1 = ort.InferenceSession(model.SerializeToString(), providers=["ROCMExecutionProvider"])
ref = sess0.run(["output"], {"input" : 0.05 * np.ones([1, 64, 55, 55], dtype=np.float32)})[0]
our = sess1.run(["output"], {"input" : 0.05 * np.ones([1, 64, 55, 55], dtype=np.float32)})[0]
print(ref - our)
```
The root cause is that fusion args is cached together with fusion plan.
It seems that internal to MIOpen, the `miopenOperatorArgs_t` handle is
copied directly to execution engine, instread of the content of a
`miopenOperatorArgs_t`. If two ORT `OpKernel`s have the same conv kernel
spatial dimension and strides, etc, we then get the same hash for the
fusion plan, thus we also get the same fusion args handle. Then the
second node of `FusedConv` may modify the fusion args on the fly when it
is still pending execution for first node of `FusedConv` internal to
MIOpen. This PR moves the fusion args out of fusion plan cache to avoid
the problem.
### Fold shape related operation at best efforts.
This is a follow up for PR
https://github.com/microsoft/onnxruntime/pull/12561.
Create a specialized shape_optimzer to constant fold shape related
operation.
ShapeOptimizer at the best efforts to constant fold the dim values that
exists from shape inferencing. This is helpful to simplify the graph,
which on the other hand, help other graph transformers to do more.
Transformer that traverses the graph top-down and performs shape
optimizations.
Try the best effort to constant fold the shape related to Shape node
outputs:
1. Shape generates 1D tensor [12, 128, 512] (all dimensions have
concrete dim value), which can be constant folded
to an initializer including 1D tensor values [12, 128, 512]. (Some logic
of ConstantFolding also does the same thing.)
2. Shape generate 1D tensor [batch_size, 128, 512] ->
Slice(start=1,end=3), we can constant fold the Shape->Slice to
an initializer including 1D tensor values [128, 512].
3. Shape generate 1D tensor [batch_size, 128, 512] -> Gather(axes=[0],
index=[2]), we can constant fold the
Shape->Gather to an initializer including 1D tensor values [512].
4. Shape 15 takes input of shape [batch_size, 128, 512], slicing from 1
to 2(exclusive), we can constant fold the
Shape15(start=1,end=2) to an initializer including 1D tensor values
[128].
This would help clean up the graph, combined with ConstantFolding, the
graph would be much more simplified.
### Motivation and Context
One direct motivation to have this is, we have a model subgraph like
this:

The subgraph in the green rectangle is trying to get the value `30522`,
with the changes in this PR, the subgraph will be constant folded. Plus
ConstantFolding optimizer will further to optimize out the subsquent
`Squeeze`/`Unsqueeze`/`ConcatTraining`, then we will have a clean very
clean Reshape node, with its shape input be an constant `[-1, 20522]`.
Having this simplified graph, our other compute optimizer can help
further optimize the graph by re-ordering gather/reshape nodes.
### Description
Add parameters to make some stages could use other run's intermediate
output.
### Motivation and Context
nuget workflow has 38 stages of 4 layers.
We had to run the whole workflow from begining to test one stage.
It could make life easier to run only one stage for testing.
like

### N.B.
In this PR, Nuget_Test_Linux_CPU, Nuget_Test_LinuxGPU and
Jar_Packaging_GPU are enabled as the first step.
So I can start to move tests from Linux host to container
Adds skip for MIGraphX EP builds for Packed KV and QKV tests in
Multi Head attention. As it is not supported and causes CI failures
when building and testing EPs
---------
Co-authored-by: Ted Themistokleous <tthemist@amd.com>
### Description
The BufferUniquePtrs in the old code doesn't have knowledge of the
allocator where the allocated memory was from, so it cannot free the
memory.
### Description
<!-- Describe your changes. -->
Reland previous reverted changes for loading model from buffer - Android
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
#13903
---------
Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>