This makes min and max with NaN for either operand always return NaN for
float16 data, matching the behaviour of float and double.
The behaviour for floats and doubles was previously fixed for the CPU
provider in #21492 and the CUDA provider in #19984, but these PRs didn't
fix the behaviour for float16 due to tests causing asan errors. The
memory access violations with float16 data have now been fixed in
#22135, so this PR is a follow-up to make float16 min and max behave the
same as float and double for both the CPU and CUDA providers now that we
can add tests for this.
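The intended semantics can be sketched in plain Python. This is only an illustration of NaN propagation, not the CPU or CUDA kernel code; the helper names `nan_propagating_min` and `nan_propagating_max` are hypothetical.

```python
import math


def nan_propagating_min(a: float, b: float) -> float:
    """Return NaN if either operand is NaN, otherwise the smaller value."""
    if math.isnan(a) or math.isnan(b):
        return math.nan
    return a if a < b else b


def nan_propagating_max(a: float, b: float) -> float:
    """Return NaN if either operand is NaN, otherwise the larger value."""
    if math.isnan(a) or math.isnan(b):
        return math.nan
    return a if a > b else b


# A comparison-based min/max gives an order-dependent answer when one
# operand is NaN, because every comparison with NaN is False.
print(min(1.0, math.nan), min(math.nan, 1.0))
# The propagating variants return NaN regardless of operand order.
print(nan_propagating_min(1.0, math.nan), nan_propagating_min(math.nan, 1.0))
```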
### Motivation and Context
Relevant previous issues (not float16 specific):
* #21455
* https://github.com/onnx/onnx/issues/6003
### Description
Following from #16578 and #16835 this migrates over
`OnnxTensor.createTensor(<array>)` to first instantiate a
`java.nio.Buffer` and then copy the array into that buffer in Java
before creating the tensor. It also changes the `OnnxTensor.getValue()`
method which returns a multidimensional array so it does the array
construction and value copy in Java. This allows the removal of some
unpleasant recursive C code which repeatedly calls into the JVM to
traverse Java's arrays. The equivalent Java code is still unpleasant and
recursive, but it's easier to reason about and memory safe. As a bonus,
more `OnnxTensor`s are now backed by buffers which allow users to pin
memory and reduce allocations by reusing them for same sized inputs.
Some of the JNI code which parses Java arrays still exists as it's used
by `OnnxMap`; removing that will be the target of a future refactor.
Strings are still processed in JNI as it is easier to work with String
tensors and UTF-8 arrays in C.
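The array-to-buffer copy and the reverse reconstruction can be sketched as follows. This is a Python illustration of the recursive traversal only, not the Java code in this PR; the helper names are hypothetical and the sketch assumes a non-empty, rectangular nested list.

```python
from array import array


def infer_shape(nested):
    """Walk the first element of each level to get the tensor shape."""
    shape = []
    level = nested
    while isinstance(level, list):
        shape.append(len(level))
        level = level[0]
    return shape


def flatten_into(nested, out):
    """Recursively copy a nested list into a flat buffer in row-major order."""
    for item in nested:
        if isinstance(item, list):
            flatten_into(item, out)
        else:
            out.append(item)


def unflatten(flat, shape, offset=0):
    """Rebuild the nested list from a flat buffer and a shape."""
    if len(shape) == 1:
        return list(flat[offset:offset + shape[0]])
    stride = 1
    for dim in shape[1:]:
        stride *= dim
    return [unflatten(flat, shape[1:], offset + i * stride)
            for i in range(shape[0])]


values = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
buffer = array("f")          # flat float32 buffer standing in for a java.nio.Buffer
flatten_into(values, buffer)
shape = infer_shape(values)
restored = unflatten(buffer, shape)
```

Keeping both directions in the managed language means the native side only ever sees one contiguous buffer plus a shape.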
### Motivation and Context
Minimizing the amount of JNI code makes it easier to maintain and using
buffers in preference to arrays allows for fewer allocations.
### Description
Add handling of a missing optional axes input to the ROCm reduction ops.
Matches CUDA EP change from #22149
### Motivation and Context
Fix pipeline.
### Description
* Add lintrunner to requirements-lintrunner.txt
* Lock lintrunner and lintrunner-adapter version
* Update documentation
### Motivation and Context
The document is not up to date.
Composable Kernel build fails under ROCm 6.2.
This PR patches Composable Kernel the same way as
https://github.com/ROCm/composable_kernel/pull/1346
* fix buffer resource to match "s" constraint
* add missing memory clobber
### Description
InstanceNormalization computes `y = scale * (x - mean) /
sqrt(variance + epsilon) + B`, where mean and variance are computed per
instance per channel. Calculating the mean and variance per channel is a
reduction, which is NCHW layout friendly because adjacent threads can
access contiguous data in GPU memory.
This PR optimizes both NHWC and NCHW InstanceNormalization. To
efficiently calculate the mean and variance, we need to make sure the
input is NCHW instead of NHWC. Then use shared memory to do the reduce
operation to get `channel_scale` and `channel_shift`.
With this PR, `channel_scale` and `channel_shift` are computed the same
way for NHWC and NCHW InstanceNormalization, and the overall performance
of the two is now very close.
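A minimal CPU-side sketch of the formula above, for reference. It assumes that `channel_scale` is `scale / sqrt(variance + epsilon)` and `channel_shift` is `B - mean * channel_scale`, which is one natural way to fold the formula into a multiply-add; the actual kernel definitions are not shown in this description. The function name and the nested-list layout are hypothetical.

```python
import math


def instance_norm_nchw(x, scale, bias, epsilon=1e-5):
    """x is indexed [n][c][i], with the spatial dimensions flattened into i."""
    out = []
    for instance in x:
        out_instance = []
        for c, channel in enumerate(instance):
            # Reduction over one (instance, channel) slice, contiguous in NCHW.
            mean = sum(channel) / len(channel)
            variance = sum((v - mean) ** 2 for v in channel) / len(channel)
            # Assumed folding of the formula into one scale and one shift.
            channel_scale = scale[c] / math.sqrt(variance + epsilon)
            channel_shift = bias[c] - mean * channel_scale
            out_instance.append([v * channel_scale + channel_shift for v in channel])
        out.append(out_instance)
    return out


y = instance_norm_nchw([[[1.0, 3.0], [2.0, 2.0]]], scale=[1.0, 1.0], bias=[0.0, 5.0])
```

Once the per-channel scale and shift exist, the normalization itself is a single multiply-add per element, independent of layout.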
The data below comes from SD Turbo profiling results.

Before (InstanceNormalization overall time: 140.84 ms)

Kernel | Time (ms)
-- | --
InstanceNormalization\|InstanceNormComputeMean | 129.70
InstanceNormalization\|InstanceNormalizationNHWC | 10.55
InstanceNormalization\|InstanceNormComputeChannelScaleShift | 0.59

After (InstanceNormalization overall time: 59.44 ms)

Kernel | Time (ms)
-- | --
InstanceNormalization\|InstanceNormComputeChannelScaleShift | 28.57
InstanceNormalization\|TransposeShared | 20.19
InstanceNormalization\|InstanceNormalizationNHWC | 10.68
### Description
Specify the path of `ar`, `ld` and `libtool` when building the Apple
framework.
### Motivation and Context
Sometimes non-system executables will come before the system-provided
ones. This PR intends to prevent that from happening.
### Description
Fix an issue where QNN models shared from another session also use the session logger from the producer session, which causes confusion. Make the QNN model compute function use the session logger from the current session.
### Description
* Add MultiHeadAttention fusion for SAM2.
* Add LayerNormalization fusion for NCHW format by inserting Transpose
from NCHW to NHWC before layer normalization, and add another Transpose
after layer norm to convert NHWC back to NCHW. Hopefully, those extra
Transpose nodes will be removed when prefer_nhwc is enabled later.
* Add a condition that the input shall be 3D when fusing SkipLayerNorm.
* Update convert_to_onnx.py to add `--optimize` and `--use_gpu` options
to output optimized onnx model for CPU/CUDA eps.
* Add an option `--dtype fp16|fp32` in convert_to_onnx.py to support
converting optimized model to float16.
* Update the demo to use the optimized onnx models.
### Motivation and Context
To support optimization of SAM2 for CPU/CUDA eps that is exported in
https://github.com/microsoft/onnxruntime/pull/22119
### Description
When K == 0 output a MxN matrix filled with bias if present or filled
with zeros.
This brings it in line with MatMul behavior, especially when Gemm is used
to fuse MatMul with Add.
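The K == 0 behavior can be sketched with a naive reference loop. This is not the ORT kernel; the `gemm` helper is hypothetical and assumes the bias is a 1D vector of length N that is broadcast across rows.

```python
def gemm(a, b, m, n, bias=None):
    """Compute a @ b (+ bias) for nested lists; a is m x k and b is k x n."""
    k = len(b)  # K is the number of rows of b; it may be 0
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            # Start from the bias (or zero), so an empty K loop leaves it as is.
            acc = bias[j] if bias is not None else 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            out[i][j] = acc
    return out


# K == 0: a is 2 x 0 and b is 0 x 3, so the sum over K is empty.
zeros = gemm([[], []], [], m=2, n=3)
biased = gemm([[], []], [], m=2, n=3, bias=[1.0, 2.0, 3.0])
```

Because the inner loop never runs when K is 0, the result is the bias broadcast over all rows, or zeros when no bias is given.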
### Motivation and Context
* Comply with numpy spec of MatMul
* Address a case when empty initializers are used for computation.
### Description
The optional `axes` input may exist with an empty name and be a nullptr.
Update the CUDA implementation to handle this.
### Motivation and Context
#22035
### Description
Fixes the logic for getting the number of elements for the input and
output spans in the `MinMaxMLFloat16` method. This was incorrectly using
the full number of elements in the output rather than the number of
elements in the current span, which worked fine with 1D inputs but
breaks with 2D inputs.
This meant that as the `BroadcastLooper` iterated over spans,
`MinMaxMLFloat16` would start at a position further forward in the input
and output and read and write further beyond the end of the input and
output respectively, causing the asan error in #21558 and sometimes
segfaults in larger examples.
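The difference between the two element counts can be sketched as below. This is a simplified stand-in for the span iteration, not the `BroadcastLooper` API; the function name, the flat-list buffers and the fixed `span_size` are assumptions made for illustration.

```python
def min_per_span(lhs, rhs, out, span_size):
    """Apply element-wise min one span at a time over flat buffers."""
    total = len(out)
    for start in range(0, total, span_size):
        # Correct: the count is the number of elements in the current span.
        count = min(span_size, total - start)
        # The bug described above is analogous to using `count = total` here:
        # for every span after the first, start + count would then run past
        # the end of the input and output buffers.
        for i in range(start, start + count):
            out[i] = lhs[i] if lhs[i] < rhs[i] else rhs[i]
    return out


result = min_per_span([1, 5, 3, 7], [2, 4, 6, 0], [0] * 4, span_size=2)
```

With a single span (the 1D case) the two counts coincide, which is why the problem only showed up with 2D inputs.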
### Motivation and Context
Fixes #21558.
From further testing, this issue didn't only cause asan errors in tests
but also causes segfaults with larger inputs.
### Description
Decouple implementation for different A types to improve readability and
maintainability.
### Motivation and Context
As more types are added, the implementation can differ a lot between
types. Besides, different hardware may require different
implementations.
This PR creates an abstraction boundary where different implementations
can plug in easily.
Followed the ROCm example below it, which isn't the naming convention we
want to follow. Didn't fix ROCm because I'm not sure if there are
consumers using its naming convention.
### Description
Fix a random crash in QNN UTs with multi-threaded runs, such as
QnnHTPBackendTests.MultithreadHtpPowerCfgDefaultAndRunOption.
Root cause: a last-minute code change (b4e26bd5f9) replaced
`static std::mutex mutex;` with `OrtMutex mutex;` and missed the
`static`.
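Why the missing `static` matters can be illustrated with a Python analogy. This is not the QNN EP code; it assumes that without `static` each call (or instance) ends up with its own mutex, so callers never actually exclude each other. All names are hypothetical.

```python
import threading

# One lock shared by every caller, analogous to a `static` mutex.
_shared_lock = threading.Lock()
counter = 0


def increment_shared(times):
    """Every thread contends on the same lock, so updates are serialized."""
    global counter
    for _ in range(times):
        with _shared_lock:
            counter += 1


def increment_unshared(times):
    """Anti-pattern: a fresh lock per call excludes nobody."""
    global counter
    local_lock = threading.Lock()  # analogous to a non-static mutex
    for _ in range(times):
        with local_lock:
            counter += 1


def run(worker, threads=4, times=10000):
    """Run `worker` on several threads and return the final counter value."""
    global counter
    counter = 0
    pool = [threading.Thread(target=worker, args=(times,)) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counter
```

`run(increment_shared)` always returns `threads * times`; the unshared variant offers no such guarantee because each thread locks a different object.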
### Description
Update the DML EP so that the `FusedMatMul` ORT graph node has its
TransA/B attributes set instead of updating the strides.
### Description
Use the latest nuget.exe for the `readme` property to be supported.
### Motivation and Context
#22137
The spec renames MLOperandDescriptor.dimensions to
MLOperandDescriptor.shape. In order to support older Chromium versions,
we will keep both in the WebNN EP for a while.
Fixed #22120
### Description
ONNXRuntime implementation of S8S8 was using the default C++
implementation; with this new ISA, all variants of QGemm Int8 can
support VNNI dot product and full AVX2 instructions.
All signed/unsigned variants support VNNI instructions starting with
LNL.
Renamed structs and functions to better indicate support of all Int8 vs
U8X8
### Motivation and Context
LNL HW implemented new ISA, and this code enables that ISA in QGemm.
Speed is improved for S8S8 to match with existing U8S8 code. S8U8 would
also match speed if ONNX formally accepted the data type.
### Description
Fix regression caused by #17361
### Description
This PR refactors the `CPU` kernel for the `CumSum` operator. The new
implementation strives to have as little indirection as possible.
### Motivation and Context
Currently the `CumSum` operator performs very poorly on 1D tensors (it
was slower than a Python loop). This is caused by the extensive use of
`SliceIterator`s.
Here is a relevant snippet:
```python
import time
import ndonnx as ndx
import onnxruntime as ort
import numpy as np
import onnx


def test_cumsum(sz):
    # Build a model with a single CumSum over a 1D int64 tensor.
    a = ndx.array(shape=(sz,), dtype=ndx.int64)
    b = ndx.cumsum(a)
    model = ndx.build({'a': a}, {'b': b})
    onnx.save(model, "model.onnx")
    input = np.ones(sz, np.int64)
    start = time.time()
    result = ort.InferenceSession(model.SerializeToString()).run(None, {'a': input})
    end = time.time()
    return end - start


def test_cumsum_by_hand(sz):
    # Baseline: a plain Python loop computing the same running sum.
    input = np.ones(sz, np.int64)
    start = time.time()
    answer = [0]
    for i in input:
        answer.append(answer[-1] + i)
    end = time.time()
    return end - start


print(test_cumsum(int(1e7)))
print(test_cumsum_by_hand(int(1e7)))
```
Before
```console
0.9794480800628662
0.4518160820007324
```
After
```console
0.02483987808227539
0.5496008396148682
```
The `model.onnx`:
<img width="214" alt="image"
src="https://github.com/user-attachments/assets/a213d6ff-86c3-49b5-a493-ebfd97deaa41">
The flame graph:

### Description
Update XNNPack to latest version (Sep 4)
- Some op outputs are changed; channel or stride parameters are moved
into the reshape function, e.g. 96962a602d
- Input parameters of XNNPACK's resize-related functions are changed a lot
- KleidiAI is added as a dependency on ARM64
- The latest XNNPACK includes 2 static libs, microkernels-prod and
xnnpack. Without microkernels-prod, it fails with Undefined symbols
errors.
- Add ORT_TARGET_PROCESSOR to get the real processor target in CMake
### Description
See https://github.com/microsoft/onnxruntime-extensions/pull/476
and https://github.com/actions/runner-images/issues/7671
### Current issue
- [ ] For the default Xcode 15.2 that comes with macOS-13, we need to
update the boost container header boost/container_hash/hash.hpp version
to pass the build
- [x] For Xcode 14.2, the build passed but the `Run React Native Detox
Android e2e Test` failed.
Possible flaky test, https://github.com/microsoft/onnxruntime/pull/21969
- [x] For Xcode 14.3.1, we encountered the following issue in `Build React
Native Detox iOS e2e Tests`
```
ld: file not found: /Applications/Xcode_14.3.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/arc/libarclite_iphonesimulator.a
clang: error: linker command failed with exit code 1 (use -v to see invocation)
```
Appending the following code to the end of both ios/Podfile files fixed
the issue
```
post_install do |installer|
  installer.generated_projects.each do |project|
    project.targets.each do |target|
      target.build_configurations.each do |config|
        config.build_settings['IPHONEOS_DEPLOYMENT_TARGET'] = '13.0'
      end
    end
  end
end
```
- [x] https://github.com/facebook/react-native/issues/32483
Applying changes to ios/Podfile
```
pre_install do |installer|
  # Custom pre-install script or commands
  puts "Running pre-install script..."
  # Recommended fix for https://github.com/facebook/react-native/issues/32483
  # from https://github.com/facebook/react-native/issues/32483#issuecomment-966784501
  system("sed -i '' 's/typedef uint8_t clockid_t;//' \"${SRCROOT}/Pods/RCT-Folly/folly/portability/Time.h\"")
end
```
- [ ] Detox environment setup exceeded the timeout of 120000 ms during
the iOS e2e test
### dependent
- [x] https://github.com/microsoft/onnxruntime/pull/21159
---------
Co-authored-by: Changming Sun <chasun@microsoft.com>
`supportsModel` is deprecated in TRT 10.1.
Add `supportsModelV2` but still keep `supportsModel`, as we still need to
support TRT 8.6, where `supportsModelV2` is not supported.
Perf test data (100000 iterations)
Array: 12.599999997764826ms
String: 1.6000000014901161ms
Perf test case:
```
// Variant 1: collect the pieces in an array and join them at the end.
const permFunctionBodyArray = (rank: number, input: string): string => {
  const reverseFunc: string[] = [];
  reverseFunc.push(`fn perm(i: int) -> int {
var a: int;`);
  for (let i = 0; i < rank; ++i) {
    reverseFunc.push(input);
  }
  reverseFunc.push('return a;}');
  return reverseFunc.join('\n');
};

// Variant 2: build the same body with plain string concatenation.
const permFunctionBodyString = (rank: number, input: string): string => {
  let reverseFunc = `fn perm(i: int) -> int {
var a: int;`;
  for (let i = 0; i < rank; ++i) {
    reverseFunc += input;
  }
  reverseFunc += 'return a;}';
  return reverseFunc;
};

const count = 100000;
let start: number, end: number;

// Time the array-based variant.
console.time('array');
start = performance.now();
for (let i = 0; i < count; i++) {
  permFunctionBodyArray(3, 'input');
}
end = performance.now();
console.timeEnd('array');
console.log("Array: " + (end - start));

// Time the string-based variant.
console.time('string');
start = performance.now();
for (let i = 0; i < count; i++) {
  permFunctionBodyString(3, 'input');
}
end = performance.now();
console.log("String: " + (end - start));
console.timeEnd('string');
```
### Motivation and Context
This is to fix issue #22031 to run model demucs.
For conv-transpose, outputPadding.length could be 1, while spatialRank
is 2. The fix is to append enough 0s to outputPadding. For conv, the
issue is similar. kernelShape.length sometimes could be 1, while
inputs[1].dims.length is 4. The fix is also to append enough 0s to
kernelShape.
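The padding step can be sketched as below. This is a Python illustration rather than the JavaScript code in the fix; `pad_with_zeros` is a hypothetical helper, and treating the target length for `kernelShape` as the spatial rank (`inputs[1].dims.length - 2`) is an assumption.

```python
def pad_with_zeros(values, target_length):
    """Append zeros until `values` has `target_length` entries."""
    missing = max(0, target_length - len(values))
    return list(values) + [0] * missing


# conv-transpose: outputPadding has 1 entry while spatialRank is 2.
output_padding = pad_with_zeros([1], target_length=2)

# conv: kernelShape has 1 entry while the weight tensor has 4 dims
# (assumed to mean 2 spatial dims).
weight_dims = [8, 4, 3, 3]
kernel_shape = pad_with_zeros([3], target_length=len(weight_dims) - 2)
```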
### Description
Added checks to convert partial vectors in the early stages of the FP16
to FP32 cast using AVX NE CONVERT ISA.
### Motivation and Context
Avoid storing data outside of the output buffer. These checks are
missing from the [original
PR](https://github.com/microsoft/onnxruntime/pull/21183).
This fix prevents memory corruption when the output buffer size is in
the range [n*16 + 1, n*16 + 7] with n > 0.
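The general idea of handling a partial vector can be sketched in Python. This is not the AVX NE CONVERT kernel; `VECTOR_WIDTH` is a hypothetical lane count and the function name is made up. Full chunks are converted as a group, and the remaining elements are converted one at a time so that nothing is written past the requested count.

```python
import struct

VECTOR_WIDTH = 8  # hypothetical number of elements converted per "vector" step


def fp16_bytes_to_fp32(src: bytes, count: int):
    """Convert `count` little-endian float16 values without overrunning `out`."""
    out = [0.0] * count
    full = count - count % VECTOR_WIDTH
    # Full-width chunks: safe to convert and store VECTOR_WIDTH values at once.
    for start in range(0, full, VECTOR_WIDTH):
        chunk = struct.unpack_from(f"<{VECTOR_WIDTH}e", src, start * 2)
        out[start:start + VECTOR_WIDTH] = chunk
    # Partial tail: convert only the elements that actually exist.
    for i in range(full, count):
        out[i] = struct.unpack_from("<e", src, i * 2)[0]
    return out


values = [float(i) for i in range(17)]       # 17 = 16 + 1, a partial-tail size
packed = struct.pack("<17e", *values)
converted = fp16_bytes_to_fp32(packed, len(values))
```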
Patch from @john-dance:
"The main change is simple: Use the original node name rather than the
original node op_type when creating new nodes. Here are my comments on
the change:
------
The onnx runtime uses the op_type as the basis for a new node name, so a
node claimed by QNN EP might be named
Conv_token_1 with no relation to the original /conv1/Conv. This patch:
1. Adds OpName as a virtual function in NodeRef and implements it in
ApiNode.
2. AddNode now takes an op_name and op_type and passes them both to
CreateNodeHelper.
3. CreateNodeHelper uses the op_name rather than the op_type in
GenerateNodeName
4. Direct calls to AddNode are modified to either use the NodeRef if
available, or just repeat the op_type if not available.
The result is that the new nodes are named something like
/conv1/Conv_token_1, allowing a straight forward mapping back to the
original model node (if they exist in the original graph)."
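The naming change can be sketched as follows. This is a Python illustration, not the C++ transformer API; `generate_node_name` and `add_node` only loosely mirror `GenerateNodeName` and `AddNode`, and the `_token_<n>` numbering scheme shown here is an assumption based on the example names above.

```python
def generate_node_name(base, existing_names):
    """Return a unique name of the form `<base>_token_<n>` and record it."""
    suffix = 1
    while f"{base}_token_{suffix}" in existing_names:
        suffix += 1
    name = f"{base}_token_{suffix}"
    existing_names.add(name)
    return name


def add_node(op_type, existing_names, op_name=None):
    """Name the new node after the original node name when one is available."""
    # Fall back to repeating the op_type when there is no original node.
    base = op_name if op_name else op_type
    return generate_node_name(base, existing_names)


names = set()
from_original = add_node("Conv", names, op_name="/conv1/Conv")
without_original = add_node("Conv", names)
```

With the original node name as the base, the generated name keeps a visible link back to the node in the source model.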
### Description
Adds support for constructing an `OrtSession` from a
`java.nio.ByteBuffer`. These buffers can be memory mapped from files
which means there doesn't need to be copies of the model protobuf held
in Java, reducing peak memory usage during session construction.
### Motivation and Context
Reduces memory usage on model construction by not requiring as many
copies on the Java side. Should help with #19599.
- Remove hard-coded data type checks and use WebNN's opSupportLimits
instead
- Add HasSupportedOutputsImpl for output data type validation
- Get preferred layout info from opSupportLimits
- Move the Not op to logical_op_builder.cc because it should be there.
This avoids the inconsistent input names in `unary_op_builder.cc`.
### Description
This PR will add support for Continuous Decoding for batch_size = 1
input. From now on, GQA can take arbitrary length input using seqlens_k
as total_sequence_length - 1 and the sequence length of qkv as
new_sequence_length.
**This change will not affect the default behavior of GQA**
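The length bookkeeping for one continuous-decoding step can be sketched as below. It assumes that `total_sequence_length` is the past context length plus the new chunk length, which the description implies but does not state; the helper and the dictionary keys are hypothetical and only mirror the names used above.

```python
def gqa_lengths(past_sequence_length, new_sequence_length):
    """Derive the GQA length inputs for one batch_size = 1 decoding step."""
    # Assumption: the total covers the past context plus the new tokens.
    total_sequence_length = past_sequence_length + new_sequence_length
    return {
        "total_sequence_length": total_sequence_length,
        "seqlens_k": total_sequence_length - 1,
        "qkv_sequence_length": new_sequence_length,
    }


# 10 tokens of past context, then a new chunk of 4 tokens.
step = gqa_lengths(past_sequence_length=10, new_sequence_length=4)
```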
### Motivation and Context
Prior to this change it was impossible to support sequence_length > 1
inputs when past context was given. This use case is essential to making
continuous decoding work, which is one of our current efforts in
ORT-GenAI.