### Description
Add onnxruntime/test/run_benchmark.py helper script to repeat benchmark
runs until a target coefficient of variance is reached. It works with
[Google Benchmark](https://github.com/google/benchmark) programs like
`onnxruntime_mlas_benchmark`.
### Motivation and Context
Sometimes there is variability in benchmark run results. This automates
the repeated running needed to get results that are stable enough.
### Description
<!-- Describe your changes. -->
Re-order so that we don't get two messages for the one node.
Currently the batched matmul 'not supported' message will appear for 2D
input which is valid, which can be confusing to understand.
Change the order so we only check if batched matmul can be used when the
input ranks are > 3, as that is one of the requirements.
c311d1faf5/onnxruntime/core/providers/nnapi/nnapi_builtin/builders/op_builder_helpers.cc (L257-L264)
### Description
This adjusts the path used in the nuget script for dnnl to the new
location of the file.
There isn't a CI pipeline for this as far as I can tell, and I can't
easily confirm this change works on master, so please check.
### Motivation and Context
It is currently not possible to build onednn nuget packages. It's
possible that the correct action would be to move the file not fix this
path, but I'm not familiar enough with the repository layout.
---------
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
For C++ standards >= 20, use `std::chrono::operator<<` in place of
`date::operator<<` to fix ambiguous operator compile error.
### Motivation and Context
The external dependency HowardHinnant/date has a conflict with
std::chrono for >=C++20.
Solves #20137
### How to run it locally
1. conda install ninja
2. "C:\Program Files\Microsoft Visual
Studio\2022\Enterprise\VC\Auxiliary\Build\vcvarsall.bat" x64
3. python.exe {ort_repo}\tools\ci_build\build.py --config RelWithDebInfo
--build_dir {ort_repo}\build_cuda --skip_submodule_sync --build_csharp
--update --parallel --cmake_generator "Ninja" --build_shared_lib
--enable_onnx_tests --enable_pybind --build_java --build_nodejs
--use_cuda "--cuda_home=C:\Program Files\NVIDIA GPU Computing
Toolkit\CUDA\v11.8" --enable_cuda_profiling --cmake_extra_defines
CMAKE_CUDA_ARCHITECTURES=60
4. cd build_cuda\RelWithDebInfo
5. cmake --build . j16
### Motivation and Context
In packaging pipelines, we often come across a random issue that the
building with CUDA on Windows takes too much time.
Although it has been reduced much by moving the building to the CPU
machine.
We're planning to build with Ninja instead of msbuild in Packaging
pipelines, thus, nvcc can run parallelly.
It's the first step to support it locally.
Adds an example to demonstrate the export of openai whipser
implemenation with batch_size > 1 and addition of prompts for each audio
snippet.
Also handles the scenario for when prompts are not of the same size. For
example if our prompt ids are [p1_id_1, p1_id_2] and [p2_id_1], the
final decoder_input_ids will look as such after padding:
`[prev_token, p1_id_1, p1_id_2, start_token, lang_token,
transcribe_token]
[prev_token, p2_id_1, PAD_TOKEN, start_token, lang_token,
transcribe_token]`
---------
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
### Description
<!-- Describe your changes. -->
See #19921 Just to address one comment:
https://github.com/microsoft/onnxruntime/pull/19921#discussion_r1543398640
since this is an external branch. need to open another pull request for
this.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Sai Kishan Pampana <sai.kishan.pampana@intel.com>
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: Jian Chen <cjian@microsoft.com>
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Certain transformers slow down session loading time while providing no
runtime perf benefits.
Allow clients to exclude them.
Previous [PR
](https://github.com/microsoft/onnxruntime/pull/19663)changes msvc
compiler warning level from set_msvc_c_cpp_compiler_warning_level(3) to
set_msvc_c_cpp_compiler_warning_level(4) when using CUDA EP (it also
applies to TRT EP).
Some warnings still need to be addressed in TRT EP code.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
The dml_provider_factory header file can't be used in C programs as it
defines C++ inline operators. This PR rearranges that header file so
that it looks like valid C when used from C, and also makes a couple of
small modifications to the Java code so it correctly binds to the DML EP
at build time.
I'm having some difficulty testing it as I think it's pulling in the old
version of DirectML on my computer and I can't figure out what the
library loading path is in Java to make it look at the recent version I
downloaded. So the test I added fails with:
```
InferenceTest > testDirectML() FAILED
ai.onnxruntime.OrtException: Error code - ORT_RUNTIME_EXCEPTION - message: Exception during initialization: <path-to-ort>\onnxruntime\core\providers\dml\DmlExecutionProvider\src\AbiCustomRegistry.cpp(518)\onnxruntime.dll!00007FFF74819333: (caller: 00007FFF74793509) Exception(3) tid(4f58) 80070057 The parameter is incorrect.
at app//ai.onnxruntime.OrtSession.createSession(Native Method)
at app//ai.onnxruntime.OrtSession.<init>(OrtSession.java:74)
at app//ai.onnxruntime.OrtEnvironment.createSession(OrtEnvironment.java:236)
at app//ai.onnxruntime.OrtEnvironment.createSession(OrtEnvironment.java:221)
at app//ai.onnxruntime.InferenceTest.openSessionSqueezeNet(InferenceTest.java:1961)
at app//ai.onnxruntime.InferenceTest.runProvider(InferenceTest.java:665)
at app//ai.onnxruntime.InferenceTest.testDirectML(InferenceTest.java:657)
```
But it does correctly compile, and this error seems very similar to
other issues with the DML provider when it doesn't like a model due to
the loaded library being old. The test is using the squeezenet file
that's been in the repo since 2019. If someone can help me figure out
how to get the right version of DML in the library path I can test it
more on my end. I tried adding the folder with the new version into the
system path, but I'm not very familiar with Windows' library loading
behaviour.
### Motivation and Context
Fixes#19656 to allow use of the DirectML EP from ORT Java.
cc @martinb35
### Description
New flag of `dump_om_model` for **CANN EP**, which defaults to "True".
### Motivation and Context
When building an onnx model with CANN EP, the intermediate **OM(offline
model for Ascend NPU)** is automatically saved. There are some users
don't want to dump OM when resources are limited.
This PR will resovle this situation with `dump_om_model=False`
### Description
See issue 15465: https://github.com/microsoft/onnxruntime/issues/15465
This PR just applies the workaround suggested in the thread that I and
numerous others on the thread have validated to work for them and allows
them to successfully export a PyTorch model with LSTM layers that are
dynamically quantized by ONNX.
### Motivation and Context
It is not possible to successfully export a dynamically quantized LSTM
model that I have trained for use in the onnx runtime without this
change.
Currently, this workaround lives as a local change in my python package
directory, and makes it basically impossible for anyone else at the
place I work at to successfully export the quantized model that I am
exporting.
See issue 15465: https://github.com/microsoft/onnxruntime/issues/15465
Co-authored-by: Dhruv Matani <dhruv.matani@grammarly.com>
### Description
<!-- Describe your changes. -->
The cost computation of ComputeVxAttentionScore is wrong. It should be
sequence_length * v_head_size * total_sequence_length instead of
sequence_length * v_head_size * sequence_length.
The PR also fine-tuned the cost computation.
on my local box with i9 cpu, the performance is same as unfused version,
but it is much faster on an azure vm with 16 threads.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
https://github.com/microsoft/onnxruntime/issues/19924
### Description
Created a new `LabelEncoderFusion` pass. This is useful in model that
result from automatic conversion tools related to data-science:
sometimes the produced model contains consecutive `LabelEncoder`-s.
To merge 2 `LabelEncoder`-s the optimizer propagates the outputs of the
first encoder through the second one.
### Motivation and Context
This enhances the capabilities of the `onnxruntime::optimizer` by fusing
consecutive `LabelEncoder` nodes.
### Fusion examples
```
Applying fusion
node1: (a,C) (b,B) (c,A) -> Default: _Unused
node2: (A,1) (B,2) (C,3) -> Default: -1
fused: (a,3) (b,2) (c,1) -> Default: -1
Applying fusion
node1: (a,C) (b,B) (c,A) -> Default: D
node2: (A,a) (B,b) (C,c) (D,d) -> Default: default
fused: (a,c) (b,b) (c,a) -> Default: d
Applying fusion
node1: (a,0) (b,1) (c,2) -> Default: -1
node2: (2,a) (1,b) (0,c) -> Default: default
fused: (a,c) (b,b) (c,a) -> Default: default
Applying fusion
node1: (a,3) (b,2) (c,1) -> Default: -1
node2: (1,a) (2,b) (3,c) -> Default: d
fused: (a,c) (b,b) (c,a) -> Default: d
```
---------
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
### Description
Except [Python-CUDA-Packaging
pipeline](https://dev.azure.com/aiinfra/Lotus/_build?definitionId=1299&_a=summary),
all windows cuda packaging jobs have been running well now.
After comparison, enable_lto isn't added in the pipeline, which might be
one root cause of the random hang.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
This pull request addresses several issues:
- The DML Graph's nodes were not sorted in a topologically ordered
sequence, leading to crashes during deserialization when a child node
preceded its parent node. This PR resolves this issue by implementing a
topological sorting algorithm before serialization.
- During the `RemoveUnconnectedNodes` process:
- we update `intermeidateEdge.FromNodeIndex`. Additionally, we must
update `intermediateEdge.Name` when it includes
`intermediateEdge.FromNodeIndex`, as serialization/deserialization
heavily relies on edge names.
- we also eliminate unused edges. Consequently, we must erase inputs
(now unused) from corresponding maps
`serializedGraphInputIndexToSubgraphInputIndex` and
`serializedGraphLargeConstantNameToSubgraphInputIndex`.
### Motivation and Context
Why is this change required? What problem does it solve?
There are few ONNX Zoo public models which were crashing during
deserialization.
<!-- - - If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Jeff Bloomfield <38966965+jeffbloo@users.noreply.github.com>
### Description
This PR adds flash attention v2 and support for INT4 CUDA benchmarking
in PyTorch.
### Motivation and Context
The [flash attention v2](https://github.com/Dao-AILab/flash-attention)
algorithm helps improve model performance in PyTorch. Support for INT4
CUDA in PyTorch is done through the
[`bitsandbytes`](https://github.com/TimDettmers/bitsandbytes) package.
### Description
<!-- Describe your changes. -->
- update axis value for per_channel quantization of QDQConv
- we should use `axis=1` for ConvTranspose operator.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
- this PR fixes https://github.com/microsoft/onnxruntime/issues/19694,
which I have opened
### Description
<!-- Describe your changes. -->
Previously the expert weights are in row-major. But with the updated
cutlass extension introduced by
https://github.com/microsoft/onnxruntime/pull/20108, weights are stored
in col-major that aligns with Pytorch implementation. This change fixes
the way the tensors are sliced across shards.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Enable NPUs supporting DXCORE_ADAPTER_ATTRIBUTE_D3D12_GENERIC_ML and
D3D_FEATURE_LEVEL_1_0_GENERIC with DML EP. This also begins ingesting DX
headers through the DirectX-Headers repo.
Note that this includes an update to cgamanifest.json for onnx-tensorrt
which is triggered during re-generation due to a prior changes to
deps.txt.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
Add API function GetAliasMap and ReleaseAliasMap in OrtCustomOp
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Add API function GetAliasMap and ReleaseAliasMap in OrtCustomOp
### Description
for nodes like add, their input should be merged dynamically
### Motivation and Context
when doing shape inference, for nodes like Add, currently when doing _onnx_infer_single_node, their inputs are generated from last node's output, but they should be merged.
### Description
Fixes code that extracts the accuracy level when creating a MatMulNBits
node in the `DefaultWeightOnlyQuantizer` class.
### Motivation and Context
Error from line 443: `AttributeError: 'DefaultWeightOnlyQuantizer'
object has no attribute 'accuracy_level'`. The solution is to access
`self.config.accuracy_level` instead of `self.accuracy_level`.
Relevant commit: https://github.com/microsoft/onnxruntime/pull/19106
### Description
<!-- Describe your changes. -->
1. Introduce latest cutlass extension from TRTLLM that gives us cutlass
upgrade(to 3.4) opportunity from MoE side.
2. Fix Windows build issue
3. Add Int4 MoE op and ut
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Fix transformer layer detection for recompute
Originally logic miss detecting the layer boudary node in Mistral model.
This PR simplifies the searching, by using more strong pattern's match,
to make sure it is flexible enough to cover different transformer
variants.
Also add a UT.
Add a warning when user enable layerwise recompute but no layer boudary
nodes are found.
### Description
<!-- Describe your changes. -->
use OrtCustomOp's new API GetMayInplace in CreateKernelCreateInfo to
hook the inplace map of custom ops
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This PR is to use OrtCustomOp's new API GetMayInplace in
CreateKernelCreateInfo to hook the inplace map of custom ops
### Description
Bump spotless and the Gradle wrapper to 6.25.0 and 8.6 respectively to
allow compiling ORT on Java 21. The build still targets Java 8.
I'm not sure if there will be CI changes necessary to use this PR,
specifically for the Gradle version as I don't know if that is cached
somewhere earlier in the CI build process.
The new Gradle version adds a warning that using `--source` and
`--target` to select the Java language version is obsolete which is
annoying, we can fix it if we decide to only allow building on newer
versions of Java, while still supporting running on Java 8.
### Motivation and Context
Java 21 is the latest LTS release of Java and ORT should be able to
build on it.
### Description
Expose Reserve() in OrtAllocator to allow custom allocators to work when
session.use_device_allocator_for_initializers is specified.
Update: this change has been verified by Bing Ads and brings a
significant benefit in terms of memory utilization: 30GB less memory and
also better CPU utilization.
### Motivation and Context
https://microsoft-my.sharepoint.com/:w:/p/prs/Eeidf5YNtWtKrPVkfuTDsuABak1oL4QRpuBGuhqRbLKoJg?e=Zl3bah
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Yi Zhang <your@email.com>
### Description
reactor win-ci.yml to solve the random hang issue in more GPU workflows,
move nugget-zip packages and python cuda12 packages building to CPU
machine.
---------
Co-authored-by: Yi Zhang <your@email.com>
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Keep original name during fusion
This could be helpful to know where the fused node coming from, I feel
this is very useful when debugging the execution order issues between
different transformer layers.
For example:
- A node named
`/_original_module/model/layers.1/self_attn/MatMul/MatmulTransposeFusion//MatMulScaleFusion/`
goes through two fusion paths in the 1st transformer layer - e.g.
`MatmulTransposeFusion` and `MatMulScaleFusion`.
-
`/_original_module/model/layers.2/post_attention_layernorm/Mul_1/SimplifiedLayerNormFusion/`
node is a fused node by `SimplifiedLayerNormFusion`.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Address build issues and source code discrepancies.
Fix cuda_test_provider gtest argument stack corruption.
### Motivation and Context
`OpTester` class that is widely used for kernel testing is not
suitable for testing internal classes for EPs that are built as shared
objects.
Currently, CUDA EP tests run only on Linux.
We want to enable testing and developments on Windows,
and create a usable pattern for testing of other EPs internals.
Alternatives considered:
Abstracting EP unit tests into separate test executable such as
`onnxruntime_test_all`.
This alternative was rejected as it would create a lot more changes in
the established patterns,
and potentially interfere with CUDA functionality with more complex
source code maintanence.
### Description
In #20073, I use pin onnx version to unblock the whole PR CI.
In fact, we could use the onnx that installed by building source code,
that the onnx version is controlled by deps.txt.
For some history reason, DML stage installed onnx from pypi. Now, the
onnx can be installed as other stages.
add an option to skip installing onnx in win-ci-prebuild-step
### Description
Add ModelProto support for `quantize` api
### Motivation and Context
Currently, the `quantize` API only accepts a model path as the input
model. However, for large models, saving and loading from disk can be
time-consuming. By adding `ModelProto` as an input option to the
`quantize` API, significant time can be saved.
### Description
Make error friendly when isOrtFormat is undefined
(`onnxruntime.InferenceSession.create` is called with ArrayBuffer or
Uint8Array).
### Motivation and Context
I was trying to run my onnx model in WebGL EP, but it gave me the error
"Cannot read properties of null (reading 'irVersion')".
I used debugger to find that actual error is `int64 is not supported`,
but the error was invisible for me.
So I made it to show both error when isOrtFormat is undefined.
<s>I haven't written unit test yet, so I'm making it draft. (I have no
idea about how do I test this though...)</s>
[d62d942](d62d9425ba)
### Description
Sometimes the `npm test` failed with an error of "TypeError: Failed to
fetch".
I checked the callback entry of the localhost server started by karma.
When the "Failed to fetch" happens, no request is reflected on the
server side. The root cause is still not identified. However, as this
issue only happens sometimes when the browser is just launched by karma
runner, doing retry can workaround this issue for most of the time.
The `CreateTensorRTCustomOpDomainList()` is not thread-safe due to its
static variables, `created_custom_op_list` and `custom_op_domain`.
This PR makes sure synchronization using mutex.
see issue: https://github.com/microsoft/onnxruntime/issues/20089