### Description
Add a temporary path to RN 0.69.3 to update the boost url
### Motivation and Context
Fix the React-native CI until we update the RN to 0.70.15 or 0.73.3+
versions
ONNX's MatMul is same as numpy.matmul, which supports input tensors with
rank >= 1. But QNN's MatMul can only support input tensors with rank >=
2. This PR is to add MatMulOpBuilder for QNN EP to build QNN graph to
support all possible cases of ONNX's MatMul, by adding Reshape nodes if
necessary, e.g., if Reshape 1D input to 2D if exists, and Reshape output
to expected shape at the end.
This PR also tries to use FullyConnected Op for MatMul if 2nd input is
2D initializer or 1D tensor because FullyConnected is faster than MatMul
on QNN EP. If 2nd input is 2D tensor, we require it an initializer
because FullyConnected requires 2nd input in [n, k] shape, we can
transpose it when graph building if it's an initializer (we don't want
to add extra Transpose node).
Use swin_base model as example, which contains several MatMul nodes with
2nd input is 2D initializer (not followed by Add), running on Gen3
mobile device, before the change, it takes 34.8876 ms, after this
change, it's 27.0639 ms.
Some quantized models have QDQ around Conv/Gemm but the weight and/or
bias are not quantized. This PR adds WeightBiasQuantization optimizer to
quantize float weight and/or bias to INT8 and INT32 tensors
respectively. We only do this for weight and/or bias initializer so that
ConstantFolding will fold the sub-graph to real quantized initializers
during the graph optimization next round.
### Description
Fix comparison of narrow type with wide type in loop condition.
### Motivation and Context
Comparison between types of different widths in a loop condition can
cause the loop to fail to terminate.
### Description
Fusing Pad & AveragePool requires AveragePool to use
`count_include_pad=1`. If the AveragePool already set some padding and
`count_include_pad=0`, fusion can't happen.
This PR adds a condition to perform fusion depending on those
attributes. If fusion occurs, `count_include_pad` is always set to `1`.
### Motivation and Context
Fix#22177 (mislabelled as a performance issue but there's an actual bug
in the implementation)
Bug introduced in #21556
### Description
This PR 1) uses override shape instead of tensor original shape in
shader key to reduce some shader variants; 2) adds indices shape rank to
shader key in case some potential errors.
### Description
Separating result processor out from profiler.py without changing the
behaviors of current profile.py
### Motivation and Context
Less dependency and smaller code for processing profile from other
scenarios.
---------
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
### Description
1. Currently Python-Cuda-Publishing-Pipeline only publishes Linux
wheels, not Windows wheels. It is because recently we refactored the
upstream pipeline("Python-CUDA-Packaging-Pipeline") to use 1ES PT. This
PR fixed the issue
2. tools/ci_build/github/azure-pipelines/stages/py-win-gpu-stage.yml no
longer includes component-governance-component-detection-steps.yml ,
because 1ES PT already inserted such a thing
3. Delete tools/ci_build/github/windows/eager/requirements.txt because
it is no longer used.
### Motivation and Context
The "Python-CUDA-Packaging-Pipeline" is for CUDA 12.
"Python CUDA ALT Packaging Pipeline" is for CUDA 11.
The two pipelines are very similar, except the CUDA versions are
different.
Each of them has three parts: build, test, publish.
"Python-CUDA-Packaging-Pipeline" is the first part: build.
"Python CUDA12 Package Test Pipeline" is the second part.
"Python-Cuda-Publishing-Pipeline" is the third part that publishes the
packages to an internal ADO feed.
Move Linux GPU CI pipeline to A10 machines which are more advanced.
Retire onnxruntime-Linux-GPU-T4 machine pool.
Disable run_lean_attention test because the new machines do not have
enough shared memory.
```
skip loading trt attention kernel fmha_mhca_fp16_128_256_sm86_kernel because no enough shared memory
[E:onnxruntime:, sequential_executor.cc:505 ExecuteKernel] Non-zero status code returned while running MultiHeadAttention node. Name:'MultiHeadAttention_0' Status Message: CUDA error cudaErrorInvalidValue:invalid argument
```
### Description
This PR is convenient to do post processing for the generated json file
when profiling is enabled. Kernel type can be used to aggregate the same
type kernels' overall time.
### Description
Use `https.get` instead of `fetch` in ORT Nodejs binding package install
script.
### Motivation and Context
According to discussions in #23232, the package `global-agent` cannot
work with `fetch` API. To make it work with the proxy agent, this PR
replaces the `fetch` API with `https.get` in the install script.
### Description
The Web CI pipeline uses three different Windows machine pools:
1. onnxruntime-Win2022-webgpu-A10
2. onnxruntime-Win2022-VS2022-webgpu-A10
3. onnxruntime-Win-CPU-2022-web
This PR merges them together to reduce ongoing maintenance cost.
### Description
<!-- Describe your changes. -->
Changed all support tensor type from ir 9 to ir 10.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
- See issue https://github.com/microsoft/onnxruntime/issues/23205
Co-authored-by: Yueqing Zhang <yueqingz@amd.com>
### Description
<!-- Describe your changes. -->
For legacy jetson users who use jetpack 5.x, the latest TRT version is
8.5.
Add version check to newer trt features to fix build on jetpack 5.x
(cuda11.8+gcc11 are required)
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Make arrays with cubin data const.
### Motivation and Context
Non-const arrays are put into the .data section which might cause
excessive memory usage in some scenarios. Making cubin arrays const
allows them to be put into the .rodata section.
Remove PostBuildCleanup tasks since it is deprecated. It is to address a
warning in our pipelines:
"Task 'Post Build Cleanup' version 3 (PostBuildCleanup@3) is dependent
on a Node version (6) that is end-of-life. Contact the extension owner
for an updated version of the task. Task maintainers should review Node
upgrade guidance: https://aka.ms/node-runner-guidance"
Now the cleanup is controlled in another place:
https://learn.microsoft.com/en-us/azure/devops/pipelines/yaml-schema/workspace?view=azure-pipelines
The code change was generated by the following Linux command:
```bash
find . -name \*.yml -exec sed -i '/PostBuildCleanup/,+2d' {} \;
```
### Description
Refactor compute plan profiling
Support cache coreml model to speed up session initialization. this is
only support by user provided entry and user responsible to manage the
cache
With the cache, session initialization time can be reduced by 50% or
more:
|model| before| after|
|--|--|--|
|yolo11.onnx| 0.6s|0.1s|
|yolo11-fp16.onnx|1.8s|0.1s|
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: wejoncy <wejoncy@.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
The algorithm of `SkipSimplifiedLayerNormalization` is quite similar to
the `SimplifiedLayerNormalization`, only different is
`SkipSimplifiedLayerNormalization` provides an additional output used
for calculating the sum of the input, skip and bias (if it exits).
BTW, fix a bug in `SimplifiedLayerNormalization`, adding bias if it
exits.
### Description
Fixes crash in QNN dlls when an ETW callback tries to change the QNN log
level. This is caused by a function that does not lock a mutex before
modifying the QNN log level.
### Motivation and Context
An ETW callback into QNN EP leads to a crash within QNN SDK dlls. It
happens approximately 1 out of 3 full QNN unit tests runs.
The cause is a multithreading synchronization bug in QNN EP. We're not
always locking a mutex when ETW calls QNN EP to notify of ETW config
change.
There are two branches in the QNN EP callback function that try to
update the QNN log handle. One branch correctly locks a mutex, but other
does not lock it at all. This causes crashes within QNN dlls.
- Does not lock mutex:
[onnxruntime/onnxruntime/core/providers/qnn/qnn_execution_provider.cc at
main ·
microsoft/onnxruntime](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/qnn/qnn_execution_provider.cc#L426)
- Locks mutex:
[onnxruntime/onnxruntime/core/providers/qnn/qnn_execution_provider.cc at
main ·
microsoft/onnxruntime](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/qnn/qnn_execution_provider.cc#L442)
The fix is to lock the mutex in both paths.
### Description
Introduces a new optional input (encoder_ibnput_ids) in the decoder
graph of the T5 implementation for BeamSearch. This allows usage of
pointer generator networks in decoder graph.
### Motivation and Context
- Fixes#23123
### Description
<!-- Describe your changes. -->
1. Add support for throwing error when hardware is not supported for
VitisAI.
2. Add support for unloading VitisAI EP.
3. Add API for Win25.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This is requirement for Win25
### Description
This change fixes the WebGPU delay load test.
<details>
<summary>Fix UB in macro</summary>
The following C++ code outputs `2, 1` in MSVC, while it outputs `1, 1`
in GCC:
```c++
#include <iostream>
#define A 1
#define B 1
#define ENABLE defined(A) && defined(B)
#if ENABLE
int x = 1;
#else
int x = 2;
#endif
#if defined(A) && defined(B)
int y = 1;
#else
int y = 2;
#endif
int main()
{
std::cout << x << ", " << y << "\n";
}
```
Clang reports `macro expansion producing 'defined' has undefined
behavior [-Wexpansion-to-defined]`.
</details>
<details>
<summary>Fix condition of build option
onnxruntime_ENABLE_DELAY_LOADING_WIN_DLLS</summary>
Delay load is explicitly disabled when python binding is being built.
modifies the condition.
</details>
### Description
CMake's
[target_link_libraries](https://cmake.org/cmake/help/latest/command/target_link_libraries.html#id2)
function accepts plain library name(like `re2`) or target name(like
`re2::re2`) or some other kinds of names. "plain library names" are
old-fashioned, for compatibility only. We should use target names.
### Motivation and Context
To make vcpkg work with winml build. See #23158
### Description
<!-- Describe your changes. -->
Pre-packing is a feature, that allows kernels to re-arrange weights data
to gain performance at interference time
Currently, pre-packed blobs are shared when a cross-session weight
sharing is enabled and only for those weights that are marked as shared
by the user. Otherwise, data resides on the heap, the kernels own the
data which may be duplicated.
This change enables pre-packed data to be stored on disk alongside with
the external initializers.
The pre-packed blobs are memory mapped and are loaded into either the
X-session shared container
or a new container that shares pre-packed blobs within the session.
With the new approach, pre-packed blobs are always owned by the shared
container using the existing pre-pack mechanism for sharing. When
X-session sharing is enabled, then the external container owns the data.
A separate container owned by a root `SessionState` owns and shares the
data when X-session sharing is not enabled.
To facilitate this new approach, we introduce a new container that works
in two modes. When an optimized model is being saved, and pre-packed
weights saving is enabled, the new container will record pre-packed
blobs and serialize them to disk using existing
`ToGraphProtoWithExternalInitializers` function.
To externalize the pre-packed weights, we introduce a new session option
`kOrtSessionOptionsSavePrePackedConstantInitializers.` Note, that
pre-packing should be enabled (default) for this to work.
`ToGraphProtoWithExternalInitializers`function is modified to recurse
into subgraphs to make sure we properly account for local initializer
names.
In the second mode, the container would simply hold the pre-packed
weights memory-mapped from disk and share them with the kernels.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Reduce memory usage by pre-packed initializers and externalize them.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Enhancements to EPContext Operations:
1. Introduced support for the bfloat16 data type in EPContext operations.
2. Bug Fix: Missing Custom OP Schema Registration when generator EPContext ONNX model
---------
Co-authored-by: mingyue <mingyue@xilinx.com>
Co-authored-by: Hector Li <hecli@microsoft.com>
### Description
After the optimization of prefill time with #23102, it seems that always
using the tile matmulnibits with block_size = 32 can bring better
performance even for discrete gpu for phi3 model.
Phi3 becomes 42.64 tokens/sec from 32.82 tokens/sec in easy mode on my
NV RTX 2000 GPU.
### Description
This change allows the `WebGpuContext` class to be released after all
active inference sessions are released. This will cause:
- for default context (ID=0), the underlying `wgpu::Device` and
`wgpu::Adapter` to be released, together with all resources created by
the Device.
- for custom context (ID>0), the reference counts of passed in Instance,
Adapter and Device will decrement correctly.
### Description
<!-- Describe your changes. -->
Update CIs to TRT10.7
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
This change fixes the DLL delay load problem for the WebGPU EP and
DirectML EP. See detailed explanation below.
### Problem
When onnxruntime.dll uses delay loading for its dependencies, the
dependencies are loaded using `LoadLibraryEx()`, which search the
directory of process (.exe) instead of this library (onnxruntime.dll).
This is a problem for usages of Node.js binding and python binding,
because Windows will try to find the dependencies in the directory of
node.exe or python.exe, which is not the directory of onnxruntime.dll.
There was previous attempt to fix this by loading DirectML.dll in the
initialization of onnxruntime nodejs binding, which works for DML EP but
is not a good solution because it does not really "delay" the load.
For WebGPU, the situation became worse because webgpu_dawn.dll depends
on dxil.dll and dxcompiler.dll, which are explicitly dynamically loaded
in the code using `LoadLibraryA()`. This has the same problem of the DLL
search.
### Solutions
For onnxruntime.dll loading its direct dependencies, it can be resolved
by set the [`__pfnDliNotifyHook2`
hook](https://learn.microsoft.com/en-us/cpp/build/reference/understanding-the-helper-function?view=msvc-170#structure-and-constant-definitions)
to load from an absolute path that constructed from the onnxruntime.dll
folder and the DLL name.
For webgpu_dawn.dll loading dxil.dll and dxcompiler.dll, since they are
explicitly loaded in the code, the hook does not work. Instead, it can
be resolved by ~~using WIN32 API `SetDllDirectory()` to add the
onnxruntime.dll folder to the search path.~~ preloading the 2 DLLs from
the onnxruntime.dll folder .
### Description
This change fixes the build break for Node.js binding on latest
AppleClang:
```
...tensor_helper.cc:65:5 error: integer value -1 is outside of the valid range of values [0,15] for the enumeration type 'napi_typedarray_type' [-Wenum-constexpr-conversion]
```
Use the underlying type of enum `napi_typedarray_type` for
`DATA_TYPE_TYPEDARRAY_MAP` to solve this issue.
Because the underlying type is implementation defined (it's `int` for
MSVC and `unsigned int` for Clang), we use `std::underlying_type_t` to
get the correct type.
### Description
Previously we wanted to add DirectML EP to existing onnxruntime Windows
CUDA packages. After careful consideration, we will postpone the change.
This PR reverts some pipeline changes previously made by @mszhanyi and
@jchen351 .
### Description
* Update python version metadata to be in sync with latest python
packages (onnxruntime, onnxruntime-gpu and onnxruntime-qnn).
* Update black format target-version to 3.10, and use lintrunner to
format all files.
* Update the lintrunner installation command line to be consistent.
* Include `requirements-lintrunner.txt` in `requirements-dev.txt` to
avoid duplicated settings.
### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/22993
Python support by numpy:
https://numpy.org/neps/nep-0029-deprecation_policy.html#drop-schedule
```
On Apr 05, 2024 drop support for Python 3.9
On Apr 04, 2025 drop support for Python 3.10
```