update to use integrated vaip
update vaip as the single entry point for cmake
add dummy vaip_xcompile_run.
onnxruntime_vitisia_ep.dll is optional.
vaip_xcompiler_compile maybe nullptr.
update create_ep_context_nodes.
### Description
This PR is to update the win-ort-main branch to the tip main branch as
of 2025-01-23.
### PR List
ddf0d377a7 [QNN EP] Add LoggingManager::HasDefaultLogger() to provider
bridge API (#23467)
05fbbdf91f [QNN EP] Make QNN EP a shared library (#23120)
1336566d7f Add custom vcpkg ports (#23456)
2e1173c411 Update the compile flags for vcpkg packages (#23455)
1f628a9858 [Mobile] Add BrowserStack Android MAUI Test (#23383)
009cae0ec8 [js/webgpu] Optimize ConvTranspose (Continue) (#23429)
04a4a694cb Use onnx_protobuf.h to suppress some GCC warnings (#23453)
2e3b62b4b0 Suppress some strict-aliasing related warnings in WebGPU EP
(#23454)
b708f9b1dc Bump ruff from 0.9.1 to 0.9.2 (#23427)
c0afc66b2a [WebNN] Remove workarounds for TFLite backend (#23406)
8a821ff7f9 Bump vite from 6.0.7 to 6.0.11 in
/js/web/test/e2e/exports/testcases/vite-default (#23446)
220c1a203e Make ORT and Dawn use the same protobuf/abseil source code
(#23447)
b7b5792147 Change MacOS-13 to ubuntu on for
android-java-api-aar-test.yml. (#23444)
19d0d2a30f WIP: Dp4MatMulNBits accuracy level 4 matmul for WebGPU EP
(#23365)
95b8effbc4 [QNN EP]: Clean up QNN logging resources if an error occurs
during initialization (#23435)
626134c5b5 Bump clang-format from 19.1.6 to 19.1.7 (#23428)
0cf975301f Fix eigen external deps (#23439)
f9440aedce Moving RN_CI Android Testing to Linux (#23422)
1aa5902ff4 [QNN EP] workaround for QNN validation bug for Tanh with
uint16 quantized output (#23432)
7f5582a0e2 Seperate RN andriod and IOS into 2 separated Stages. (#23400)
73deac2e7f Implement some missing element wise Add/Sub/Mul/Div/Neg
operations for CPU and CUDA EPs (#23090)
949fe42af4 Upgrade Java version from react-native/android to Java 17
(#23066)
0892c23463 Update Qnn SDK default version to 2.30 (#23411)
94c099bcec Fix type cast build error (#23423)
d633e571d1 [WebNN EP] Fix AddInitializersToSkip issues (#23354)
e988ef00e2 [QNN EP] Fix regression for MatMul with two quantized/dynamic
uint16 inputs (#23419)
7538795f6b Update onnxruntime binary size checks ci pipeline's docker
image (#23405)
6c5ea41cad Revert "[QNN EP] Clean up correctly from a partial setup
(#23320)" (#23420)
e866804bbe Enable comprehension simplification in ruff rules (#23414)
0a5f1f392c bugfix: string_view of invalid memory (#23417)
4cc38e0277 fix crash when first input of BatchNormalization is 1-D
(#23387)
033441487f Target py310 and modernize codebase with ruff (#23401)
87341ac010 [QNN EP] Fix segfault when unregistering HTP shared memory
handles (#23402)
### Motivation and Context
This update includes the change to make QNN-EP a shared library.
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Changming Sun <chasun@microsoft.com>
Co-authored-by: Peishen Yan <peishen.yan@intel.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Hector Li <hecli@microsoft.com>
Co-authored-by: Jian Chen <cjian@microsoft.com>
Co-authored-by: Alexis Tsogias <1114095+Zyrin@users.noreply.github.com>
Co-authored-by: junchao-zhao <68935141+junchao-loongson@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: sushraja-msft <44513542+sushraja-msft@users.noreply.github.com>
Co-authored-by: Wanming Lin <wanming.lin@intel.com>
Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com>
Co-authored-by: Caroline Zhu <wolfivyaura@gmail.com>
### Description
This change fixes the WebGPU delay load test.
<details>
<summary>Fix UB in macro</summary>
The following C++ code outputs `2, 1` in MSVC, while it outputs `1, 1`
in GCC:
```c++
#include <iostream>
#define A 1
#define B 1
#define ENABLE defined(A) && defined(B)
#if ENABLE
int x = 1;
#else
int x = 2;
#endif
#if defined(A) && defined(B)
int y = 1;
#else
int y = 2;
#endif
int main()
{
std::cout << x << ", " << y << "\n";
}
```
Clang reports `macro expansion producing 'defined' has undefined
behavior [-Wexpansion-to-defined]`.
</details>
<details>
<summary>Fix condition of build option
onnxruntime_ENABLE_DELAY_LOADING_WIN_DLLS</summary>
Delay load is explicitly disabled when python binding is being built.
modifies the condition.
</details>
### Description
CMake's
[target_link_libraries](https://cmake.org/cmake/help/latest/command/target_link_libraries.html#id2)
function accepts plain library name(like `re2`) or target name(like
`re2::re2`) or some other kinds of names. "plain library names" are
old-fashioned, for compatibility only. We should use target names.
### Motivation and Context
To make vcpkg work with winml build. See #23158
### Description
<!-- Describe your changes. -->
Pre-packing is a feature, that allows kernels to re-arrange weights data
to gain performance at interference time
Currently, pre-packed blobs are shared when a cross-session weight
sharing is enabled and only for those weights that are marked as shared
by the user. Otherwise, data resides on the heap, the kernels own the
data which may be duplicated.
This change enables pre-packed data to be stored on disk alongside with
the external initializers.
The pre-packed blobs are memory mapped and are loaded into either the
X-session shared container
or a new container that shares pre-packed blobs within the session.
With the new approach, pre-packed blobs are always owned by the shared
container using the existing pre-pack mechanism for sharing. When
X-session sharing is enabled, then the external container owns the data.
A separate container owned by a root `SessionState` owns and shares the
data when X-session sharing is not enabled.
To facilitate this new approach, we introduce a new container that works
in two modes. When an optimized model is being saved, and pre-packed
weights saving is enabled, the new container will record pre-packed
blobs and serialize them to disk using existing
`ToGraphProtoWithExternalInitializers` function.
To externalize the pre-packed weights, we introduce a new session option
`kOrtSessionOptionsSavePrePackedConstantInitializers.` Note, that
pre-packing should be enabled (default) for this to work.
`ToGraphProtoWithExternalInitializers`function is modified to recurse
into subgraphs to make sure we properly account for local initializer
names.
In the second mode, the container would simply hold the pre-packed
weights memory-mapped from disk and share them with the kernels.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Reduce memory usage by pre-packed initializers and externalize them.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Enhancements to EPContext Operations:
1. Introduced support for the bfloat16 data type in EPContext operations.
2. Bug Fix: Missing Custom OP Schema Registration when generator EPContext ONNX model
---------
Co-authored-by: mingyue <mingyue@xilinx.com>
Co-authored-by: Hector Li <hecli@microsoft.com>
### Description
After the optimization of prefill time with #23102, it seems that always
using the tile matmulnibits with block_size = 32 can bring better
performance even for discrete gpu for phi3 model.
Phi3 becomes 42.64 tokens/sec from 32.82 tokens/sec in easy mode on my
NV RTX 2000 GPU.
### Description
This change allows the `WebGpuContext` class to be released after all
active inference sessions are released. This will cause:
- for default context (ID=0), the underlying `wgpu::Device` and
`wgpu::Adapter` to be released, together with all resources created by
the Device.
- for custom context (ID>0), the reference counts of passed in Instance,
Adapter and Device will decrement correctly.
### Description
<!-- Describe your changes. -->
Update CIs to TRT10.7
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
This change fixes the DLL delay load problem for the WebGPU EP and
DirectML EP. See detailed explanation below.
### Problem
When onnxruntime.dll uses delay loading for its dependencies, the
dependencies are loaded using `LoadLibraryEx()`, which search the
directory of process (.exe) instead of this library (onnxruntime.dll).
This is a problem for usages of Node.js binding and python binding,
because Windows will try to find the dependencies in the directory of
node.exe or python.exe, which is not the directory of onnxruntime.dll.
There was previous attempt to fix this by loading DirectML.dll in the
initialization of onnxruntime nodejs binding, which works for DML EP but
is not a good solution because it does not really "delay" the load.
For WebGPU, the situation became worse because webgpu_dawn.dll depends
on dxil.dll and dxcompiler.dll, which are explicitly dynamically loaded
in the code using `LoadLibraryA()`. This has the same problem of the DLL
search.
### Solutions
For onnxruntime.dll loading its direct dependencies, it can be resolved
by set the [`__pfnDliNotifyHook2`
hook](https://learn.microsoft.com/en-us/cpp/build/reference/understanding-the-helper-function?view=msvc-170#structure-and-constant-definitions)
to load from an absolute path that constructed from the onnxruntime.dll
folder and the DLL name.
For webgpu_dawn.dll loading dxil.dll and dxcompiler.dll, since they are
explicitly loaded in the code, the hook does not work. Instead, it can
be resolved by ~~using WIN32 API `SetDllDirectory()` to add the
onnxruntime.dll folder to the search path.~~ preloading the 2 DLLs from
the onnxruntime.dll folder .
### Description
This change fixes the build break for Node.js binding on latest
AppleClang:
```
...tensor_helper.cc:65:5 error: integer value -1 is outside of the valid range of values [0,15] for the enumeration type 'napi_typedarray_type' [-Wenum-constexpr-conversion]
```
Use the underlying type of enum `napi_typedarray_type` for
`DATA_TYPE_TYPEDARRAY_MAP` to solve this issue.
Because the underlying type is implementation defined (it's `int` for
MSVC and `unsigned int` for Clang), we use `std::underlying_type_t` to
get the correct type.
### Description
Previously we wanted to add DirectML EP to existing onnxruntime Windows
CUDA packages. After careful consideration, we will postpone the change.
This PR reverts some pipeline changes previously made by @mszhanyi and
@jchen351 .
### Description
* Update python version metadata to be in sync with latest python
packages (onnxruntime, onnxruntime-gpu and onnxruntime-qnn).
* Update black format target-version to 3.10, and use lintrunner to
format all files.
* Update the lintrunner installation command line to be consistent.
* Include `requirements-lintrunner.txt` in `requirements-dev.txt` to
avoid duplicated settings.
### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/22993
Python support by numpy:
https://numpy.org/neps/nep-0029-deprecation_policy.html#drop-schedule
```
On Apr 05, 2024 drop support for Python 3.9
On Apr 04, 2025 drop support for Python 3.10
```
This is the webgpu native ep implementation of #23092.
I used https://github.com/fs-eire/ort-webgpu-nodejs-chatapp-prototype to
test. Meanwhile, applied
https://github.com/fs-eire/ort-webgpu-nodejs-chatapp-prototype/pull/2 to
print the first token time.
The result is like below:
The latest main branch:
Intel Arc Graphics
```
659 tokens in 24.8sec, 26.57 tokens/sec
Decoding first token with input 449 tokens: 13.0 sec
Decoding remaining 210 tokens:
11.8 sec
17.79 tokens/sec
```
NV RTX 2000
```
659 tokens in 14.4sec, 45.85 tokens/sec
Decoding first token with input 449 tokens: 7.3 sec
Decoding remaining 210 tokens:
7.0 sec
29.81 tokens/sec
```
-------------------------------------------------------------------------
With this PR:
Intel Arc Graphics
```
657 tokens in 20.6sec, 31.92 tokens/sec
Decoding first token with input 449 tokens: 8.5 sec
Decoding remaining 208 tokens:
12.1 sec
17.23 tokens/sec
```
NV RTX 2000
```
659 tokens in 11.4sec, 57.93 tokens/sec
Decoding first token with input 449 tokens: 4.1 sec
Decoding remaining 210 tokens:
7.2 sec
28.98 tokens/sec
```
From above data, you can see that with this PR, both intel (13s -> 8.5s)
and NV (7.3s -> 4.1s) GPUs for the first token time are performing
better.
### Description
Those test cases start to fail for unknown reasons.
To unblock the CI, I disabled those tests temporarily to earn time to
investigate the root cause.
### Description
<!-- Describe your changes. -->
Add common interfaces for vitis ep profiler.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Vitis ep can collect and record api and kernel timestamps in file when
onnxruntime '-p' is enabled.
### Description
This PR fixes a deadlock bug in EigenNonBlockingThreadPool.h. It only happens on platforms with weakly ordered memory model, such as ARM64.
### Description
Add AttributeProto.release_s interface, which is used to obtain the
string in the attribute using move semantics instead of copying it
### Motivation and Context
The ep_context node stores a lot of information in attributes, which may
cause the memory usage to increase. Use this interface to avoid memory
waste
---------
Co-authored-by: GenMing Zhong <genmingz@xlnx.xilinx.com>
Co-authored-by: genmingz <genmingz@amd.com>
### Description
<!-- Describe your changes. -->
Fix a bug caused by potential out-of-bound reads of `W` in the
Conv2DMatMul shader.
### Motivation and Context
Fixes#22983
### Description
<!-- Describe your changes. -->
This patches Eigen source to remove an unused deprecated static var.
### Motivation and Context
Internal customer request.
### Description
OVEP development changes for ORT 1.21 Release
### Motivation and Context
- Has Critical Bug Fixes
- Improved Performance optimizations for both memory & inference latency
(https://github.com/intel/onnxruntime/pull/513)
- Enabled Model Compilation using NPUW
(https://github.com/intel/onnxruntime/pull/508)
- Fixed support for EPContext embed mode 0 for lower memory utilization
- Updated NuGet package name as `Intel.ML.OnnxRuntime.OpenVino`
- Fixed QDQ Stripping logic on NPU
### Description
This PR is a replacement of #21671. It offers a new way for accessing
the following:
- `ort.env.webgpu.adapter`:
- **deprecating**. There is no point to get the value of it. Once
`GPUDevice.adapterInfo` is supported, there is no point to set the value
too.
- `ort.env.webgpu.device`:
- set value of `GPUDevice` if user created it. Use at user's own risk.
- get value of `Promise<GPUDevice>`. if not exist, create a new one. if
exist return it.
- `ort.env.webgpu.powerPreference`:
- **deprecating**. encouraging users to set `ort.env.webgpu.device` if
necessary.
- `ort.env.webgpu.forceFallbackAdapter`:
- **deprecating**. encouraging users to set `ort.env.webgpu.device` if
necessary.
### Description
This change implements matmul4bits with tiling both for A and B. This is
beneficial for prefill scenarios on Intel integrated GPUs, because each
row of A has to run through the same set of shared rows of B. This
change should improve core occupancy and model_benchmark does indicate
improvements for prefill.
The same shader is not used for generation because when A has just a
single row, the other threads in the workgroup get unused and that hurts
performance.
```
-- Baseline run on an Alderlake GPU --
C:\onnxruntime>C:\model_benchmark\model_benchmark.exe -i C:\Phi-3.5-mini-instruct-onnx-web\Phi-3.5-mini-instruct-onnx-web -l 500
Batch size: 1, prompt tokens: 501, tokens to generate: 128
Prompt processing (time to first token):
avg (us): 1.72338e+07
avg (tokens/s): 29.0707 <<
p50 (us): 1.72548e+07
stddev (us): 57012.8
n: 5 * 501 token(s)
Token generation:
avg (us): 79227.5
avg (tokens/s): 12.6219
p50 (us): 79284.4
stddev (us): 2109.72
n: 635 * 1 token(s)
Token sampling:
avg (us): 15.8198
avg (tokens/s): 63211.8
p50 (us): 14.3
stddev (us): 8.67178
n: 640 * 1 token(s)
E2E generation (entire generation loop):
avg (ms): 27297.8
p50 (ms): 27269.8
stddev (ms): 89.4322
n: 5
Peak working set size (bytes): 5490987008
WebGPU device lost (2): Device was destroyed.
----------------------------------- With Prefill Optimization ----
C:\onnxruntime>C:\model_benchmark\model_benchmark.exe -i C:\Phi-3.5-mini-instruct-onnx-web\Phi-3.5-mini-instruct-onnx-web -l 500
Batch size: 1, prompt tokens: 501, tokens to generate: 128
Prompt processing (time to first token):
avg (us): 1.2135e+07
avg (tokens/s): 41.2856 <<
p50 (us): 1.21288e+07
stddev (us): 21282.1
n: 5 * 501 token(s)
Token generation:
avg (us): 78945.3
avg (tokens/s): 12.667
p50 (us): 78900.7
stddev (us): 2232.43
n: 635 * 1 token(s)
Token sampling:
avg (us): 20.5608
avg (tokens/s): 48636.3
p50 (us): 18.7
stddev (us): 19.0409
n: 640 * 1 token(s)
E2E generation (entire generation loop):
avg (ms): 22163.8
p50 (ms): 22160.1
stddev (ms): 31.3122
n: 5
Peak working set size (bytes): 5478862848
WebGPU device lost (2): Device was destroyed.
```
### Description
Change the implementation of BeamSearch op when using CUDA EP: in case
of T5 model, and in case the decoder input_ids are sequences, copy the
sequences device-to-device instead of host-to-device
### Motivation and Context
- Fixes#20667
A follow-up of [[WebNN] Support negative steps for
slice](https://github.com/microsoft/onnxruntime/pull/22871#discussion_r1847929774).
Slice op is emulated by reverse+slice when steps < 0 so
`SliceOpBuilder::HasSupportedInputsImpl()` should also check the
supported data types of reverse.
---------
Co-authored-by: Wanming Lin <wanming.lin@intel.com>
- Use `ANDROID` instead of `CMAKE_SYSTEM_NAME STREQUAL "Android"`.
- Put common gradle arguments into `COMMON_GRADLE_ARGS` to make them easier to reuse.
### Description
This PR only upgrade the gradle version and
`com.android.tools.build:gradle` version from build.gradle.
This only update the react-native library gradle version, not the e2e
test.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
We have use cases where multiple sessions are created concurrently.
Minimizing the usage of the default logger is important for these
scenarios.
Wire through the session logger to as many places as possible. The EP
logger can also be used once the session is created (can't be used
during EP construction/kernel registration but can be used in
GetCapability and Compile).
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Improve logging when there are concurrent sessions.
### Description
refactor unsquzee's implementation
add more flags to boost peformance.
add profile flag
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: jicwen <jicwen@YiMacBook-Pro.local>
Co-authored-by: wejoncy <wejoncy@.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>