### Description
Previously we wanted to add DirectML EP to existing onnxruntime Windows
CUDA packages. After careful consideration, we will postpone the change.
This PR reverts some pipeline changes previously made by @mszhanyi and
@jchen351.
### Description
* Update python version metadata to be in sync with latest python
packages (onnxruntime, onnxruntime-gpu and onnxruntime-qnn).
* Update black format target-version to 3.10, and use lintrunner to
format all files.
* Update the lintrunner installation command line to be consistent.
* Include `requirements-lintrunner.txt` in `requirements-dev.txt` to
avoid duplicated settings.
### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/22993
Python support by numpy:
https://numpy.org/neps/nep-0029-deprecation_policy.html#drop-schedule
```
On Apr 05, 2024 drop support for Python 3.9
On Apr 04, 2025 drop support for Python 3.10
```
This is the WebGPU native EP implementation of #23092.
I used https://github.com/fs-eire/ort-webgpu-nodejs-chatapp-prototype to
test, with
https://github.com/fs-eire/ort-webgpu-nodejs-chatapp-prototype/pull/2
applied to print the time to first token.
The results are below:
The latest main branch:
Intel Arc Graphics
```
659 tokens in 24.8sec, 26.57 tokens/sec
Decoding first token with input 449 tokens: 13.0 sec
Decoding remaining 210 tokens:
11.8 sec
17.79 tokens/sec
```
NV RTX 2000
```
659 tokens in 14.4sec, 45.85 tokens/sec
Decoding first token with input 449 tokens: 7.3 sec
Decoding remaining 210 tokens:
7.0 sec
29.81 tokens/sec
```
-------------------------------------------------------------------------
With this PR:
Intel Arc Graphics
```
657 tokens in 20.6sec, 31.92 tokens/sec
Decoding first token with input 449 tokens: 8.5 sec
Decoding remaining 208 tokens:
12.1 sec
17.23 tokens/sec
```
NV RTX 2000
```
659 tokens in 11.4sec, 57.93 tokens/sec
Decoding first token with input 449 tokens: 4.1 sec
Decoding remaining 210 tokens:
7.2 sec
28.98 tokens/sec
```
The data above shows that this PR reduces the time to first token on
both GPUs: Intel (13.0 s -> 8.5 s) and NV (7.3 s -> 4.1 s). Decoding
speed for the remaining tokens is roughly unchanged.
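The speedup implied by these figures can be checked with a few lines of arithmetic. This is a small sketch that uses only the numbers reported in this description; the function and variable names are illustrative and not part of the benchmark app.

```python
# Time to first token (seconds) for the 449-token prompt, copied from the
# results above as (latest main branch, with this PR).
PROMPT_TOKENS = 449
first_token_sec = {
    "Intel Arc Graphics": (13.0, 8.5),
    "NV RTX 2000": (7.3, 4.1),
}


def prefill_summary(before_sec, after_sec, prompt_tokens=PROMPT_TOKENS):
    """Return (speedup, prompt tokens/sec before, prompt tokens/sec after)."""
    speedup = before_sec / after_sec
    return speedup, prompt_tokens / before_sec, prompt_tokens / after_sec


for gpu, (before, after) in first_token_sec.items():
    speedup, tps_before, tps_after = prefill_summary(before, after)
    print(f"{gpu}: {speedup:.2f}x faster prefill "
          f"({tps_before:.1f} -> {tps_after:.1f} prompt tokens/sec)")
```

This works out to roughly 1.5x on the Intel GPU and 1.8x on the NV GPU for the prefill step.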
### Description
These test cases started to fail for unknown reasons.
To unblock the CI, I disabled them temporarily to buy time to
investigate the root cause.
### Description
Add common interfaces for the Vitis EP profiler.
### Motivation and Context
The Vitis EP can collect API and kernel timestamps and record them in a
file when the onnxruntime '-p' option is enabled.
### Description
This PR fixes a deadlock bug in EigenNonBlockingThreadPool.h. It only happens on platforms with a weakly ordered memory model, such as ARM64.
### Description
Add the AttributeProto.release_s interface, which obtains the string in
the attribute using move semantics instead of copying it.
### Motivation and Context
The ep_context node stores a lot of information in attributes, which may
increase memory usage. Use this interface to avoid wasting memory on
copies.
---------
Co-authored-by: GenMing Zhong <genmingz@xlnx.xilinx.com>
Co-authored-by: genmingz <genmingz@amd.com>
### Description
Fix a bug caused by potential out-of-bound reads of `W` in the
Conv2DMatMul shader.
### Motivation and Context
Fixes #22983
### Description
This patches Eigen source to remove an unused deprecated static var.
### Motivation and Context
Internal customer request.
### Description
OVEP development changes for ORT 1.21 Release
### Motivation and Context
- Includes critical bug fixes
- Performance optimizations for both memory usage and inference latency
(https://github.com/intel/onnxruntime/pull/513)
- Enabled Model Compilation using NPUW
(https://github.com/intel/onnxruntime/pull/508)
- Fixed support for EPContext embed mode 0 for lower memory utilization
- Updated NuGet package name as `Intel.ML.OnnxRuntime.OpenVino`
- Fixed QDQ Stripping logic on NPU
### Description
This PR replaces #21671. It offers a new way of accessing the
following:
- `ort.env.webgpu.adapter`:
  - **deprecating**. There is no reason to get its value. Once
    `GPUDevice.adapterInfo` is supported, there is no reason to set it
    either.
- `ort.env.webgpu.device`:
  - set: accepts a `GPUDevice` created by the user. Use at your own
    risk.
  - get: returns a `Promise<GPUDevice>`. If no device exists, a new one
    is created; otherwise the existing one is returned.
- `ort.env.webgpu.powerPreference`:
  - **deprecating**. Users are encouraged to set `ort.env.webgpu.device`
    if necessary.
- `ort.env.webgpu.forceFallbackAdapter`:
  - **deprecating**. Users are encouraged to set `ort.env.webgpu.device`
    if necessary.
### Description
This change implements matmul4bits with tiling both for A and B. This is
beneficial for prefill scenarios on Intel integrated GPUs, because each
row of A has to run through the same set of shared rows of B. This
change should improve core occupancy and model_benchmark does indicate
improvements for prefill.
The same shader is not used for generation because when A has just a
single row, the other threads in the workgroup go unused, which hurts
performance.
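The reuse argument can be illustrated with a plain Python sketch. It is not the shader from this change: it skips 4-bit dequantization and GPU workgroups entirely, and `matmul_tiled`, `tile_m` and `tile_n` are hypothetical names chosen for illustration. It only counts how many elements of B are fetched when several rows of A share each fetched strip of B.

```python
def matmul_tiled(a, b, tile_m=4, tile_n=4):
    """Multiply a (M x K) by b (K x N) one output tile at a time.

    Returns the product and the number of B elements that were fetched.
    """
    m, k, n = len(a), len(a[0]), len(b[0])
    out = [[0] * n for _ in range(m)]
    b_loads = 0
    for row0 in range(0, m, tile_m):
        rows = range(row0, min(row0 + tile_m, m))
        for col0 in range(0, n, tile_n):
            cols = range(col0, min(col0 + tile_n, n))
            for kk in range(k):
                # Fetch one strip of B once for the whole tile...
                b_strip = [b[kk][c] for c in cols]
                b_loads += len(b_strip)
                # ...and reuse it for every row of A in the tile.
                for r in rows:
                    a_val = a[r][kk]
                    for j, c in enumerate(cols):
                        out[r][c] += a_val * b_strip[j]
    return out, b_loads


# Small demo: 8 rows of A (a "prefill" shape) against a 3 x 5 B.
a = [[r * 3 + c for c in range(3)] for r in range(8)]
b = [[r * 5 + c + 1 for c in range(5)] for r in range(3)]
_, loads_tiled = matmul_tiled(a, b, tile_m=4, tile_n=2)
_, loads_per_row = matmul_tiled(a, b, tile_m=1, tile_n=2)
print(loads_tiled, loads_per_row)
```

With `tile_m=1` every row of A fetches all of B again, so the fetch count grows with the number of rows. With a single-row A (generation), a tile of several rows has nothing to share, which matches the reasoning above for keeping a separate shader.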
```
-- Baseline run on an Alderlake GPU --
C:\onnxruntime>C:\model_benchmark\model_benchmark.exe -i C:\Phi-3.5-mini-instruct-onnx-web\Phi-3.5-mini-instruct-onnx-web -l 500
Batch size: 1, prompt tokens: 501, tokens to generate: 128
Prompt processing (time to first token):
avg (us): 1.72338e+07
avg (tokens/s): 29.0707 <<
p50 (us): 1.72548e+07
stddev (us): 57012.8
n: 5 * 501 token(s)
Token generation:
avg (us): 79227.5
avg (tokens/s): 12.6219
p50 (us): 79284.4
stddev (us): 2109.72
n: 635 * 1 token(s)
Token sampling:
avg (us): 15.8198
avg (tokens/s): 63211.8
p50 (us): 14.3
stddev (us): 8.67178
n: 640 * 1 token(s)
E2E generation (entire generation loop):
avg (ms): 27297.8
p50 (ms): 27269.8
stddev (ms): 89.4322
n: 5
Peak working set size (bytes): 5490987008
WebGPU device lost (2): Device was destroyed.
----------------------------------- With Prefill Optimization ----
C:\onnxruntime>C:\model_benchmark\model_benchmark.exe -i C:\Phi-3.5-mini-instruct-onnx-web\Phi-3.5-mini-instruct-onnx-web -l 500
Batch size: 1, prompt tokens: 501, tokens to generate: 128
Prompt processing (time to first token):
avg (us): 1.2135e+07
avg (tokens/s): 41.2856 <<
p50 (us): 1.21288e+07
stddev (us): 21282.1
n: 5 * 501 token(s)
Token generation:
avg (us): 78945.3
avg (tokens/s): 12.667
p50 (us): 78900.7
stddev (us): 2232.43
n: 635 * 1 token(s)
Token sampling:
avg (us): 20.5608
avg (tokens/s): 48636.3
p50 (us): 18.7
stddev (us): 19.0409
n: 640 * 1 token(s)
E2E generation (entire generation loop):
avg (ms): 22163.8
p50 (ms): 22160.1
stddev (ms): 31.3122
n: 5
Peak working set size (bytes): 5478862848
WebGPU device lost (2): Device was destroyed.
```
### Description
Change the implementation of the BeamSearch op when using the CUDA EP:
for the T5 model, when the decoder input_ids are sequences, copy the
sequences device-to-device instead of host-to-device.
### Motivation and Context
- Fixes #20667
A follow-up of [[WebNN] Support negative steps for
slice](https://github.com/microsoft/onnxruntime/pull/22871#discussion_r1847929774).
The Slice op is emulated by reverse+slice when steps < 0, so
`SliceOpBuilder::HasSupportedInputsImpl()` should also check the
data types supported by reverse.
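The reverse+slice emulation can be sketched on a one-dimensional Python list. This is an illustration of the index mapping only, using Python slicing semantics as a stand-in; it does not call any WebNN or onnxruntime API, and `slice_with_negative_step` is a hypothetical helper name.

```python
def slice_with_negative_step(values, start, end, step):
    """Emulate values[start:end:step] for step < 0 via reverse + forward slice."""
    if step >= 0:
        raise ValueError("this sketch only covers negative steps")
    n = len(values)
    # Normalize and clamp start/end the same way Python slicing does.
    norm_start, norm_end, _ = slice(start, end, step).indices(n)
    # The "reverse" op: index i in the original becomes n - 1 - i.
    reversed_values = list(reversed(values))
    new_start = n - 1 - norm_start
    new_end = n - 1 - norm_end
    # The "slice" op, now with a positive step.
    return reversed_values[new_start:new_end:-step]


data = list(range(10))
print(slice_with_negative_step(data, 8, 2, -2))  # same as data[8:2:-2]
```

Because the data passes through a reverse first, any data type that reverse cannot handle cannot be supported by the emulated slice either, which is why the extra check is needed.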
---------
Co-authored-by: Wanming Lin <wanming.lin@intel.com>
- Use `ANDROID` instead of `CMAKE_SYSTEM_NAME STREQUAL "Android"`.
- Put common gradle arguments into `COMMON_GRADLE_ARGS` to make them easier to reuse.
### Description
This PR only upgrades the gradle version and the
`com.android.tools.build:gradle` version in build.gradle.
It only updates the gradle version of the react-native library, not the
e2e test.
### Motivation and Context
### Description
We have use cases where multiple sessions are created concurrently.
Minimizing the usage of the default logger is important for these
scenarios.
Wire through the session logger to as many places as possible. The EP
logger can also be used once the session is created (can't be used
during EP construction/kernel registration but can be used in
GetCapability and Compile).
### Motivation and Context
Improve logging when there are concurrent sessions.
### Description
- Refactor unsqueeze's implementation.
- Add more flags to boost performance.
- Add a profile flag.
### Motivation and Context
---------
Co-authored-by: jicwen <jicwen@YiMacBook-Pro.local>
Co-authored-by: wejoncy <wejoncy@.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
### Description
This PR adds the logic needed to consider only the needed implicit
inputs on BeamSearch op in case of T5 model (encoder/decoder, 2 graphs).
The logic added is similar to what happens in the _If_ kernel setup.
### Motivation and Context
Fixes #23043
### Description
- fix some missing end-of-version markers and since_version info
- fix include to use onnx_protobuf.h, which handles minimal builds
  - we should always prefer that header over directly using the onnx ones
### Motivation and Context
### Description
Some ops should allow empty tensors as input, e.g. the roi and scales
inputs of Resize.
### Motivation and Context
This avoids unexpected fallback for optional inputs that are empty
tensors. For example, roi and scales are both optional inputs of Resize;
in some models they have a non-empty name but an empty initializer
presented as `[0]`. WebNN currently falls back for all nodes with a 0
dimension, which is not expected.

### Description
The default thread count methodology in onnxruntime did not account for
new and upcoming Intel microarchitectures, leading to a suboptimal
thread count. Optimizing the thread count for new Intel
microarchitectures yields gains on the majority of models across
datatypes, with speedups of up to ~1.5x.
### Motivation and Context
Applications should run on Intel with the most performant thread
configuration for the majority of models. With new microarchitectures,
adjusting the thread count methodology is required to take advantage of
their differences.
### Description
Add fp16 kernel to rotary embedding to boost performance.
### Motivation and Context
Part of performance optimization work for group query attention
### Description
Enable the QNN HTP spill fill buffer setting to reduce RAM usage.
This feature is available after QNN 2.28. The QNN context binary needs
to be regenerated.
https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/htp_backend.html#qnn-htp-backend-api
Requirements:
1. Regenerate the Onnx model with the QNN context binary by setting the
EP option enable_htp_spill_fill_buffer = 1.
2. Works for a model with multiple context binaries. You need to
manually merge the 2 Onnx models with context binaries into 1 Onnx model.
3. Requires a Linux platform if the context binary is generated offline,
since the QnnSystem lib is not available for the Windows x86_64 platform.

No extra steps are needed when running model inference.
The generated EPContext node will have a max_size attribute with the
maximum spill fill buffer size for the context binary.
<img width="353" alt="image"
src="https://github.com/user-attachments/assets/a3bf48be-a8da-4381-8a1d-3f2558eea37d">
---------
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
### Description
This change uses a TypeScript trick to infer global types in
onnxruntime-common. Thanks to the strong type system of TypeScript, we
are able to refer to types that may not be available in the context.
This keeps onnxruntime-common free of dependencies like
"@webgpu/types" while still allowing it to use those types in its
declarations. See the comments on `TryGetGlobalType` in `type-helper.ts`.
Merge the util functions to create or retrieve:
- A WebNN constant MLOperand filled with the specified value, data type,
and shape.
- A WebNN scalar constant MLOperand with the specified value and data
type.
### Description
Increase the tolerance of the fp16 qnbitgemm unit tests and use fixed
seeds.
### Motivation and Context
### Description
#22380 removes the file
`tools/ci_build/github/linux/docker/inference/x86_64/python/cpu/scripts/requirements.txt`
but it is still used in `dockerfiles/Dockerfile.cuda`.
This change updates the file path of requirements.txt.
Fixes #22945.
### Description
### Motivation and Context