Commit graph

12135 commits

Yueqing Zhang
aedb49beb4
[VitisAI] change all support tensor type from ir 9 to ir 10 (#23204)
### Description
Changed all supported tensor types from IR 9 to IR 10.


### Motivation and Context
- See issue https://github.com/microsoft/onnxruntime/issues/23205

Co-authored-by: Yueqing Zhang <yueqingz@amd.com>
2025-01-02 06:45:21 -08:00
Yifan Li
bc91f5c72e
[TensorRT EP] Fix to build ORT on legacy TRT8.5 (#23215)
### Description
For legacy Jetson users on JetPack 5.x, the latest available TensorRT
version is 8.5.
Add version checks around newer TRT features to fix the build on
JetPack 5.x (CUDA 11.8 + GCC 11 are required).


2025-01-01 19:24:24 -08:00
xhcao
a3833a5e79
[js/webgpu] validate transpose perm if specified (#23197)
2025-01-01 15:58:54 -08:00
Dmitry Deshevoy
0b87bccca8
[CUDA] Make cubins const (#23225)
### Description
Make arrays with cubin data const.


### Motivation and Context
Non-const arrays are put into the .data section which might cause
excessive memory usage in some scenarios. Making cubin arrays const
allows them to be put into the .rodata section.
2024-12-31 16:20:21 -08:00
Changming Sun
afd3e81c94
Remove PostBuildCleanup (#23233)
Remove the PostBuildCleanup tasks since the task is deprecated. This
addresses a warning in our pipelines:

"Task 'Post Build Cleanup' version 3 (PostBuildCleanup@3) is dependent
on a Node version (6) that is end-of-life. Contact the extension owner
for an updated version of the task. Task maintainers should review Node
upgrade guidance: https://aka.ms/node-runner-guidance"

Now the cleanup is controlled in another place:

https://learn.microsoft.com/en-us/azure/devops/pipelines/yaml-schema/workspace?view=azure-pipelines


The code change was generated by the following Linux command:
```bash
find . -name \*.yml -exec sed -i '/PostBuildCleanup/,+2d' {} \;
```
2024-12-31 13:12:33 -08:00
Jean-Michaël Celerier
2116fd1999
Update onnxruntime_c_api.h to work with MinGW (#23169)
The SAL2 macros are not always available there.

### Description

Make SAL2 macros only available on MSVC.

### Motivation and Context

https://github.com/microsoft/onnxruntime/issues/1175
2024-12-31 11:05:10 -08:00
Changming Sun
69bb53db85
Enable delay loading hooker for python packages (#23227)
### Description
Enable delay loading hooker for python packages
2024-12-31 10:12:31 -08:00
wejoncy
86870114eb
[CoreML] support coreml model cache (#23065)
### Description
Refactor compute plan profiling

Support caching the compiled CoreML model to speed up session
initialization. This is only enabled via a user-provided entry, and the
user is responsible for managing the cache.


With the cache, session initialization time can be reduced by 50% or
more:
|model| before| after|
|--|--|--|
|yolo11.onnx| 0.6s|0.1s|
|yolo11-fp16.onnx|1.8s|0.1s|



---------

Co-authored-by: wejoncy <wejoncy@.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
2024-12-31 09:29:41 +08:00
Wanming Lin
2d05c4bcd9
[WebNN] Support SkipSimplifiedLayerNormalization op (#23151)
The algorithm of `SkipSimplifiedLayerNormalization` is quite similar to
`SimplifiedLayerNormalization`; the only difference is that
`SkipSimplifiedLayerNormalization` provides an additional output used
for calculating the sum of the input, skip and bias (if it exists).

BTW, this also fixes a bug in `SimplifiedLayerNormalization`: add the
bias if it exists.
2024-12-24 12:44:14 -08:00
liqun Fu
a9a881cc98
Integrate onnx 1.17.0 (#21897)
### Description
for ORT 1.21.0 release

Create following related issues to track skipped tests due to updated
ONNX operators in the ONNX 1.17.0 release:
https://github.com/microsoft/onnxruntime/issues/23162
https://github.com/microsoft/onnxruntime/issues/23164
https://github.com/microsoft/onnxruntime/issues/23163
https://github.com/microsoft/onnxruntime/issues/23161


---------

Signed-off-by: Liqun Fu <liqfu@microsoft.com>
Signed-off-by: Liqun Fu <liqun.fu@microsoft.com>
Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
Co-authored-by: Yifan Li <109183385+yf711@users.noreply.github.com>
Co-authored-by: yf711 <yifanl@microsoft.com>
2024-12-24 09:02:02 -08:00
Adrian Lizarraga
81cd6eacd0
[QNN EP] Fix multithread sync bug in ETW callback (#23156)
### Description

Fixes crash in QNN dlls when an ETW callback tries to change the QNN log
level. This is caused by a function that does not lock a mutex before
modifying the QNN log level.

### Motivation and Context
An ETW callback into QNN EP leads to a crash within QNN SDK dlls. It
happens approximately 1 out of 3 full QNN unit tests runs.

The cause is a multithreading synchronization bug in QNN EP. We're not
always locking a mutex when ETW calls QNN EP to notify of ETW config
change.
 
There are two branches in the QNN EP callback function that try to
update the QNN log handle. One branch correctly locks a mutex, but other
does not lock it at all. This causes crashes within QNN dlls.
- Does not lock mutex:
https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/qnn/qnn_execution_provider.cc#L426
- Locks mutex:
https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/qnn/qnn_execution_provider.cc#L442

The fix is to lock the mutex in both paths.
2024-12-23 10:02:04 -08:00
amancini-N
c6ba7edd83
Enable pointer-generator T5 models in BeamSearch (#23134)
### Description
Introduces a new optional input (encoder_input_ids) in the decoder
graph of the T5 implementation for BeamSearch. This allows usage of
pointer-generator networks in the decoder graph.

### Motivation and Context
- Fixes #23123
2024-12-22 21:30:49 -08:00
Yueqing Zhang
ebdbbb7531
[VitisAI] Int4 support (#22850)
### Description
1. Add support for throwing an error when the hardware is not supported
by VitisAI.
2. Add support for unloading the VitisAI EP.
3. Add API for Win25.

### Motivation and Context
This is a requirement for Win25.
2024-12-20 22:03:27 -08:00
Yulong Wang
6806174096
fix webgpu delay load test (#23157)
### Description

This change fixes the WebGPU delay load test.


<details>
<summary>Fix UB in macro</summary>

The following C++ code outputs `2, 1` in MSVC, while it outputs `1, 1`
in GCC:

```c++
#include <iostream>

#define A 1
#define B 1

#define ENABLE defined(A) && defined(B)

#if ENABLE
int x = 1;
#else
int x = 2;
#endif

#if defined(A) && defined(B)
int y = 1;
#else
int y = 2;
#endif

int main()
{
    std::cout << x << ", " << y << "\n";
}
```

Clang reports `macro expansion producing 'defined' has undefined
behavior [-Wexpansion-to-defined]`.

</details>

<details>
<summary>Fix condition of build option
onnxruntime_ENABLE_DELAY_LOADING_WIN_DLLS</summary>

Delay loading is explicitly disabled when the Python binding is being
built. This change modifies the condition accordingly.

</details>
2024-12-20 13:37:12 -08:00
Changming Sun
fcc34da5e9
Fix a tiny problem in winml.cmake (#23173)
### Description
CMake's
[target_link_libraries](https://cmake.org/cmake/help/latest/command/target_link_libraries.html#id2)
function accepts plain library names (like `re2`), target names (like
`re2::re2`), and some other kinds of names. Plain library names are
old-fashioned and kept for compatibility only; we should use target names.

### Motivation and Context
To make vcpkg work with winml build. See #23158
2024-12-20 11:48:43 -08:00
Dmitri Smirnov
00b262dbb4
Implement pre-packed blobs serialization on disk and their memory mapping on load (#23069)
### Description
Pre-packing is a feature that allows kernels to re-arrange weight data
to gain performance at inference time.

Currently, pre-packed blobs are shared when cross-session weight
sharing is enabled, and only for those weights that are marked as shared
by the user. Otherwise, the data resides on the heap, and the kernels
own the data, which may be duplicated.

This change enables pre-packed data to be stored on disk alongside
the external initializers.
The pre-packed blobs are memory mapped and are loaded into either the
X-session shared container
or a new container that shares pre-packed blobs within the session.

With the new approach, pre-packed blobs are always owned by the shared
container using the existing pre-pack mechanism for sharing. When
X-session sharing is enabled, then the external container owns the data.
A separate container owned by a root `SessionState` owns and shares the
data when X-session sharing is not enabled.

To facilitate this new approach, we introduce a new container that works
in two modes. When an optimized model is being saved, and pre-packed
weights saving is enabled, the new container will record pre-packed
blobs and serialize them to disk using existing
`ToGraphProtoWithExternalInitializers` function.

To externalize the pre-packed weights, we introduce a new session
option, `kOrtSessionOptionsSavePrePackedConstantInitializers`. Note that
pre-packing must be enabled (it is by default) for this to work.

The `ToGraphProtoWithExternalInitializers` function is modified to
recurse into subgraphs to make sure we properly account for local
initializer names.

In the second mode, the container would simply hold the pre-packed
weights memory-mapped from disk and share them with the kernels.

### Motivation and Context
Reduce memory usage by pre-packed initializers and externalize them.
2024-12-20 10:49:08 -08:00
xhcao
29bccad96d
[webgpu] fix compiling error (#23139)
2024-12-20 09:05:23 -08:00
mingyue
4aca8f33df
[Bug Fix] Missing CustomOp SchemaRegister when generator EPContext ONNX model (#23091)
### Description
Enhancements to EPContext operations:
1. Introduced support for the bfloat16 data type in EPContext operations.
2. Bug fix: missing custom op schema registration when generating an EPContext ONNX model.

---------

Co-authored-by: mingyue <mingyue@xilinx.com>
Co-authored-by: Hector Li <hecli@microsoft.com>
2024-12-19 16:47:13 -08:00
Jiajia Qin
7c782f6741
[webgpu] Always use tile matmulnbits for block_size = 32 (#23140)
### Description
After the prefill-time optimization in #23102, it seems that always
using the tiled MatMulNBits with block_size = 32 brings better
performance even on discrete GPUs for the Phi-3 model.

Phi-3 goes from 32.82 tokens/sec to 42.64 tokens/sec in easy mode on my
NV RTX 2000 GPU.
2024-12-19 16:22:53 -08:00
Yulong Wang
b4a6a0d511
[WebGPU EP] allows GPUDevice to be released after use (#23144)
### Description

This change allows the `WebGpuContext` class to be released after all
active inference sessions are released. This will cause:
- for default context (ID=0), the underlying `wgpu::Device` and
`wgpu::Adapter` to be released, together with all resources created by
the Device.
- for custom context (ID>0), the reference counts of passed in Instance,
Adapter and Device will decrement correctly.
2024-12-19 15:33:40 -08:00
Yifan Li
d9d07ad8ae
[TensorRT EP] support TensorRT 10.7-GA (#23011)
### Description
Update CIs to TRT 10.7.
2024-12-19 10:39:15 -08:00
Yifan Li
a3bb3f1487
[TensorRT EP] New CIs to test TRT+minimal CUDA build (#23028)
### Description
New CIs:
[Linux_TRT_Minimal_CUDA_Test_CI](https://dev.azure.com/onnxruntime/onnxruntime/_build?definitionId=230&_a=summary)
and
[Win_TRT_Minimal_CUDA_Test_CI](https://dev.azure.com/onnxruntime/onnxruntime/_build?definitionId=231).
These are configured to monitor whether ORT with the TensorRT EP builds
without issue against a minimal CUDA EP:
* The yaml content follows the Linux TRT CI yaml, with a different build
arg/cache name
* The build args follow [[TensorRT EP] Enable a minimal CUDA EP
compilation without
kernels](https://github.com/microsoft/onnxruntime/pull/19052#issuecomment-1888066851)



### Motivation and Context
Monitor whether users are able to build ORT with the TensorRT EP and a
minimal CUDA EP without any blocker
(the build takes ~30 min).
2024-12-19 10:30:39 -08:00
Yulong Wang
8680244ebc
Fix delay load for WebGPU EP and DML EP (#23111)
### Description

This change fixes the DLL delay load problem for the WebGPU EP and
DirectML EP. See detailed explanation below.

### Problem

When onnxruntime.dll uses delay loading for its dependencies, the
dependencies are loaded using `LoadLibraryEx()`, which searches the
directory of the process (.exe) instead of the directory of this library
(onnxruntime.dll). This is a problem for the Node.js and Python
bindings, because Windows will try to find the dependencies in the
directory of node.exe or python.exe, which is not the directory of
onnxruntime.dll.

There was a previous attempt to fix this by loading DirectML.dll in the
initialization of the onnxruntime Node.js binding, which works for the
DML EP but is not a good solution because it does not really "delay"
the load.

For WebGPU, the situation became worse because webgpu_dawn.dll depends
on dxil.dll and dxcompiler.dll, which are explicitly dynamically loaded
in the code using `LoadLibraryA()`. This has the same problem of the DLL
search.

### Solutions

For onnxruntime.dll loading its direct dependencies, this can be
resolved by setting the [`__pfnDliNotifyHook2`
hook](https://learn.microsoft.com/en-us/cpp/build/reference/understanding-the-helper-function?view=msvc-170#structure-and-constant-definitions)
to load from an absolute path constructed from the onnxruntime.dll
folder and the DLL name.

For webgpu_dawn.dll loading dxil.dll and dxcompiler.dll, since they are
explicitly loaded in the code, the hook does not work. Instead, it can
be resolved by ~~using the WIN32 API `SetDllDirectory()` to add the
onnxruntime.dll folder to the search path~~ preloading the 2 DLLs from
the onnxruntime.dll folder.
2024-12-19 10:23:48 -08:00
Yulong Wang
780735098d
[nodejs binding] Fix building in latest clang (#23146)
### Description

This change fixes the build break for Node.js binding on latest
AppleClang:

```
...tensor_helper.cc:65:5 error: integer value -1 is outside of the valid range of values [0,15] for the enumeration type 'napi_typedarray_type' [-Wenum-constexpr-conversion]

```

Use the underlying type of enum `napi_typedarray_type` for
`DATA_TYPE_TYPEDARRAY_MAP` to solve this issue.

Because the underlying type is implementation defined (it's `int` for
MSVC and `unsigned int` for Clang), we use `std::underlying_type_t` to
get the correct type.
2024-12-19 10:23:27 -08:00
Yulong Wang
ae6dcc839e
Revert "[js/webgpu] disable failed tests temporarily (#23127)" (#23130)
### Description

This reverts commit 9115682d69.

2024-12-18 18:07:50 -08:00
Prathik Rao
31e6e1010c
gather elements webgpu implementation (#23137)
Increases operator coverage for WebGPU EP.
2024-12-18 16:29:26 -08:00
Changming Sun
5d7030e4c6
Revert DML pipeline changes (#23135)
### Description
Previously we wanted to add DirectML EP to existing onnxruntime Windows
CUDA packages. After careful consideration, we will postpone the change.
This PR reverts some pipeline changes previously made by @mszhanyi and
@jchen351 .
2024-12-18 10:42:10 -08:00
Changming Sun
e76bd2f5e9
Update CODEOWNERS: remove onnxruntime-es (#21677)
Removing this restriction for now.
2024-12-17 13:39:13 -08:00
Wanming Lin
a5b60ec03f
[WebNN] Add limit to QDQ ops (#23076)
WebNN requires the `scale_shape` to be a subsample of the `input_shape`.
2024-12-17 12:52:08 -08:00
Enrico Galli
54edb43e77
[WebNN] Fixes MLTensor caching across different contexts (#23100)
We weren't checking that MLTensors were from the same context before
reusing them.

Found while debugging microsoft/webnn-developer-preview#69
2024-12-17 12:51:16 -08:00
Tianlei Wu
5afab787db
Update python version metadata (remove 3.7, 3.8, 3.9; add 3.13). (#23067)
### Description

* Update python version metadata to be in sync with latest python
packages (onnxruntime, onnxruntime-gpu and onnxruntime-qnn).
* Update black format target-version to 3.10, and use lintrunner to
format all files.
* Update the lintrunner installation command line to be consistent.
* Include `requirements-lintrunner.txt` in `requirements-dev.txt` to
avoid duplicated settings.

### Motivation and Context

https://github.com/microsoft/onnxruntime/issues/22993

Python support by numpy:
https://numpy.org/neps/nep-0029-deprecation_policy.html#drop-schedule
```
On Apr 05, 2024 drop support for Python 3.9
On Apr 04, 2025 drop support for Python 3.10
```
2024-12-17 10:59:20 -08:00
Jiajia Qin
0981bbf4ca
[webgpu] Optimize matmulnbits with M > 1 (#23102)
This is the webgpu native ep implementation of #23092.

I used https://github.com/fs-eire/ort-webgpu-nodejs-chatapp-prototype to
test. Meanwhile, applied
https://github.com/fs-eire/ort-webgpu-nodejs-chatapp-prototype/pull/2 to
print the first token time.

The result is like below:
The latest main branch:
Intel Arc Graphics
```
659 tokens in 24.8sec, 26.57 tokens/sec
    Decoding first token with input 449 tokens: 13.0 sec
    Decoding remaining 210 tokens:
        11.8 sec
        17.79 tokens/sec
```
NV RTX 2000
```
659 tokens in 14.4sec, 45.85 tokens/sec
    Decoding first token with input 449 tokens: 7.3 sec
    Decoding remaining 210 tokens:
        7.0 sec
        29.81 tokens/sec
```

-------------------------------------------------------------------------
With this PR:
Intel Arc Graphics
```
657 tokens in 20.6sec, 31.92 tokens/sec
    Decoding first token with input 449 tokens: 8.5 sec
    Decoding remaining 208 tokens:
        12.1 sec
        17.23 tokens/sec
```
NV RTX 2000
```
659 tokens in 11.4sec, 57.93 tokens/sec
    Decoding first token with input 449 tokens: 4.1 sec
    Decoding remaining 210 tokens:
        7.2 sec
        28.98 tokens/sec
```

From the data above, you can see that with this PR the first-token time
improves on both Intel (13 s -> 8.5 s) and NV (7.3 s -> 4.1 s) GPUs.
2024-12-16 20:47:40 -08:00
Yulong Wang
9115682d69
[js/webgpu] disable failed tests temporarily (#23127)
### Description


Those test cases started to fail for unknown reasons.

To unblock the CI, I disabled those tests temporarily to buy time to
investigate the root cause.
2024-12-16 15:35:47 -08:00
Dmitri Smirnov
ae97068137
Fix Pybind memory leak (#23105)
### Description
Array GETITEM returns a new reference, which was being leaked.


### Motivation and Context
Address  https://github.com/microsoft/onnxruntime/issues/22271
2024-12-16 10:38:23 -08:00
tianf-fff
a4eb8f27b6
[VitisAI] Add profiler interface for vitisai (#23032)
### Description
Add common interfaces for the VitisAI EP profiler.


### Motivation and Context
The VitisAI EP can collect and record API and kernel timestamps in a
file when onnxruntime's '-p' option is enabled.
2024-12-16 09:09:48 -08:00
Changming Sun
2ff66b80e0
Fix a deadlock bug in EigenNonBlockingThreadPool.h (#23098)
### Description
This PR fixes a deadlock bug in EigenNonBlockingThreadPool.h. It only happens on platforms with weakly ordered memory model, such as ARM64.
2024-12-16 09:05:12 -08:00
Yulong Wang
3a0b958586
add 2 CMake build options of Dawn (#23096)
### Description

This change adds the following CMake build options for Dawn:
- onnxruntime_BUILD_DAWN_MONOLITHIC_LIBRARY
  - OFF by default
  - when enabled, builds Dawn as a monolithic library (webgpu_dawn.dll)
- onnxruntime_ENABLE_DAWN_BACKEND_VULKAN
  - OFF by default
  - when enabled, build with Vulkan backend for Dawn on Windows
- onnxruntime_ENABLE_DAWN_BACKEND_D3D12
  - ON by default
  - when enabled, build with DirectX 12 backend for Dawn on Windows



### File Size Comparison (Windows)

|  Build | cmdline  |  File Size  |
|---|---|---|
| Baseline | --config Release<br/> --build_shared_lib | `12,755,456
onnxruntime.dll` |
| WebGPU D3D12 (default) | --use_webgpu<br/> --config Release<br/>
--build_shared_lib | `17,082,368 dxcompiler.dll`<br/>` 1,508,472
dxil.dll`<br/>`18,708,480 onnxruntime.dll` |
| WebGPU D3D12+Vulkan | --use_webgpu<br/> --config Release<br/>
--build_shared_lib<br/> --cmake_extra_defines<br/>
onnxruntime_ENABLE_DAWN_BACKEND_D3D12=1<br/>
onnxruntime_ENABLE_DAWN_BACKEND_VULKAN=1 | `17,081,344
dxcompiler.dll`<br/>` 1,508,472 dxil.dll`<br/>`19,388,416
onnxruntime.dll` |
| WebGPU Vulkan | --use_webgpu<br/> --config Release<br/>
--build_shared_lib<br/> --cmake_extra_defines<br/>
onnxruntime_ENABLE_DAWN_BACKEND_D3D12=0<br/>
onnxruntime_ENABLE_DAWN_BACKEND_VULKAN=1 | `17,615,872 onnxruntime.dll`
|
| Monolithic | --use_webgpu<br/> --config Release<br/>
--build_shared_lib<br/> --cmake_extra_defines<br/>
onnxruntime_BUILD_DAWN_MONOLITHIC_LIBRARY=1 | `17,082,368
dxcompiler.dll`<br/>` 1,508,472 dxil.dll`<br/>`13,277,696
onnxruntime.dll`<br/>` 5,616,640 webgpu_dawn.dll` |
| External Dawn | --use_webgpu<br/> --config Release<br/>
--build_shared_lib<br/> --cmake_extra_defines<br/>
onnxruntime_USE_EXTERNAL_DAWN=1<br/> --skip_tests | `17,081,344
dxcompiler.dll`<br/>` 1,508,472 dxil.dll`<br/>`13,277,184
onnxruntime.dll`
2024-12-13 16:05:48 -08:00
genmingz@AMD
62e7e24f17
Add attrProto.release_s interface (#22977)
### Description
Add the AttributeProto.release_s interface, which is used to obtain the
string in the attribute using move semantics instead of copying it.



### Motivation and Context
The ep_context node stores a lot of information in attributes, which may
cause the memory usage to increase. Using this interface avoids wasting
memory.

---------

Co-authored-by: GenMing Zhong <genmingz@xlnx.xilinx.com>
Co-authored-by: genmingz <genmingz@amd.com>
2024-12-12 21:13:43 -08:00
Hector Li
2a36fd4f6e
Fix the ctx_gen tool to make sure all generated ctx.onnx have max_size (#23097)
### Description
Fix the qnn_ctx_gen tool to make sure all generated ctx.onnx have max_size
2024-12-12 21:12:02 -08:00
Hector Li
f43f40facf
Backward compatible with old QNN version (#23095)
### Description
Make the QNN EP compatible with older QNN versions.
2024-12-12 17:04:20 -08:00
Yulong Wang
01539ee7ab
[js/webgpu] fix Conv2DMatMul shader's out-of-bound read (#23085)
### Description

Fix a bug caused by potential out-of-bound reads of `W` in the
Conv2DMatMul shader.

### Motivation and Context

Fixes #22983
2024-12-12 11:33:53 -08:00
Dmitri Smirnov
890a719c91
Remove deprecated static from Eigen that contributes to size increase (#23084)
### Description
This patches Eigen source to remove an unused deprecated static var.

### Motivation and Context
Internal customer request.
2024-12-12 10:19:47 -08:00
Ankit Maheshkar
1f88284f96
OVEP 1.21.0 Development Updates (#23080)
### Description
OVEP development changes for ORT 1.21 Release
 
 
### Motivation and Context
- Has Critical Bug Fixes
- Improved Performance optimizations for both memory & inference latency
(https://github.com/intel/onnxruntime/pull/513)
- Enabled Model Compilation using NPUW
(https://github.com/intel/onnxruntime/pull/508)
- Fixed support for EPContext embed mode 0 for lower memory utilization
- Updated NuGet package name as `Intel.ML.OnnxRuntime.OpenVino`
- Fixed QDQ Stripping logic on NPU
2024-12-11 22:26:32 -08:00
Hector Li
ebb968d34a
disable the EP context embed model by default in session option (#23070)
change the default value for session option ep.context_embed_mode to 0 to avoid the model loading memory overhead
2024-12-11 17:26:29 -08:00
Yulong Wang
e605870783
[js/web] Update API for ort.env.webgpu (#23026)
### Description

This PR is a replacement of #21671. It offers a new way of accessing
the following:
- `ort.env.webgpu.adapter`:
  - **deprecating**. There is no point in getting its value, and once
`GPUDevice.adapterInfo` is supported, there is no point in setting it
either.
- `ort.env.webgpu.device`:
  - set: accepts a user-created `GPUDevice`. Use at the user's own risk.
  - get: returns a `Promise<GPUDevice>`; if none exists, a new one is
created, otherwise the existing one is returned.
- `ort.env.webgpu.powerPreference`:
  - **deprecating**. Users are encouraged to set `ort.env.webgpu.device`
if necessary.
- `ort.env.webgpu.forceFallbackAdapter`:
  - **deprecating**. Users are encouraged to set `ort.env.webgpu.device`
if necessary.
2024-12-11 10:24:14 -08:00
sushraja-msft
8800830a44
Implement 2d tiled matmulnbits specialized for prefill (#23058)
### Description
This change implements matmul4bits with tiling both for A and B. This is
beneficial for prefill scenarios on Intel integrated GPUs, because each
row of A has to run through the same set of shared rows of B. This
change should improve core occupancy and model_benchmark does indicate
improvements for prefill.

The same shader is not used for generation because when A has just a
single row, the other threads in the workgroup go unused, which hurts
performance.

```
-- Baseline run on an Alderlake GPU --

C:\onnxruntime>C:\model_benchmark\model_benchmark.exe -i C:\Phi-3.5-mini-instruct-onnx-web\Phi-3.5-mini-instruct-onnx-web -l 500
Batch size: 1, prompt tokens: 501, tokens to generate: 128
Prompt processing (time to first token):
        avg (us):       1.72338e+07
        avg (tokens/s): 29.0707                          << 
        p50 (us):       1.72548e+07
        stddev (us):    57012.8
        n:              5 * 501 token(s)
Token generation:
        avg (us):       79227.5
        avg (tokens/s): 12.6219
        p50 (us):       79284.4
        stddev (us):    2109.72
        n:              635 * 1 token(s)
Token sampling:
        avg (us):       15.8198
        avg (tokens/s): 63211.8
        p50 (us):       14.3
        stddev (us):    8.67178
        n:              640 * 1 token(s)
E2E generation (entire generation loop):
        avg (ms):       27297.8
        p50 (ms):       27269.8
        stddev (ms):    89.4322
        n:              5
Peak working set size (bytes): 5490987008
WebGPU device lost (2): Device was destroyed.

----------------------------------- With Prefill Optimization ----

C:\onnxruntime>C:\model_benchmark\model_benchmark.exe -i C:\Phi-3.5-mini-instruct-onnx-web\Phi-3.5-mini-instruct-onnx-web -l 500                                                                                                                                                               
Batch size: 1, prompt tokens: 501, tokens to generate: 128
Prompt processing (time to first token):
        avg (us):       1.2135e+07
        avg (tokens/s): 41.2856                                 << 
        p50 (us):       1.21288e+07
        stddev (us):    21282.1
        n:              5 * 501 token(s)
Token generation:
        avg (us):       78945.3
        avg (tokens/s): 12.667
        p50 (us):       78900.7
        stddev (us):    2232.43
        n:              635 * 1 token(s)
Token sampling:
        avg (us):       20.5608
        avg (tokens/s): 48636.3
        p50 (us):       18.7
        stddev (us):    19.0409
        n:              640 * 1 token(s)
E2E generation (entire generation loop):
        avg (ms):       22163.8
        p50 (ms):       22160.1
        stddev (ms):    31.3122
        n:              5
Peak working set size (bytes): 5478862848
WebGPU device lost (2): Device was destroyed.
```
2024-12-10 17:07:11 -08:00
amancini-N
d8de3c4096
[CUDA EP] Fix BeamSearch on T5 with sequence_as_input_ids (#20667) (#20668)
### Description
Change the implementation of BeamSearch op when using CUDA EP: in case
of T5 model, and in case the decoder input_ids are sequences, copy the
sequences device-to-device instead of host-to-device

### Motivation and Context
- Fixes #20667
2024-12-10 16:20:47 -08:00
shiyi
02f0af0d08
[WebNN] Improve data type check of slice op (#22988)
A follow-up of [[WebNN] Support negative steps for
slice](https://github.com/microsoft/onnxruntime/pull/22871#discussion_r1847929774).
Slice op is emulated by reverse+slice when steps < 0 so
`SliceOpBuilder::HasSupportedInputsImpl()` should also check the
supported data types of reverse.

---------

Co-authored-by: Wanming Lin <wanming.lin@intel.com>
2024-12-10 15:48:16 -08:00
Edward Chen
fa6ad202aa
Minor updates to onnxruntime_java.cmake (#23068)
- Use `ANDROID` instead of `CMAKE_SYSTEM_NAME STREQUAL "Android"`.
- Put common gradle arguments into `COMMON_GRADLE_ARGS` to make them easier to reuse.
2024-12-10 15:44:36 -08:00
Jiajia Qin
defcc4f819
[webgpu] Optimize Expand (#23052)
### Description
Use components = 4 if possible.

This is the webgpu native implementation from #22752
2024-12-10 14:58:57 -08:00