Commit graph

12166 commits

Author SHA1 Message Date
Scott McKay
04ff0ceeed Merge 2025-01-18 11:24:02 +10:00
Scott McKay
e915f01b41 Merge 2025-01-18 11:22:58 +10:00
Scott McKay
a52e268613 Fix 2 more x86 issues 2025-01-18 09:40:27 +10:00
Scott McKay
e84eb00af1 Fix x86 error 2025-01-17 19:10:43 +10:00
Scott McKay
5db0b520c4 Fix x86 build 2025-01-16 20:59:40 +10:00
Scott McKay
453f13a2b5 Address PR comments
Add unit tests
2025-01-16 19:43:46 +10:00
Scott McKay
45d5906358 Merge 2025-01-14 10:21:06 +10:00
Scott McKay
0e145e0d0b Tweak comment 2025-01-08 07:45:52 +10:00
Scott McKay
0dcf0864d3 Update test to use 128 bytes for initializer so it can be allocated externally. 2025-01-07 18:54:38 +10:00
Scott McKay
d360f76626 Merge 2025-01-07 16:14:18 +10:00
Scott McKay
347bd7a3f2 Take ownership of node attributes for consistency
Updates comments for clarity.
Copy external data into initializer when saving model for debugging.
2025-01-07 16:11:35 +10:00
Changming Sun
704523c2d8
[build] Be compatible with the latest protobuf (#23260)
Resolve #21308
2025-01-06 13:10:43 -08:00
Changming Sun
c6cbda3257
Update Python-Cuda-Publishing-Pipeline (#23253)
### Description
1. Currently Python-Cuda-Publishing-Pipeline only publishes Linux
wheels, not Windows wheels, because we recently refactored the
upstream pipeline ("Python-CUDA-Packaging-Pipeline") to use 1ES PT. This
PR fixes the issue.
2. tools/ci_build/github/azure-pipelines/stages/py-win-gpu-stage.yml no
longer includes component-governance-component-detection-steps.yml,
because 1ES PT already inserts that step.
3. Delete tools/ci_build/github/windows/eager/requirements.txt because
it is no longer used.

### Motivation and Context
The "Python-CUDA-Packaging-Pipeline" is for CUDA 12.
"Python CUDA ALT Packaging Pipeline" is for CUDA 11.

The two pipelines are very similar, except the CUDA versions are
different.
Each of them has three parts: build, test, publish.
"Python-CUDA-Packaging-Pipeline" is the first part: build.
"Python CUDA12 Package Test Pipeline" is the second part.
"Python-Cuda-Publishing-Pipeline" is the third part that publishes the
packages to an internal ADO feed.
2025-01-06 11:50:58 -08:00
Yulong Wang
c53c9caf17
[js] update mocha to v11.0.1 (#23254)
### Description

Update `mocha` to v11.0.1 and `fs-extra` to v11.2.0

```
# npm audit report

nanoid  <3.3.8
Severity: moderate
Predictable results in nanoid generation when given non-integer values - https://github.com/advisories/GHSA-mwcw-c2x4-8c55
fix available via `npm audit fix`
node_modules/nanoid
  mocha  8.2.0 - 10.2.0
  Depends on vulnerable versions of nanoid
  node_modules/mocha

2 moderate severity vulnerabilities
```
2025-01-05 22:29:02 -08:00
Yulong Wang
21b4d2ac9f
fix pipeline build-perf-test-binaries (#23255) 2025-01-05 22:28:41 -08:00
Wu, Junze
2a16ad0215
[js/node] add proxy agent support for onnxruntime-node install script (#23232)
### Description
Add proxy agent to fetch request



### Motivation and Context
Fixes #23231

---------

Signed-off-by: Junze Wu <junze.wu@intel.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
2025-01-04 20:27:55 -08:00
Changming Sun
b7ef81a034
Move Linux GPU CI pipeline to A10 (#23235)
Move Linux GPU CI pipeline to A10 machines which are more advanced.
Retire onnxruntime-Linux-GPU-T4 machine pool.
Disable run_lean_attention test because the new machines do not have
enough shared memory.

```
skip loading trt attention kernel fmha_mhca_fp16_128_256_sm86_kernel because no enough shared memory
[E:onnxruntime:, sequential_executor.cc:505 ExecuteKernel] Non-zero status code returned while running MultiHeadAttention node. Name:'MultiHeadAttention_0' Status Message: CUDA error cudaErrorInvalidValue:invalid argument
```
2025-01-04 19:11:37 -08:00
Jiajia Qin
4247153bb2
[webgpu] Add kernel type to profile info (#23167)
### Description
This PR makes it easier to post-process the JSON file generated
when profiling is enabled. The kernel type can be used to aggregate the
total time of all kernels of the same type.
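
As a rough illustration of that aggregation, the sketch below sums durations per kernel type. `ProfileEntry`, its field names, and `TotalTimeByKernelType` are hypothetical, and parsing the profiling JSON is left out; the real file layout may differ.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical in-memory form of one profiling record. The field names are
// assumptions for this sketch, not the actual JSON keys.
struct ProfileEntry {
  std::string kernel_type;
  double duration_us;
};

// Sums the duration of every record that shares a kernel type.
std::map<std::string, double> TotalTimeByKernelType(
    const std::vector<ProfileEntry>& entries) {
  std::map<std::string, double> totals;
  for (const ProfileEntry& entry : entries) {
    totals[entry.kernel_type] += entry.duration_us;
  }
  return totals;
}
```

Using `std::map` keeps the result sorted by kernel type, which is convenient when printing a summary.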
2025-01-03 14:28:48 -08:00
Yulong Wang
5c2e60c5af
[js/node] update install script to allow use proxy (#23242)
### Description

Use `https.get` instead of `fetch` in ORT Nodejs binding package install
script.

### Motivation and Context

According to discussions in #23232, the package `global-agent` cannot
work with `fetch` API. To make it work with the proxy agent, this PR
replaces the `fetch` API with `https.get` in the install script.
2025-01-03 14:27:15 -08:00
Changming Sun
5d692b0136
Merge web machine pools (#23243)
### Description
The Web CI pipeline uses three different Windows machine pools:
1. onnxruntime-Win2022-webgpu-A10
2. onnxruntime-Win2022-VS2022-webgpu-A10
3. onnxruntime-Win-CPU-2022-web

This PR merges them together to reduce ongoing maintenance cost.
2025-01-03 13:53:17 -08:00
Yueqing Zhang
aedb49beb4
[VitisAI] change all support tensor type from ir 9 to ir 10 (#23204)
### Description
Changed all supported tensor types from IR 9 to IR 10.


### Motivation and Context
- See issue https://github.com/microsoft/onnxruntime/issues/23205

Co-authored-by: Yueqing Zhang <yueqingz@amd.com>
2025-01-02 06:45:21 -08:00
Yifan Li
bc91f5c72e
[TensorRT EP] Fix to build ORT on legacy TRT8.5 (#23215)
### Description
For legacy Jetson users on JetPack 5.x, the latest TRT version is
8.5.
Add version checks around newer TRT features to fix the build on JetPack 5.x
(cuda11.8+gcc11 are required).


### Motivation and Context
2025-01-01 19:24:24 -08:00
xhcao
a3833a5e79
[js/webgpu] validate transpose perm if specified (#23197)
### Description



### Motivation and Context
2025-01-01 15:58:54 -08:00
Dmitry Deshevoy
0b87bccca8
[CUDA] Make cubins const (#23225)
### Description
Make arrays with cubin data const.


### Motivation and Context
Non-const arrays are put into the .data section which might cause
excessive memory usage in some scenarios. Making cubin arrays const
allows them to be put into the .rodata section.
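
The difference can be sketched with two embedded byte arrays. The array names and contents below are hypothetical placeholders, and section placement is ultimately up to the toolchain; the comments describe what compilers typically do.

```cpp
#include <cstddef>

// Hypothetical stand-ins for embedded cubin data; the names and bytes are
// illustrative only and are not taken from the ONNX Runtime sources.

// Non-const: the array is writable, so toolchains typically place it in
// the .data section, which is private to each process.
unsigned char example_cubin_mutable[] = {0x7f, 0x45, 0x4c, 0x46};

// Const: the array is read-only, so toolchains can typically place it in
// the .rodata section.
const unsigned char example_cubin_const[] = {0x7f, 0x45, 0x4c, 0x46};

// Size of the read-only blob, computed at compile time.
constexpr std::size_t ExampleCubinSize() { return sizeof(example_cubin_const); }
```

On Linux, a tool such as `objdump -t` can be used to check which section each symbol actually landed in.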
2024-12-31 16:20:21 -08:00
Changming Sun
afd3e81c94
Remove PostBuildCleanup (#23233)
Remove PostBuildCleanup tasks since it is deprecated. It is to address a
warning in our pipelines:

"Task 'Post Build Cleanup' version 3 (PostBuildCleanup@3) is dependent
on a Node version (6) that is end-of-life. Contact the extension owner
for an updated version of the task. Task maintainers should review Node
upgrade guidance: https://aka.ms/node-runner-guidance"

Now the cleanup is controlled in another place:

https://learn.microsoft.com/en-us/azure/devops/pipelines/yaml-schema/workspace?view=azure-pipelines


The code change was generated by the following Linux command:
```bash
find . -name \*.yml -exec sed -i '/PostBuildCleanup/,+2d' {} \;
```
2024-12-31 13:12:33 -08:00
Jean-Michaël Celerier
2116fd1999
Update onnxruntime_c_api.h to work with MinGW (#23169)
The SAL2 macros are not always available there

### Description

Make SAL2 macros only available on MSVC.
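
A minimal sketch of this kind of guard is shown below. The `EXAMPLE_IN` and `EXAMPLE_OUT` macros and the `CopyValue` function are hypothetical and are not the names used in `onnxruntime_c_api.h`; the sketch assumes that `_MSC_VER` is defined only by MSVC-compatible compilers.

```cpp
// Use the SAL annotations only when building with MSVC, where <sal.h> is
// available. Elsewhere (for example MinGW) the macros expand to nothing.
#if defined(_MSC_VER)
#include <sal.h>
#define EXAMPLE_IN _In_
#define EXAMPLE_OUT _Out_
#else
#define EXAMPLE_IN
#define EXAMPLE_OUT
#endif

// Hypothetical annotated function: the annotations document intent for
// static analysis on MSVC and have no effect on other compilers.
inline void CopyValue(EXAMPLE_IN const int* src, EXAMPLE_OUT int* dst) {
  *dst = *src;
}
```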

### Motivation and Context

https://github.com/microsoft/onnxruntime/issues/1175
2024-12-31 11:05:10 -08:00
Changming Sun
69bb53db85
Enable delay loading hooker for python packages (#23227)
### Description
Enable delay loading hooker for python packages
2024-12-31 10:12:31 -08:00
wejoncy
86870114eb
[CoreML] support coreml model cache (#23065)
### Description
Refactor compute plan profiling

Support caching the CoreML model to speed up session initialization. This is
only supported through a user-provided entry, and the user is responsible for
managing the cache.


With the cache, session initialization time can be reduced by 50% or
more:
|model| before| after|
|--|--|--|
|yolo11.onnx| 0.6s|0.1s|
|yolo11-fp16.onnx|1.8s|0.1s|


### Motivation and Context

---------

Co-authored-by: wejoncy <wejoncy@.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
2024-12-31 09:29:41 +08:00
Scott McKay
6fb01c19a7 Remove temp debug code 2024-12-30 12:01:41 +10:00
Scott McKay
019edc9264 Fix minimal build.
Fix some more old 'graph api' naming
2024-12-30 11:15:03 +10:00
Scott McKay
6f2a5c3c46 More debug info 2024-12-30 10:37:06 +10:00
Scott McKay
0dc1e6ee61 Add Constant test 2024-12-30 09:42:34 +10:00
Wanming Lin
2d05c4bcd9
[WebNN] Support SkipSimplifiedLayerNormalization op (#23151)
The algorithm of `SkipSimplifiedLayerNormalization` is quite similar to
that of `SimplifiedLayerNormalization`. The only difference is that
`SkipSimplifiedLayerNormalization` provides an additional output holding
the sum of the input, skip and bias (if it exists).

Also fix a bug in `SimplifiedLayerNormalization` by adding the bias if it
exists.
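
The relationship between the two ops can be sketched for a single row as follows. This is an illustrative reference loop, not the WebNN implementation: the function and struct names are hypothetical, and it assumes the simplified normalization is a root-mean-square normalization scaled by `gamma`.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical result type: the normalized row plus the extra output that
// holds input + skip (+ bias).
struct SkipSimplifiedLayerNormResult {
  std::vector<float> output;
  std::vector<float> input_skip_bias_sum;
};

// Normalizes one row. `bias` may be empty, in which case it is not added.
// Assumes input, skip and gamma all have the same length.
SkipSimplifiedLayerNormResult SkipSimplifiedLayerNorm(
    const std::vector<float>& input, const std::vector<float>& skip,
    const std::vector<float>& gamma, const std::vector<float>& bias,
    float epsilon) {
  const std::size_t n = input.size();
  SkipSimplifiedLayerNormResult result;
  result.output.resize(n);
  result.input_skip_bias_sum.resize(n);

  // Step 1: element-wise sum, which is also the additional output.
  float sum_of_squares = 0.0f;
  for (std::size_t i = 0; i < n; ++i) {
    float s = input[i] + skip[i];
    if (!bias.empty()) s += bias[i];
    result.input_skip_bias_sum[i] = s;
    sum_of_squares += s * s;
  }

  // Step 2: simplified (RMS) normalization of the sum, scaled by gamma.
  const float inv_rms =
      1.0f / std::sqrt(sum_of_squares / static_cast<float>(n) + epsilon);
  for (std::size_t i = 0; i < n; ++i) {
    result.output[i] = result.input_skip_bias_sum[i] * inv_rms * gamma[i];
  }
  return result;
}
```

Step 2 on its own corresponds to `SimplifiedLayerNormalization` under the same assumption.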
2024-12-24 12:44:14 -08:00
liqun Fu
a9a881cc98
Integrate onnx 1.17.0 (#21897)
### Description
for ORT 1.21.0 release

Created the following issues to track tests skipped due to updated
ONNX operators in the ONNX 1.17.0 release:
https://github.com/microsoft/onnxruntime/issues/23162
https://github.com/microsoft/onnxruntime/issues/23164
https://github.com/microsoft/onnxruntime/issues/23163
https://github.com/microsoft/onnxruntime/issues/23161

### Motivation and Context

---------

Signed-off-by: Liqun Fu <liqfu@microsoft.com>
Signed-off-by: Liqun Fu <liqun.fu@microsoft.com>
Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
Co-authored-by: Yifan Li <109183385+yf711@users.noreply.github.com>
Co-authored-by: yf711 <yifanl@microsoft.com>
2024-12-24 09:02:02 -08:00
Scott McKay
002b6cc238 Fix some more builds/tests 2024-12-24 17:12:46 +10:00
Scott McKay
275f762b3d Last linux build error fixes. 2024-12-24 08:43:30 +10:00
Scott McKay
5e85fce91e Add missed changed. 2024-12-24 07:52:29 +10:00
Scott McKay
d8ef92b4ce Remove unused function to fix build error.
Fix some long lines.
2024-12-24 07:24:17 +10:00
Adrian Lizarraga
81cd6eacd0
[QNN EP] Fix multithread sync bug in ETW callback (#23156)
### Description

Fixes crash in QNN dlls when an ETW callback tries to change the QNN log
level. This is caused by a function that does not lock a mutex before
modifying the QNN log level.

### Motivation and Context
An ETW callback into QNN EP leads to a crash within QNN SDK dlls. It
happens in approximately 1 out of 3 full QNN unit test runs.

The cause is a multithreading synchronization bug in QNN EP. We're not
always locking a mutex when ETW calls QNN EP to notify of ETW config
change.
 
There are two branches in the QNN EP callback function that try to
update the QNN log handle. One branch correctly locks a mutex, but the other
does not lock it at all. This causes crashes within QNN dlls.
- Does not lock mutex:
[onnxruntime/onnxruntime/core/providers/qnn/qnn_execution_provider.cc at
main ·
microsoft/onnxruntime](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/qnn/qnn_execution_provider.cc#L426)
- Locks mutex:
[onnxruntime/onnxruntime/core/providers/qnn/qnn_execution_provider.cc at
main ·
microsoft/onnxruntime](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/qnn/qnn_execution_provider.cc#L442)

The fix is to lock the mutex in both paths.
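
The pattern can be sketched as follows. `LogLevelHolder` and its members are hypothetical stand-ins, not code from `qnn_execution_provider.cc`; the point is only that every branch that writes the shared state takes the same mutex first.

```cpp
#include <mutex>

// Hypothetical stand-in for the state that the ETW callback updates.
class LogLevelHolder {
 public:
  static constexpr int kDefaultLevel = 0;

  // Called from a callback that may run on another thread. Both branches
  // acquire the mutex before modifying the shared log level.
  void OnConfigChange(bool provider_enabled, int requested_level) {
    if (provider_enabled) {
      std::lock_guard<std::mutex> guard(mutex_);
      level_ = requested_level;
    } else {
      // This is the kind of branch where the lock was previously missing.
      std::lock_guard<std::mutex> guard(mutex_);
      level_ = kDefaultLevel;
    }
  }

  // Readers take the same lock so they never observe a partial update.
  int level() const {
    std::lock_guard<std::mutex> guard(mutex_);
    return level_;
  }

 private:
  mutable std::mutex mutex_;
  int level_ = kDefaultLevel;
};
```

Taking the lock once before the `if` would work equally well; the sketch locks per branch only to mirror the two code paths described above.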
2024-12-23 10:02:04 -08:00
Scott McKay
41fb824df9 Fix some linux build errors. 2024-12-23 21:49:37 +10:00
Scott McKay
351d12df9e Improve consistency. Update some comments. 2024-12-23 19:31:21 +10:00
Scott McKay
dece8b8e6a Model Builder API
- Create new model
- Augment existing model
2024-12-23 18:39:14 +10:00
amancini-N
c6ba7edd83
Enable pointer-generator T5 models in BeamSearch (#23134)
### Description
Introduces a new optional input (encoder_input_ids) in the decoder
graph of the T5 implementation for BeamSearch. This allows the use of
pointer-generator networks in the decoder graph.

### Motivation and Context
- Fixes #23123
2024-12-22 21:30:49 -08:00
Yueqing Zhang
ebdbbb7531
[VitisAI] Int4 support (#22850)
### Description
1. Add support for throwing error when hardware is not supported for
VitisAI.
2. Add support for unloading VitisAI EP.
3. Add API for Win25.

### Motivation and Context
This is a requirement for Win25.
2024-12-20 22:03:27 -08:00
Yulong Wang
6806174096
fix webgpu delay load test (#23157)
### Description

This change fixes the WebGPU delay load test.


<details>
<summary>Fix UB in macro</summary>

The following C++ code outputs `2, 1` in MSVC, while it outputs `1, 1`
in GCC:

```c++
#include <iostream>

#define A 1
#define B 1

#define ENABLE defined(A) && defined(B)

#if ENABLE
int x = 1;
#else
int x = 2;
#endif

#if defined(A) && defined(B)
int y = 1;
#else
int y = 2;
#endif

int main()
{
    std::cout << x << ", " << y << "\n";
}
```

Clang reports `macro expansion producing 'defined' has undefined
behavior [-Wexpansion-to-defined]`.

</details>
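
One common way to avoid this undefined behavior is to evaluate `defined` directly in an `#if` and publish the result as a plain 0/1 macro. The sketch below shows that pattern with hypothetical macro names; it is not necessarily the exact change made in this PR.

```cpp
// Hypothetical feature macros standing in for A and B above.
#define FEATURE_A 1
#define FEATURE_B 1

// `defined` appears literally in the #if, so no macro expansion ever
// produces the `defined` token.
#if defined(FEATURE_A) && defined(FEATURE_B)
#define ENABLE_BOTH 1
#else
#define ENABLE_BOTH 0
#endif

// ENABLE_BOTH now expands to an ordinary integer constant, which every
// conforming preprocessor evaluates the same way.
#if ENABLE_BOTH
int x = 1;
#else
int x = 2;
#endif
```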

<details>
<summary>Fix condition of build option
onnxruntime_ENABLE_DELAY_LOADING_WIN_DLLS</summary>

Delay load is explicitly disabled when the Python binding is being built.
This change modifies the condition accordingly.

</details>
2024-12-20 13:37:12 -08:00
Changming Sun
fcc34da5e9
Fix a tiny problem in winml.cmake (#23173)
### Description
CMake's
[target_link_libraries](https://cmake.org/cmake/help/latest/command/target_link_libraries.html#id2)
function accepts a plain library name (like `re2`), a target name (like
`re2::re2`), or some other kinds of names. "Plain library names" are
old-fashioned and exist for compatibility only. We should use target names.

### Motivation and Context
To make vcpkg work with winml build. See #23158
2024-12-20 11:48:43 -08:00
Dmitri Smirnov
00b262dbb4
Implement pre-packed blobs serialization on disk and their memory mapping on load (#23069)
### Description
Pre-packing is a feature that allows kernels to re-arrange weights data
to gain performance at inference time.

Currently, pre-packed blobs are shared when a cross-session weight
sharing is enabled and only for those weights that are marked as shared
by the user. Otherwise, data resides on the heap, the kernels own the
data which may be duplicated.

This change enables pre-packed data to be stored on disk alongside
the external initializers.
The pre-packed blobs are memory mapped and are loaded into either the
X-session shared container
or a new container that shares pre-packed blobs within the session.

With the new approach, pre-packed blobs are always owned by the shared
container using the existing pre-pack mechanism for sharing. When
X-session sharing is enabled, then the external container owns the data.
A separate container owned by a root `SessionState` owns and shares the
data when X-session sharing is not enabled.

To facilitate this new approach, we introduce a new container that works
in two modes. When an optimized model is being saved, and pre-packed
weights saving is enabled, the new container will record pre-packed
blobs and serialize them to disk using existing
`ToGraphProtoWithExternalInitializers` function.

To externalize the pre-packed weights, we introduce a new session option
`kOrtSessionOptionsSavePrePackedConstantInitializers`. Note that
pre-packing must be enabled (the default) for this to work.

The `ToGraphProtoWithExternalInitializers` function is modified to recurse
into subgraphs to make sure we properly account for local initializer
names.

In the second mode, the container would simply hold the pre-packed
weights memory-mapped from disk and share them with the kernels.

### Motivation and Context
Reduce memory usage by pre-packed initializers and externalize them.
2024-12-20 10:49:08 -08:00
xhcao
29bccad96d
[webgpu] fix compiling error (#23139)
### Description



### Motivation and Context
2024-12-20 09:05:23 -08:00
mingyue
4aca8f33df
[Bug Fix] Missing CustomOp SchemaRegister when generator EPContext ONNX model (#23091)
### Description
Enhancements to EPContext Operations:
1. Introduced support for the bfloat16 data type in EPContext operations.
2. Bug Fix: Missing Custom OP Schema Registration when generating the EPContext ONNX model

---------

Co-authored-by: mingyue <mingyue@xilinx.com>
Co-authored-by: Hector Li <hecli@microsoft.com>
2024-12-19 16:47:13 -08:00
Jiajia Qin
7c782f6741
[webgpu] Always use tile matmulnbits for block_size = 32 (#23140)
### Description
After the optimization of prefill time with #23102, it seems that always
using the tile matmulnbits with block_size = 32 can bring better
performance for the phi3 model, even on discrete GPUs.

Phi3 improves from 32.82 tokens/sec to 42.64 tokens/sec in easy mode on my
NV RTX 2000 GPU.
2024-12-19 16:22:53 -08:00