Commit graph

1674 commits

Author SHA1 Message Date
mindest
eecc11afc7
[ROCm] Disable ck_tile in Debug build (#21178)
### Description
tmp fix: disable ck_tile for Debug build.



### Motivation and Context
Release build works fine for ck_tile, while Debug build fails.
<details>
<summary> Typical error log to revisit
</summary>

```
[880/1797] Building HIP object CMakeFiles/onnxruntime_composable_kernel_fmha.dir/_deps/composable_kernel-build/fmha_fwd_d32_fp16_batch_b128x64x16x32x32x32_r2x1x1_w32x32x16_qr_async_vc_psddv.cpp.o
FAILED: CMakeFiles/onnxruntime_composable_kernel_fmha.dir/_deps/composable_kernel-build/fmha_fwd_d32_fp16_batch_b128x64x16x32x32x32_r2x1x1_w32x32x16_qr_async_vc_psddv.cpp.o 
/opt/rocm/llvm/bin/clang++ -DEIGEN_MPL2_ONLY -DENABLE_ROCM_PROFILING -DENABLE_STRIDED_TENSORS -DENABLE_TRAINING -DENABLE_TRAINING_APIS -DENABLE_TRAINING_CORE -DENABLE_TRAINING_OPS -DENABLE_TRAINING_TORCH_INTEROP -DMIOPEN_VERSION=30100 -DORT_ENABLE_STREAM -DROCM_VERSION=60100 -DUSE_ROCM=1 -D_GNU_SOURCE -D__HIP_ROCclr__=1 -D__bf16__ -D__fp16__ -D__fp32__ -I/build/Debug/_deps/utf8_range-src -I/ws/onnxruntime/include/onnxruntime -I/ws/onnxruntime/include/onnxruntime/core/session -I/ws/onnxruntime/orttraining/orttraining/training_api/include -I/build/Debug/_deps/composable_kernel-src/example/ck_tile/01_fmha -I/build/Debug/_deps/composable_kernel-src/include -I/build/Debug/_deps/composable_kernel-build/include -I/build/Debug/_deps/composable_kernel-src/library/include -isystem /opt/rocm-6.1.0/include -g -O -std=gnu++17 --offload-arch=gfx90a -fPIC -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -MD -MT CMakeFiles/onnxruntime_composable_kernel_fmha.dir/_deps/composable_kernel-build/fmha_fwd_d32_fp16_batch_b128x64x16x32x32x32_r2x1x1_w32x32x16_qr_async_vc_psddv.cpp.o -MF CMakeFiles/onnxruntime_composable_kernel_fmha.dir/_deps/composable_kernel-build/fmha_fwd_d32_fp16_batch_b128x64x16x32x32x32_r2x1x1_w32x32x16_qr_async_vc_psddv.cpp.o.d -o CMakeFiles/onnxruntime_composable_kernel_fmha.dir/_deps/composable_kernel-build/fmha_fwd_d32_fp16_batch_b128x64x16x32x32x32_r2x1x1_w32x32x16_qr_async_vc_psddv.cpp.o -x hip -c /build/Debug/_deps/composable_kernel-build/fmha_fwd_d32_fp16_batch_b128x64x16x32x32x32_r2x1x1_w32x32x16_qr_async_vc_psddv.cpp
In file included from /build/Debug/_deps/composable_kernel-build/fmha_fwd_d32_fp16_batch_b128x64x16x32x32x32_r2x1x1_w32x32x16_qr_async_vc_psddv.cpp:5:
In file included from /build/Debug/_deps/composable_kernel-src/example/ck_tile/01_fmha/fmha_fwd.hpp:6:
In file included from /build/Debug/_deps/composable_kernel-src/include/ck_tile/core.hpp:11:
/build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression
   27 |     asm volatile("s_add_u32 m0, %0, m0" : : "n"(v) : "memory");
      |                  ^
/build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression
/build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression
/build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression
/build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression
/build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression
/build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression
/build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression
/build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression
/build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression
/build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression
/build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression
/build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression
/build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression
/build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression
/build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression
/build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression
/build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression
/build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression
fatal error: too many errors emitted, stopping now [-ferror-limit=]
20 errors generated when compiling for gfx90a.
...
```
</details>
2024-06-27 12:04:17 +08:00
aciddelgado
ebd0368bb0
Make Flash Attention work on Windows (#21015)
### Description
Previously, Flash Attention only worked on Linux systems. This PR will
make it work and enable it to be built and run on Windows.

Limitations of Flash Attention in Windows: Requires CUDA 12.

### Motivation and Context
This will significantly increase the performance of Windows-based LLM's
with hardware sm>=80.

To illustrate the improvement of Flash Attention over Memory Efficient
Attention, here are some average benchmark numbers for the GQA operator,
run with configurations based on several recent models (Llama, Mixtral,
Phi-3). The benchmarks were obtained on RTX4090 GPU using the test
script located at
(onnxruntime/test/python/transformers/benchmark_gqa_windows.py).

* Clarifying Note: These benchmarks are just for the GQA operator, not
the entire model.

### Memory Efficient Attention Kernel Benchmarks:
| Model Name | Max Sequence Length | Inference Interval (ms) |
Throughput (samples/second) |

|----------------------------------------|---------------------|-------------------------|-----------------------------|
| Llama3-8B (Average Prompt) | 8192 | 0.19790525 | 13105.63425 |
| Llama3-8B (Average Token) | 8192 | 0.207775538 | 12025.10172 |
| Llama3-70B (Average Prompt) | 8192 | 0.216049167 | 11563.31185 |
| Llama3-70B (Average Token) | 8192 | 0.209730731 | 12284.38149 |
| Mixtral-8x22B-v0.1 (Average Prompt) | 32768 | 0.371928785 |
7031.440056 |
| Mixtral-8x22B-v0.1 (Average Token) | 32768 | 0.2996659 | 7607.947159 |
| Phi-3-mini-128k (Average Prompt) | 131072 | 0.183195867 | 15542.0852 |
| Phi-3-mini-128k (Average Token) | 131072 | 0.198215688 | 12874.53494 |
| Phi-3-small-128k (Average Prompt) | 65536 | 2.9884929 | 2332.584142 |
| Phi-3-small-128k (Average Token) | 65536 | 0.845072406 | 2877.85822 |
| Phi-3-medium-128K (Average Prompt) | 32768 | 0.324974429 | 8094.909517
|
| Phi-3-medium-128K (Average Token) | 32768 | 0.263662567 | 8978.463687
|

### Flash Attention Kernel Benchmarks:
| Model Name | Max Sequence Length | Inference Interval (ms) |
Throughput (samples/second) |

|--------------------------------------|---------------------|-------------------------|-----------------------------|
| Llama3-8B (Average Prompt) | 8192 | 0.163566292 | 16213.69057 |
| Llama3-8B (Average Token) | 8192 | 0.161643692 | 16196.14715 |
| Llama3-70B (Average Prompt) | 8192 | 0.160510375 | 17448.67753 |
| Llama3-70B (Average Token) | 8192 | 0.169427308 | 14702.62043 |
| Mixtral-8x22B-v0.1 (Average Prompt) | 32768 | 0.164121964 |
15618.51301 |
| Mixtral-8x22B-v0.1 (Average Token) | 32768 | 0.1715865 | 14524.32273 |
| Phi-3-mini-128k (Average Prompt) | 131072 | 0.167527167 | 14576.725 |
| Phi-3-mini-128k (Average Token) | 131072 | 0.175940594 | 15762.051 |
| Phi-3-small-128k (Average Prompt) | 65536 | 0.162719733 | 17824.494 |
| Phi-3-small-128k (Average Token) | 65536 | 0.14977525 | 16749.19858 |
| Phi-3-medium-128K (Average Prompt) | 32768 | 0.156490786 | 17679.2513
|
| Phi-3-medium-128K (Average Token) | 32768 | 0.165333833 | 14932.26079
|

Flash Attention is consistently faster for every configuration we
benchmarked, with improvements in our trials ranging from ~20% to ~650%.

In addition to these improvements in performance, Flash Attention has
better memory usage. For example, Memory Efficient Attention cannot
handle a max sequence length higher than 32,768, but Flash Attention can
handle max sequence lengths at least as high as 131,072.

---------

Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
2024-06-24 09:43:49 -07:00
Changming Sun
f5625b8858
Revert "[MIGraphX EP] enable compilation and execution on Windows (21084)" (#21132)
### Description

This reverts commit 1d7bf56947 because it
broken the AMD GPU CI pipeline. Sorry when I reviewed the PR I forgot to
run the AMD GPU CI pipeline.

Will revert the PR first then ask the author to fix the issue.
2024-06-21 01:01:07 -07:00
Jake Mathern
b9eb1dc21e
Update protobuf_cmake.patch to allow extra disablements configurable by projects that build ORT (#20875)
### Description
Update protobuf_cmake.patch to allow extra disablements. ORT repo
already patches protobuf to not disable the warning 4996.


### Motivation and Context

To meet SDL requirements, Microsoft repos have to fail build if there is
warning 4996
Binskim also gives errors if warning 4996 is disabled.
We can suppress the Binskim issues, but we need a way to disable the
warnings for the minimal set of code that has them.
Right now, WindowsAI disables 4996 for entirety of ORT, but it should
only be disabled for protobuf.
2024-06-20 16:28:15 -07:00
Ted Themistokleous
1d7bf56947
[MIGraphX EP] enable compilation and execution on Windows (#36) (#21084) 2024-06-20 16:21:11 -07:00
Changming Sun
be423747b1
Delete pyop (#21094)
### Description
Remove the "--enable_language_interop_ops" build flag, because the code
is incompatible with the latest numpy, and the build flag is not used
anywhere except a macOS CI pipeline. It does not seem to have a ship
plan.


### Motivation and Context
The build error was:
```
onnxruntime/core/language_interop_ops/pyop/pyop.cc:122:85: error: no member named 'elsize' in '_PyArray_Descr'
                                  static_cast<int64_t>(PyArray_DescrFromType(type)->elsize),
                                                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~  ^
```
2024-06-19 16:21:33 -07:00
Clément Péron
8ab8e649a7
tools: build: fix typo (#21052)
### Description
Typo in the python build script
2024-06-19 16:14:58 -07:00
Tianlei Wu
01279d8896
[ROCM] Exclude flash attention from hipify (#21091)
Exclude flash attention sub-directory from hipify.
2024-06-19 08:59:10 -07:00
cloudhan
ddd4ce3cb7
[ROCm] Update ck to use ck_tile (#21030) 2024-06-19 14:06:10 +08:00
Changming Sun
9ef4f1b789
Update pybind11 (#21072)
### Description
Upgrade pybind11 to the latest as suggested by @gnought in #21063

### Motivation and Context
Recently numpy released a new version, which caused compatibility issue
between the latest numpy version and the latest ONNX Runtime version.
2024-06-17 19:50:57 -07:00
Scott McKay
159fe9d4f3
Update to mobile model usability checker (#19843)
### Description
<!-- Describe your changes. -->

- Add check for CoreML MLProgram supported ops
- Only check usability with ORT Mobile package if requested
- this package will be deprecated so info is a) of minimal value and b)
can be confusing.
- Output more things at INFO level
- a lot of meaningful info was only output at DEBUG level. The default
INFO level is more useful
  - dump full partition info at DEBUG level
- Check subgraphs fully
  - CoreML can handle a subgraph
- TBD if we want to add support for adding a subgraph to the parent
graph for Loop and If nodes
    - most likely will be required for simple If nodes to be performant
- Check 5D CoreML limitation

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Improve helper tools

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2024-06-18 07:50:33 +10:00
Adrian Lizarraga
8f0e896c95
Fix Reduced Op build with empty FP16 kernel function tables (#21038)
### Description
- Fixes compilation error for "reduced operator" builds with no FP16
kernels and `MLAS_F16VEC_INTRINSICS_SUPPORTED` enabled.
- Fixes linker error for "reduced operator" builds with QNN EP by
excluding QNN EP unit tests. QNN EP unit tests require CPU EP operator
implementations to evaluate accuracy.


### Motivation and Context
Need to be able to build a reduced operator build with QNN EP. See
https://github.com/microsoft/onnxruntime/blob/main/docs/Reduced_Operator_Kernel_build.md

The following example operator config file causes a compilation error
when either `MLAS_F16VEC_INTRINSICS_SUPPORTED` is defined or QNN EP is
enabled.
```
# reduced_op_config.txt
ai.onnx;12;Add
```

```shell
python tools\ci_build\build.py --include_ops_by_config reduced_op_config.txt --config Debug --build_wheel --build_shared_lib --skip_tests --build_dir build --parallel --use_qnn --qnn_home '<QNN_ROOT_DIR>'
```
2024-06-14 14:23:12 -07:00
Ye Wang
f35dd1407f
custom allreduce cuda kernel (#20703)
### Description
<!-- Describe your changes. -->

Conditionally route to custom AllReduce kernel when buffer size and gpu
numbers meet certain requirements. Otherwise, keep using NCCL's
AllReduce.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: Ye Wang <wangye@microsoft.com@h100vm-ort.kxelwkzfzxguje5bxvwxxs135a.gvxx.internal.cloudapp.net>
Co-authored-by: Your Name <you@example.com>
2024-06-13 11:09:49 -07:00
Tianlei Wu
b3fc9b5a0e
[CUDA] upgrade cutlass to 3.5.0 (#20940)
### Description
Upgrade cutlass to 3.5 to fix build errors using CUDA 12.4 or 12.5 in
Windows
- [x] Upgrade cutlass to 3.5.0.
- [x] Fix flash attention build error with latest cutlass header files
and APIs. This fix is provided by @wangyems.
- [x] Update efficient attention to use new cutlass fmha interface.
- [x] Patch cutlass to fix `hrsqrt` not found error for sm < 53.
- [x] Disable TF32 Staged Accumulation to fix blkq4_fp16_gemm_sm80_test
build error for cuda 11.8 to 12.3.
- [x] Disable TRT 10 deprecate warnings. 

The following are not included in this PR:
* TRT provider replaces the deprecated APIs.
* Fix blkq4_fp16_gemm_sm80_test build error for cuda 12.4 or 12.5. This
test is not built by default unless you add `--cmake_extra_defines
onnxruntime_ENABLE_CUDA_EP_INTERNAL_TESTS=ON` in build command.

To integrate to rel-1.18.1: Either bring in other changes (like onnx
1.16.1), or generate manifest and upload a new ONNX Runtime Build Time
Deps artifact based on rel-1.18.1.

### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/19891
https://github.com/microsoft/onnxruntime/issues/20924
https://github.com/microsoft/onnxruntime/issues/20953
2024-06-11 13:32:15 -07:00
Changming Sun
92ae60b01f
Revert a cmake change in protobuf_cmake.patch (#20964)
Avoid patching external projects unless absolutely necessary
#20875
2024-06-10 11:20:33 -07:00
Edward Chen
981893c318
Remove deprecated "mobile" packages (#20941)
# Description

This PR removes the building of the ORT "mobile" packages and much of the associated infrastructure which is no longer needed.

Not removed yet - tools/ci_build/github/android/mobile_package.required_operators.config and the helper scripts that depend on it.

# Motivation and Context

The mobile packages were deprecated in 1.18. Users should use the full packages (Android - onnxruntime-android, iOS - onnxruntime-c/onnxruntime-objc) instead or do a custom build.
2024-06-07 16:20:32 -05:00
Changming Sun
f8b5c2805e
Update abseil-cpp.cmake: add version check (#20962)
Some dev environments come with a preinstalled abseil. For example,
conda users often do that. If the preinstalled abseil version is
incompatible with what we have in cmake/deps.txt, it could result in a
hard-to-understand build error. This PR adds a version check to improve
that.
2024-06-06 19:42:31 -07:00
liqun Fu
51bc53580d
Update to onnx 1.16.1 (#20702) 2024-06-04 11:06:28 -07:00
Changming Sun
d13cabf7f9
Upgrade GCC and remove the dependency on GCC8's experimental std::filesystem implementation (#20893)
### Description
This PR upgrades CUDA 11 build pipelines' GCC version from 8 to 11. 

### Motivation and Context

GCC8 has an experimental std::filesystem implementation which is not ABI
compatible with the formal one in later GCC releases. It didn't cause
trouble for us, however, ONNX community has encountered this issue much.
For example, https://github.com/onnx/onnx/issues/6047 . So this PR
increases the minimum supported GCC version from 8 to 9, and removes the
references to GCC's "stdc++fs" library. Please note we compile our code
on RHEL8 and RHEL8's libstdc++ doesn't have the fs library, which means
the binaries in ONNX Runtime's official packages always static link to
the fs library. It is just a matter of which version of the library, an
experimental one or a more mature one. And it is an implementation
detail that is not visible from outside. Anyway, a newer GCC is better.
It will give us the chance to use many C++20 features.

#### Why we were using GCC 8?
It is because all our Linux packages were built on RHEL8 or its
equivalents. The default GCC version in RHEL8 is 8. RHEL also provides
additional GCC versions from RH devtoolset. UBI8 is the abbreviation of
Red Hat Universal Base Image 8, which is the containerized RHEL8. UBI8
is free, which means it doesn't require a subscription(while RHEL does).
The only devtoolset that UBI8 provides is GCC 12, which is too new for
being used with CUDA 11.8. And our CUDA 11.8's build env is a docker
image from Nvidia that is based on UBI8.
#### How the problem is solved
Almalinux is an alternative to RHEL. Almalinux 8 provides GCC 11. And
the CUDA 11.8 docker image from Nvidia is open source, which means we
can rebuild the image based on Almalinux 8 to get GCC 11. I've done
this, but I cannot republish the new image due to various complicated
license restrictions. Therefore I put them at an internal location in
onnxruntimebuildcache.azurecr.io.
2024-06-03 10:14:08 -07:00
Changming Sun
65ef270e06
Update Aten pipeline's docker file to use UBI8 (#20856)
### Description
Now it uses CentOS 7 which is EOL. This PR updates it to UBI8.

### Motivation and Context
To deprecate CentOS 7 .
2024-05-30 07:38:15 -07:00
Carson M
5bfca1dc57
[Build] Change onnxruntime_NVCC_THREADS from option to cache entry (#20768)
### Description
Changes the `onnxruntime_NVCC_THREADS` CMake variable from an
[`option`](https://cmake.org/cmake/help/latest/command/option.html) to a
[cache
entry](https://cmake.org/cmake/help/latest/command/set.html#set-cache-entry).

### Motivation and Context
Fixes #19833.

`option` in CMake (confusingly, IMHO) always defines a *boolean* option.
The original definition of `onnxruntime_NVCC_THREADS` specified a
default of `1`, which I presume is coerced to `ON`. Thus, if the option
is not overridden with a value of another type, NVCC will receive a
malformed option `--threads ON` (rather than the expected `--threads
1`), which causes the error reported in #19833.

This error only occurred if compiling ONNX Runtime via CMake with
`-Donnxruntime_USE_CUDA=ON`; the CI build script always overrode
`onnxruntime_NVCC_THREADS` with a string value:

f1fef19b6e/tools/ci_build/build.py (L1152-L1154)
2024-05-29 12:28:33 -07:00
Chi Lo
a7bc49a565
[TensorRT EP] Use latest commit of onnx-tensorrt parser (#20758)
The 10 GA branch updated with several issues fixed.
https://github.com/onnx/onnx-tensorrt/commits/10.0-GA/
2024-05-24 16:44:16 -07:00
Changming Sun
b522df0ae4
Update RE2 to the latest (#20775)
Update RE2 to the latest.

To keep the components up to date.
2024-05-23 14:30:15 -07:00
Edward Chen
a39f8862fd
SQNBitGemm - move workspace size calculation functions to hardware-specific implementations (#20757)
The workspace usage may be hardware-specific. Moving away from a common workspace size calculation allows more flexibility in the hardware-specific implementations.
2024-05-22 15:12:17 -07:00
pengwa
8a98874e7e
Flash attention recompute (#20603)
### Flash attn recompute

1. Allow PythonOp(FlashAttn) can be recomputed correctly.
45879ff5c2
2. Use JSON to pass the selected-to-recompute subgraphs.
3c374da678

#### Better Memory Efficiency 

Customer model can run both PyTorch SPDA and Flash Attn, this PR make it
possible to let the Flash Attn path work with ORTModule layerwise
recompute. The peak drop from 45.xGB to 32.xGB if we only compare the
layers (not including other pieces, BTW there are few more optimization
targeting other pieces as well later).

#### Better Perf

Using Flash ATTN bring additionally 16% end to end time reduction, with
highly aligned loss curve.


![image](https://github.com/microsoft/onnxruntime/assets/10530022/bb63894a-f281-49bc-a8e6-ff818439be38)

#### Use JSON File to pass Recompute Plans

To overcome the limitation of max length of the strings defined in
session options.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-05-21 13:38:19 +08:00
Yulong Wang
036fcd93d4
[js/web] optimize module export and deployment (#20165)
### Description

This PR make numbers of optimizations to onnxruntime-web's module export
and deployment.

See each section below for more details.

#### Preview

>
[onnxruntime-web@1.19.0-esmtest.20240513-a16cd2bd21](https://www.npmjs.com/package/onnxruntime-web/v/1.19.0-esmtest.20240513-a16cd2bd21)

> ~~onnxruntime-web@1.19.0-esmtest.20240430-c7edbcc63d~~

> ~~onnxruntime-web@1.18.0-esmtest.20240428-624c681c83~~

> ~~onnxruntime-web@1.18.0-esmtest.20240411-1abb64e894~~

<details>
<summary><h4>Breaking changes</h4></summary>

There is no code change required, but there are a few differences
regarding **code import**, **flags**, **bundler config** and
**deployment steps**.

#### Importing:

Import table is changed. See following for details.

<details>
<summary><h5>Current import table:</h5></summary>

| Target Name | Path for "import" or "require" | WebGL | JSEP | wasm |
Proxy | Training |
  |------|-----|-----|-----|-----|-----|-----|
  | `ort` (default) | `onnxruntime-web` | ✔️ |  | ✔️ | ✔️ |  |
  | `ort.all` | `onnxruntime-web/experimental` | ✔️ | ✔️ | ✔️ | ✔️ |  |
  | `ort.node` | `onnxruntime-web` |  |  | ✔️ |  |  |
| `ort.training` | `onnxruntime-web/training` |  |  | ✔️ |
✔️<sup>\[1]</sup> | ✔️ |
  | `ort.wasm` | `onnxruntime-web/wasm` |  |  | ✔️ | ✔️ |  |
  | `ort.wasm-core` | `onnxruntime-web/wasm-core` |  |  | ✔️ |  |  |
| `ort.webgl` | `onnxruntime-web/webgl` | ✔️ |  |  | ✔️<sup>\[2]</sup>
|  |
  | `ort.webgpu` | `onnxruntime-web/webgpu` |  | ✔️ | ✔️ | ✔️ |  |

* [1] didn't test. may not actually work.
* [2] not working. this is a mistake in build config.

</details>

<details>
<summary><h5>Proposed update:</h5></summary>

| Target Name | Path for "import" or "require" | WebGL | JSEP | wasm |
Proxy | Training |
  |------|-----|-----|-----|-----|-----|-----|
  | `ort` (default) | `onnxruntime-web` | ✔️ |  | ✔️ | ✔️ |  |
| `ort.all` |
~~`onnxruntime-web/experimental`~~<br/>`onnxruntime-web/all` | ✔️ | ✔️ |
✔️ | ✔️ |  |
  | `ort.node` | `onnxruntime-web` |  |  | ✔️ |  |  |
  | `ort.training` | `onnxruntime-web/training` |  |  | ✔️ | ✔️ | ✔️ |
  | `ort.wasm` | `onnxruntime-web/wasm` |  |  | ✔️ | ✔️ |  |
| ~~`ort.wasm-core`~~ | ~~`onnxruntime-web/wasm-core`~~ | ~~~~ | ~~~~
| ~~✔️~~ | ~~~~ | ~~~~ |
  | `ort.webgl` | `onnxruntime-web/webgl` | ✔️ |  |  | ~~✔️~~  |  |
  | `ort.webgpu` | `onnxruntime-web/webgpu` |  | ✔️ | ✔️ | ✔️ |  |

</details>

#### Flags:

The following flags are deprecated:
- `env.wasm.simd` (boolean): will be ignored. SIMD is always enabled in
build.

The following flags changed their type:
- `env.wasm.wasmPaths`: When using this flag as a string ( for the URL
prefix ), nothing is changed. When using this flag as an object ( for
per-file path override ), the type changed:
  ```diff
  -  export interface Old_WasmFilePaths{
  -    'ort-wasm.wasm'?: string;
  -    'ort-wasm-threaded.wasm'?: string;
  -    'ort-wasm-simd.wasm'?: string;
  -    'ort-training-wasm-simd.wasm'?: string;
  -    'ort-wasm-simd-threaded.wasm'?: string;
  -  };
  +  export interface New_WasmFilePaths {
  +    /**
  +     * Specify the override path for the main .wasm file.
  +     *
  +     * This path should be an absolute path.
  +     *
  +     * If not modified, the filename of the .wasm file is:
  +     * - `ort-wasm-simd-threaded.wasm` for default build
+ * - `ort-wasm-simd-threaded.jsep.wasm` for JSEP build (with WebGPU and
WebNN)
  +     * - `ort-training-wasm-simd-threaded.wasm` for training build
  +     */
  +    wasm?: URL|string;
  +    /**
  +     * Specify the override path for the main .mjs file.
  +     *
  +     * This path should be an absolute path.
  +     *
  +     * If not modified, the filename of the .mjs file is:
  +     * - `ort-wasm-simd-threaded.mjs` for default build
+ * - `ort-wasm-simd-threaded.jsep.mjs` for JSEP build (with WebGPU and
WebNN)
  +     * - `ort-training-wasm-simd-threaded.mjs` for training build
  +     */
  +    mjs?: URL|string;
  +  }
  ```

#### Bundler compatibility:

Config changes are need for bundlers. See usage example in
/js/web/test/e2e/ for Webpack, parcel and rollup.

#### Deployment:

- if consuming from a CDN, there is no breaking change.
- if consuming from a local server, need to copy all `ort-*.wasm` and
`ort-*.mjs` files (totally 6 files) in the dist folder. (previously only
need to copy `ort-*.wasm` files.)

</details>
<details>
<summary><h4>Problems</h4></summary>

There are a few problems with the current module export and deployment:

- Script URL cannot be correctly inferred when imported as ESM.
- Workers are forcefully encoded using Blob URL, which makes
onnxruntime-web not working in CSP environment and Node.js, when using
proxy or multi-threading feature.
- Generated JS code (by Emscripten) is encoded using
`function.toString()`, which is unstable and error-prone.
- When running with a different Emscripten build, always need the build
step. Making it difficult to swap artifacts in deveopment/debug.
</details>
<details>
<summary><h4>Goals</h4></summary>

- Full ESM support
- Support variances of ways to import. Including:
- import from HTML's `<script>` tag (IIFE format, exporting to global
variable `ort`)
    ```html
<script
src="https://example.com/cdn-path-to-onnxruntime-web/dist/ort.min.js"></script>
    ```
  - import from source code inside `<script type="module">` tag (ESM)
    ```html
    <script type="module">
import * as ort from
"https://example.com/cdn-path-to-onnxruntime-web/dist/ort.min.mjs";

      // using 'ort'
    </script>
    ```
- import in a CommonJS project (CJS format, resolve from package.json
"exports" field)
    ```js
    // myProject/main.js
    const ort = require('onnxruntime-web');
    ```
- import in an ESM project (ESM format, resolve from package.json
"exports" field)
    ```js
    // myProject/main.js (or main.mjs)
    import * as ort from 'onnxruntime-web';
    ```
- Support popular bundlers when importing onnxruntime-web into a CJS/ESM
project.
  - webpack (esm requires extra post-process step)
  - rollup
  - parcel (esm requires extra post-process step)
  - More bundlers **TBD**
- Multi-threading support for Node.js

NOTE: keeping single JavaScript file (the all-in-one bundle) is no
longer a goal. This is because technically there is a conflict with the
other requirements.
</details>

<details>
<summary><h4>Important Design Decisions</h4></summary>

- Drop support of single JavaScript output.
- The current onnxruntime-web distribution uses a single JavaScript file
to include all code. While there are a few benefits, it also creates
problems as mentioned above. Since ESM is being used more and more
widely, and browsers are making more restricted security checks and
requirement, the old Blob based solution is going to be replaced.
- To achieve the requirement, specifically, the CSP environment support,
we have to offer a non Blob based solution. Therefore, we have to
distribute multiple files and drop the single file solution.

- Do not run parser/postprocess on Emscripten generated JavaScript.
- Emscripten is evolving quickly so we should only depends on what's in
its documentation instead of a certain implementation details. (for
example, currently we patch on its code to deal with a special variable
`_scriptDir`)
  - Keep the generated files as-is also helps to:
    - reduce the size of ort.min.js
- make it easier to replace build artifacts when in development/debug

- Drop support for non-SIMD and non-MultiThread. This helps to reduce
the number of artifacts in distribution.
  - (fixed-sized) SIMD is supported in any mainstream JS environment.
- Multi-thread as WebAssembly feature is supported in any mainstream JS
environment. In some environment the feature is guarded with cross
origin policy, but it can still work if not trying to create any worker.

- Use ESM output for Emscripten generated JavaScript.
- There are 2 ways to dynamically import classic (umd) modules and
neither of them are recommended:
- dynamically creating a <script> tag. This changes the HTML structure
and have quite a lot of compatibility issue
- use `fetch()` and `eval()`. However `eval` is strongly suggested to be
avoid because there is a great perf hit.
- importing ESM is super easy - just use the `import()` call.
Considering ESM is widely supported in modern browsers and Node.js this
is the better option.

- Add Blob based solution as a fallback for cross-origin workers.
- There are still wide use case of importing onnxruntime-web from CDN.
In this usage, make it able create worker by using `fetch()`+`Blob` to
create a same-origin Blob URL.

</details>

<details>
<summary><h4>Distribution File Manifest</h4></summary>

The distribution folder contains the following files:

- WebAssembly artifacts. These files are the result of compiling the
ONNX Runtime C++ code to WebAssembly by Emscripten.

  | File Name | Build Flags |
  |------|-----|
| ort-wasm-simd-threaded.mjs <br/> ort-wasm-simd-threaded.wasm |
`--enable_wasm_simd` <br/> `--enable_wasm_threads` |
| ort-training-wasm-simd-threaded.mjs <br/>
ort-training-wasm-simd-threaded.wasm | `--enable_training_apis` <br/>
`--enable_wasm_simd` <br/> `--enable_wasm_threads` |
| ort-wasm-simd-threaded.jsep.mjs <br/> ort-wasm-simd-threaded.jsep.wasm
| `--enable_wasm_simd` <br/> `--enable_wasm_threads` <br/> `--use_jsep`
<br/> `--use_webnn` |

- onnxruntime-web JavaScript artifacts. These files are generated by
ESBuild as the entry point for onnxruntime-web.

  There are multiple build targets for different use cases:
  | Target Name | Path for "import" or "require" | Description |
  |------|-----|-----|
  | `ort` | `onnxruntime-web` | The default target. |
  | `ort.all` | `onnxruntime-web/all` | The target including webgl. |
  | `ort.node` | `onnxruntime-web` | The default target for Node.js. |
| `ort.training` | `onnxruntime-web/training` | The target including
training APIs |
| `ort.wasm` | `onnxruntime-web/wasm` | The target including only
WebAssembly (CPU) EP |
| `ort.webgl` | `onnxruntime-web/webgl` | The target including only
WebGL EP |


  For each target, there are multiple files generated:
  | File Name | Description |
  |------|-----|
| [target].js | The entry point for the target. IIFE and CommonJS
format. |
  | [target].mjs | The entry point for the target. ESM format. |
| [target].min.js <br/> [target].min.js.map | The entry point for the
target. Minimized with sourcemap. IIFE and CommonJS format. |
| [target].min.mjs <br/> [target].min.mjs.map | The entry point for the
target. Minimized with sourcemap. ESM format. |
| [target].proxy.mjs | (if appliable) The proxy ESM module for the
target. |
| [target].proxy.min.mjs <br/> [target].proxy.min.mjs.map | (if
appliable) The proxy ESM module for the target. Minimized with
sourcemap. |

</details>

<details>
<summary><h4>Dynamic Import Explained</h4></summary>

- Local Served | No Proxy:
  ```
  [Bundle or ort.min.js]
    |
    + import()--> [ort-wasm-simd-threaded.mjs]
                    |
+ WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
                    |
+ new Worker()--> [ort-wasm-simd-threaded.mjs (worker)]
                                        |
+ WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
  ```
- Local Served | Proxy:
  ```
  [Bundle or ort.min.js]
    |
    + import()--> [ort.proxy.min.mjs]
                    |
                    + new Worker()--> [ort.proxy.min.mjs (worker)]
                                        |
+ import()--> [ort-wasm-simd-threaded.mjs]
                                                        |
+ WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
                                                        |
+ new Worker()--> [ort-wasm-simd-threaded.mjs (worker)]
|
+ WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
  ```
- Cross Origin | No Proxy:
  ```
  [Bundle or ort.min.js]
    |
    + fetch('ort-wasm-simd-threaded.mjs')
        |
        + URL.createObjectURL(res.blob())
        |
        + import()--> [blob:... (ort-wasm-simd-threaded)]
                        |
+ WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
                        |
+ new Worker()--> [blob:... (ort-wasm-simd-threaded) (worker)]
                                            |
+ WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
  ```

- Cross Origin | Proxy
  ```
  [Bundle or ort.min.js]
    |
    + fetch('ort.proxy.min.mjs')
        |
        + URL.createObjectURL(res.blob())
        |
        + import()--> [blob:... (ort.proxy)]
                        |
+ new Worker()--> [blob:... (ort.proxy) (worker)]
                                            |
+ fetch('ort-wasm-simd-threaded.mjs')
                                                |
+ URL.createObjectURL(res.blob())
                                                |
+ import()--> [blob:... (ort-wasm-simd-threaded)]
                                                                |
+ WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
                                                                |
+ new Worker()--> [blob:... (ort-wasm-simd-threaded) (worker)]
|
+ WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
  ```
</details>
2024-05-20 09:51:16 -07:00
cloudhan
5d07291247
hipify int4 gemv (#20666)
Hipify MatMulNBits to accommodate the need of Phi3 onnx release.
2024-05-18 16:59:03 +08:00
Edward Chen
e81c8676e3
MatMulNBits + Add fusion (#20587)
- Add MatMulNBits Bias input
- Add graph transformer to fuse MatMulNBits + Add
2024-05-16 11:00:59 -07:00
Hector Li
0e11d0c4f8
Enable Qnn nuget nightly (#20662)
### Description
Enable Qnn nuget nightly
2024-05-13 21:28:43 -07:00
Jon Campbell
768c79317c
Enable QNN HTP support for Node (#20576)
### Description
Add support for using Onnx Runtime with Node

### Motivation and Context
Onnx Runtime supports the QNN HTP, but does not support it for Node.js.
This adds baseline support for the Onnx Runtime to be used with Node.

Note it does not update the node packages that are distributed
officially. This simply patches the onnxruntime.dll to allow 'qnn' to be
used as an execution provider.

Testing was done using the existing onnxruntime-node package. The
`onnxruntime.dll` and `onnxruntime_binding.node` were swapped into
`node_modules\onnxruntime-node\bin\napi-v3\win32\arm64` with the newly
built version, then the various QNN dlls and .so files were placed next
to the onnxruntime.dll. Testing was performed on a variety of models and
applications, but the easiest test is to modify the [node quickstart
example](https://github.com/microsoft/onnxruntime-inference-examples/tree/main/js/quick-start_onnxruntime-node).
2024-05-09 13:11:07 -07:00
George Wu
58d7b12205
support --arm64ec for qnn ep build (#20607)
link against binaries in arm64x-windows-msvc when building qnn ep with
--arm64ec build option.
2024-05-08 11:09:15 -07:00
Scott McKay
8d09baf49f
Clarify when protobuf dependency builds protoc (#20542)
### Description
<!-- Describe your changes. -->
Currently figuring out if the protobuf dependency is building protoc it
is a little obtuse and inconsistent
* in some places we directly set protobuf_BUILD_PROTOC_BINARIES to OFF
to indicate the protobuf dependency is not building protoc
  * e.g. macOS/iOS/visionOS builds
* for a user provided protoc path we don't set
protobuf_BUILD_PROTOC_BINARIES, and inside protobuf_function.cmake that
determines if `protobuf::protoc` is added as a dependency or not
*
0dda8b0c44/cmake/external/protobuf_function.cmake (L40-L45)

To be more consistent/explicit, set protobuf_BUILD_PROTOC_BINARIES to
OFF when ONNX_CUSTOM_PROTOC_EXECUTABLE set and valid.

Remove outdated script that built and external protoc binary which was
used in later builds. The build setup will fetch a pre-built protoc so
there's no need for this additional build.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Make it easier to figure out if protoc is coming from the protobuf
dependency.
2024-05-08 08:30:11 +10:00
moyo1997
aff04ba08a
Dev/mookerem/arm64x update (#20536)
Made some changes to the arm64x.cmake script to:
- handle edge case
- Enable Projects that include onnxruntime as submodule and build it, to
be able to build as x without causing onnxruntime build_as_x to fail.
2024-05-07 12:50:38 -07:00
Chi Lo
c86476a636
[TensorRT] adapt for TRT lib name change after TRT 10 GA (update) (#20550)
https://github.com/microsoft/onnxruntime/pull/20445
The nvonnxparser still needs major version appending to it when building
oss parser.
2024-05-06 15:00:13 -07:00
Scott McKay
f9febc4f35
Remove usage of 'required reason' iOS API from protobuf (#20529)
### Description
<!-- Describe your changes. -->

Using certain APIs is about to require a [privacy
manifest](https://developer.apple.com/documentation/bundleresources/privacy_manifest_files/describing_use_of_required_reason_api)
to be added to a package.

Our version of protobuf uses `mach_absolute_time`. Patch as per
https://github.com/protocolbuffers/protobuf/pull/15662/ to remove usage.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Usage of API will require a privacy manifest for an iOS app to be
accepted as of 5/1/2024
#20519
2024-05-02 08:21:08 +10:00
Yifan Li
29417762f7
[TensorRT EP] support TensorRT 10-GA (#20506)
### Description
<!-- Describe your changes. -->
This branch is based on rel-1.18.0 and supports TensorRT 10-GA.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-05-01 11:10:53 -07:00
Hector Li
755aaea9a6
Qnn nuget update (#20527)
### Description
Update Qnn nuget package to include Qnn libs and license file
2024-04-30 22:12:53 -07:00
Tianlei Wu
9f0fae29e8
[CUDA] Add SparseAttention operator for Phi-3-small (#20216)
### Description
Add CUDA implementation for block sparse attention for Phi-3-small.

Block sparse attention was proposed in [Sparse
Transformers](https://arxiv.org/pdf/1904.10509) by OpenAI, and also
adopted in [BigBird](https://arxiv.org/pdf/2007.14062) with different
sparse layout.

In Phi-3-small, the sparse layout is static, and works with
unidirectional (causal) attention.

Compared to dense attention, the benefit of block sparse is to speed up
both training and inference. It could save memory thus support longer
context length.

- [x] Add operator spec and shape inference
- [x] Symbolic shape inference
- [x] Refactor GroupQueryAttention to expose common kernels for kv cache
concatenation, q/k/v transpose etc.
- [x] Add cuda kernel to convert block mask to CSR format
- [x] Add cuda kernel to generate position ids
- [x] Add compile script and template files to convert triton kernel to
cubin and dispatcher.
- [x] Add triton kernel v1 for prompt
- [x] Add triton kernel v2 for token generation and support padding
- [x] Update IO Binding Helper to allow buffer sharing.
- [x] Test relevance
- [x] Test performance

### Performance
Test in A100-SXM4-80GB with `batch_size=4, num_heads=32,
max_seq_len=8192, head_size=128, sparse_block_size=64, local_blocks=16,
vert_stride=8, num_layout=8`

We compare sparse attention to corresponding GQA with local attention
windows size 1024, or GQA with dense causal.

Average latency in milliseconds (for fused attention kernel used in
prompt prefilling):

seq_len | GQA-Dense | GQA-Local | SparseAttention
-- | -- | -- | --
64 | 0.0465 | 0.0722 | 0.0641
128 | 0.0618 | 0.0787 | 0.0672
256 | 0.1086 | 0.1076 | 0.0943
512 | 0.2535 | 0.2487 | 0.1676
1024 | 0.7042 | 0.7050 | 0.3800
2048 | 2.4125 | 1.9316 | 0.8966
4096 | 8.9346 | 4.5699 | 2.1129
8192 | 40.5401 | 10.3508 | 5.1748

Average latency in milliseconds (for fused attention kernel used in
token generation:

past_seq_len | GQA-Dense | GQA-Local | SparseAttention
-- | -- | -- | --
64 | 0.0186 | 0.0186 | 0.0870
128 | 0.0408 | 0.0466 | 0.1165
256 | 0.0530  | 0.0592 | 0.0988
512 | 0.0445| 0.0447 | 0.1150
1024 | 0.0634  | 0.0640 | 0.1454
2048 | 0.1027 | 0.0637 | 0.1589
4096 | 0.1789 | 0.0631 | 0.1806
8192 | 0.3288 | 0.0655 | 0.2146

We can see that the kernel for token generation still have room to
improve.

#### Limitations
Only support right-side padding and unidirectional attention.

The following are not supported in the first version:
(1) Packed mode like PackedMultiHeadAttention where input has been
removed padding.
(2) paged attention.
(3) bidirectional attention.
(4) GPU compute capacity that is not 8.0, 8.6 and 8.9.
(5) Left side padding.

Some of these limitations will be removed in the future (may be in a new
operator).
2024-04-30 09:06:29 -07:00
Edward Chen
358f5bb022
Update regex to match correct pattern. (#20483)
In CMakeLists.txt:set_msvc_c_cpp_compiler_warning_level(), the regex should match the value that gets added by the function. The latter got updated, so this change updates the former to match.
2024-04-29 10:43:31 -07:00
George Wu
49b2bebe85
[qnn ep] include qnn sdk in onnxruntime-qnn python whl (#20485)
script changes to include qnn sdk libs with onnxruntime-qnn python
package.
2024-04-29 09:44:54 -07:00
zz002
50e41984b0
[VitisAI] Solve the problem that gsl cannot be found when compiling under linux (#20466)
### Description
<!-- Describe your changes. -->

[VitisAI] Solve the problem that gsl cannot be found when compiling
under linux

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>
2024-04-28 20:56:16 -07:00
Hector Li
d2d4639ddb
fix the build issue for Win Arm64 Release build (#20475)
### Description
Fix the build error for Win ARM64 Release build.
graph_transform_test.cc(1,1): error C1128: number of sections exceeded
object file format limit: compile with /bigobj
[D:\build\Windows\Release\onnxruntime_test_all.vcxproj]


### Motivation and Context
Fix issue: https://github.com/microsoft/onnxruntime/issues/20406
2024-04-25 22:08:19 -07:00
liqun Fu
cc26b2dac2
Mlas Gemm 4bit avx2, avx512, and avx512vnni kernels (#20163)
### Description

```
Avx2:
Int8

NS(Prompt)		MLAS(Prompt)  	MLAS(Prompt)Gain/Loss		NS(TokenGen)		MLAS(TokenGen)  	MLAS(TokenGen)Gain/Loss
Blklen16: 	90.96			25.15			-72%					7.65				11.71			53%
Blklen32:	90.73			48.55			-46%					7.86				14.28			81%
Blklen64:	89.49			68.84			-23%					8.30				15.78			90%
Blklen128:	87.38			78.37			-10%					7.90				16.05			103%
Blklen256:	89.45			82.36			-7%					8.30				16.56			99%

Fp32		
NS(Prompt)		MLAS(Prompt)  	MLAS(Prompt)Gain/Loss		NS(TokenGen)		MLAS(TokenGen)  	MLAS(TokenGen)Gain/Loss
Blklen16:	91.36			105.18		15%				7.57			9.52		25%
Blklen32:	89.30			105.99			18%					7.65				9.68			26%
Blklen64:	89.53			101.41			13%					7.97				9.84			23%
Blklen128:	85.23			99.71			16%					7.86				10.39			32%
Blklen256:	88.46			97.94			10%					8.32				10.23			22%

Avx512vnni:
Int8		
NS(Prompt)		MLAS(Prompt)  	MLAS(Prompt)Gain/Loss		NS(TokenGen)		MLAS(TokenGen)  	MLAS(TokenGen)Gain/Loss
Blklen16:	132.18			21.56			-83%					10.34				11.48			11%
Blklen32:	168.28			43.69			-74%					11.85				14.73			24%
Blklen64:	201.81			60.29			-70%					12.36				15.47			25%
Blklen128:	194.92			57.04			-71%					13.03				14.67			12%
Blklen256:	218.76			70.20			-68%					13.33				16.31			22%

Fp32		
NS(Prompt)		MLAS(Prompt)  	MLAS(Prompt)Gain/Loss		NS(TokenGen)		MLAS(TokenGen)  	MLAS(TokenGen)Gain/Loss
Blklen16:	102.81			92.74			-9%					8.41				9.18			9%
Blklen32:	109.49			97.08			-11%					8.83				11.51			30%
Blklen64:	104.13			101.57			-2%					9.32				12.00			28%
Blklen128:	108.45			103.69			-4%					9.58				12.45			29%
Blklen256:	109.43			106.43			-2%					9.19				12.2			32%

```

---------

Signed-off-by: Liqun Fu <liqfu@microsoft.com>
Signed-off-by: liqunfu <liqun.fu@microsoft.com>
Co-authored-by: edgchen1 <18449977+edgchen1@users.noreply.github.com>
2024-04-25 21:30:50 -07:00
Chi Lo
a077330c3e
[TensorRT] adapt for TRT lib name change after TRT 10 GA (#20445)
For TensorRT 10 GA onwards, the TensorRT libraries will have major
version appended to the end on Windows, for example, nvinfer_10.dll,
nvinfer_plugin_10.dll, nvonnxparser_10.dll ...

Change cmake file accordingly.
2024-04-24 21:46:54 -07:00
Edward Chen
9cc5badc49
Fix Objective-C static analysis warnings. (#20417)
Replace most usages of [NSString stringWithUTF8String:] with checked helper function. The issue is that the former can return nil.
2024-04-24 11:48:29 -07:00
Rachel Guo
14fcf0a52d
Support visionos build (#20365)
### Description
<!-- Describe your changes. -->

This PR supports a build of onnxruntime.xcframework for xros/xrsimulator
for visionos via the build command of

`python3 tools/ci_build/github/apple/build_apple_framework.py --config
Release/Debug
tools/ci_build/github/apple/default_vision_os_framework_build_settings.json`.

For officially include visionos in ios cocoapods package and testing in
CI, would require separate work for upgrading the Xcode version &
upgrade macOS CI agent to macos-13-arm64 or higher.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

visionos support:
https://github.com/microsoft/onnxruntime/discussions/19313

---------

Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>
2024-04-23 18:15:07 -07:00
Scott McKay
9372e9a0a3
Support >2GB of Tensor data in training checkpoint (#20077)
### Description
<!-- Describe your changes. -->
Add ability to store initializer data in an external file.
Update training checkpoint code to use external file if data > ~2GB.

I don't see a way for the flatbuffers 64-bit offsets to be used, as they
don't support storing 'table' types with 64-bit offsets (and our Tensor
is a 'table' type not a simple struct).


0cfb7eb80b/tests/64bit/test_64bit.fbs (L38-L39)

Allowing a Tensor to have its raw_data in an external file should
hopefully work with the least friction. As it's an extra field it's
backwards compatible.

Please feel free to suggest alternative approaches. 

Side note: the diffs in the generated *.fbs.h files are unexpectedly
large. Maybe they weren't re-generated when the new flatbuffers version
was checked in. I updated by running:
`python .\compile_schema.py -f <build output
dir>\_deps\flatbuffers-build\Debug\flatc.exe`
from onnxruntime\core\flatbuffers\schema which I thought was the correct
way but maybe that's out of date.

I think you can ignore all the diffs in the generated files and just
worry about the changes to the .fbs files in
onnxruntime/core/flatbuffers/schema. Basically start at the bottom of
the files changed and work up as all the 'real' diffs are there.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: carzh <wolfivyaura@gmail.com>
2024-04-22 15:17:43 -07:00
Yulong Wang
a457c1df80
upgrade emsdk to 3.1.57 (#20295)
### Description
upgrade emsdk to 3.1.57
2024-04-19 23:05:18 -07:00
sfatimar
4d1963c2a2
OpenVINO EP Rel 1.18 Changes (#20337)
### Description
These changes include
Support to OpenVINO 2024.1 
Import PreCompiled Blobs with EPContext Blob 
Separate Device/Precision as input
Deprecate CPU_FP32 , GPU_FP32 terminology , introduce CPU, GPU 
AUTO GPU, CPU will only create GPU Blob and not CPU Blob. 



### Motivation and Context
- OpenVINO 2024.1 will be out soon
- Import Precompiled Blob can greatly reduce FEIL/FIL Time. 
- Separating Device/Precision will make the input cleaner
-

---------

Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
2024-04-19 00:31:38 -07:00
Patrice Vignola
12569626cb
Update DML to 1.14.1 (#20380)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-04-18 22:43:41 -07:00