Commit graph

472 commits

Author SHA1 Message Date
Yulong Wang
7a8fa12850
Add implementation of WebGPU EP (#22591)
### Description

This PR adds the actual implementation of the WebGPU EP based on
https://github.com/microsoft/onnxruntime/pull/22318.

This change includes the following:

<details>
<summary><b>core framework of WebGPU EP</b></summary>

  - WebGPU EP factory classes for:
    - handling WebGPU options
    - creating WebGPU EP instance
    - creating WebGPU context
  - WebGPU Execution Provider classes
    - GPU Buffer allocator
    - data transfer
  - Buffer management classes
    - Buffer Manager
    - BufferCacheManager
      - DisabledCacheManager
      - SimpleCacheManager
      - LazyReleaseCacheManager
      - BucketCacheManager
  - Program classes
    - Program (base)
    - Program Cache Key
    - Program Manager
  - Shader helper classes
    - Shader Helper
    - ShaderIndicesHelper
    - ShaderVariableHelper
  - Utils
    - GPU Query based profiler
    - compute context
    - string utils
  - Miscs
    - Python binding webgpu support (basic)
 
</details>

<details>
<summary><b>Kernel implementation</b></summary>


  - onnx.ai (default opset):
- Elementwise (math): Abs, Neg, Floor, Ceil, Reciprocal, Sqrt, Exp, Erf,
Log, Sin, Cos, Tan, Asin, Acos, Atan, Sinh, Cosh, Asinh, Acosh, Atanh,
Tanh, Not, Cast
- Elementwise (activation): Sigmoid, HardSigmoid, Clip, Elu, Relu,
LeakyRelu, ThresholdedRelu, Gelu
- Binary (math): Add, Sub, Mul, Div, Pow, Equal, Greater,
GreaterOrEqual, Less, LessOrEqual
    - (Tensors): Shape, Reshape, Squeeze, Unsqueeze
    - Where
    - Transpose
    - Concat
    - Expand
    - Gather
    - Tile
    - Range
    - LayerNormalization
  - com.microsoft
    - FastGelu
    - MatMulNBits
    - MultiHeadAttention
    - RotaryEmbedding
    - SkipLayerNormalization
    - LayerNormalization
    - SimplifiedLayerNormalization
    - SkipSimplifiedLayerNormalization

</details>

<details>
<summary><b>Build, test and CI pipeline integration</b></summary>

  - build works for Windows, macOS and iOS
  - support onnxruntime_test_all and python node test
  - added a new unit test for `--use_external_dawn` build flag.
  - updated MacOS pipeline to build with WebGPU support
  - added a new pipeline for WebGPU Windows

</details>

This change does not include:

- Node.js binding support for WebGPU (will be a separate PR)
2024-10-29 18:29:40 -07:00
Indy Zhu
e2e837584f
[DML EP] Update DML to 1.15.4 (#22635)
### Description
[DML EP] Update DML to 1.15.4



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
We want the customer to use the latest DirectML.
2024-10-29 17:13:57 -07:00
Changming Sun
88676e62b9
Remove nsync (#20413)
### Description
1. Remove the onnxruntime::OrtMutex class and replace it with
~absl::Mutex~ std::mutex.
2. After this change, most source files will not include <Windows.h>
indirectly.


### Motivation and Context
To reduce the number of deps we have, and address some Github issues
that are related to build ONNX Runtime from source.
In PR #3000 , I added a custom implementation of std::mutex . It was
mainly because at that time std::mutex's default constructor was not
trivial on Windows. If you had such a mutex as a global var, it could
not be initialized at compile time. Then VC++ team fixed this issue.
Therefore we don't need this custom implementation anymore.

This PR also removes nsync. I ran several models tests on Linux. I
didn't see any perf difference.
This PR also reverts PR #21005 , which is no longer needed since conda
has updated its msvc runtime DLL.

This PR unblocks #22173 and resolves #22092 . We have a lot of open
issues with nsync. This PR can resolve all of them.
2024-10-21 15:32:14 -07:00
Ted Themistokleous
572e43c5d7
[MIGraphX EP/ ROCm EP] add gfx1200, gfx1201 to CMAKE_HIP_ARCHITECTURES (#22348)
### Description
Add additonal gfx targets for AMD GPU support


### Motivation and Context
Required to integrate mainline onnxruntime support for AMD GPUs

---------

Co-authored-by: Stefan Sokolovic <stsokolo@amd.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2024-10-11 17:31:36 -07:00
Changming Sun
2bef89c171
Upgrade absl to the latest released version (#22365)
### Description
Resolve #21976 .  
 
ABSL generally does not have forward/backward compatibility. Our code is
only compatible with one fixed LTS version. So it's important to fix the
version number there when using find_package to detect an installed
version.
2024-10-09 20:21:40 -07:00
Yulong Wang
c5d28cac4d
Initial WebGPU EP checkin (#22318)
### Description

This change introduces the WebGPU EP into ONNX Runtime.

To make the PR as simple as possible, this PR excluded the following:
- C API changes for WebGPU EP
- actual implementation of WebGPU EP. Currently in this PR, WebGPU is a
stub implementation that does not register any kernel.
- Python IO Binding update
- Node.js IO Binding update

This PR now contains only 43 file changes (while the working branch
contains 130+) and hopefully this makes it easier to review.

There is going to be separated PRs for each mentioned above.

Current working branch: #21904
2024-10-08 16:10:46 -07:00
Tianlei Wu
f3f33bfa05
Upgrade cutlass to 3.5.1 and cudnn frontend to 1.7.0 (#22316)
### Description
Upgrade cutlass to 3.5.1
Upgrade cudnn_frontend to 1.7.0
2024-10-04 11:48:50 -07:00
Edward Chen
f1be92faf0
Patch fp16 to fix Xcode 16 builds with XNNPACK EP targeting x86_64. (#22294) 2024-10-03 14:17:15 -07:00
Yufeng Li
96e9c99dce
remove neural-speed (#22236)
### Description
<!-- Describe your changes. -->
NS is not developed anymore and ORT doesn't use it for int4 inference
either. Remove it to clean up the code


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-10-01 09:50:44 -07:00
Sumit Agarwal
529835cc46
[DML EP] Update DML to 1.15.2 (#22247)
### Description
Update DML binary to the current latest redist version
[1.15.2](https://www.nuget.org/packages/Microsoft.AI.DirectML/1.15.2).
2024-09-27 13:20:29 -07:00
Edward Chen
209ff86d52
Get build working on Xcode 16 (#22168) 2024-09-24 08:33:03 -07:00
Yi Zhang
8d2d40781c
set CMAKE_SYSTEM_PROCESSOR in xnnpack.cmake (#22155)
### Description
<!-- Describe your changes. -->



### Motivation and Context
By default, CMAKE_SYSTEM_PROCESSOR is same CMAKE_HOST_SYSTEM_PROCESSOR
https://cmake.org/cmake/help/latest/variable/CMAKE_SYSTEM_PROCESSOR.html
KleidiAI uses CMAKE_SYSTEM_PROCESSOR to determine whether to include
some arm64 ukernels.
https://gitlab.arm.com/kleidi/kleidiai/-/blob/main/CMakeLists.txt#L134
We use Mac with Intel CPU to cross compile MAC with ARM in ios packaging
pipeline
So we need to make CMAKE_SYSTEM_PROCESSOR same with ORT_TARGET_PROCESSOR
2024-09-20 15:19:26 -07:00
Yi Zhang
b94ba09e4f
Upgrade XNNPACK to latest version (#22012)
### Description
Update XNNPack to latest version (Sep 4)
- Some op outputs are changed, channel or stride paras are moved into
reshape func.
e.g.
96962a602d
- input params of xnnpack's resize related function are changed a lot
- KleidiAI is added as a dependency in ARM64
- The latest XNNPACK includes 2 static libs microkernels-prod and
xnnpack.
Without microkernels-prod, it throws the exception of Undefined symbols.
- Add ORT_TARGET_PROCESSOR to get the real processor target in CMake
2024-09-17 10:12:16 -07:00
PARK DongHa
f633caa0b1
Create CMake option onnxruntime_USE_VCPKG (#21348)
### Changes

1. CMake option `onnxruntime_USE_VCPKG`. It will be used in the vcpkg
port
* Unit test may fail because this option leads to a mixture of
unexpected external library versions.
     Especially ONNX, Protobuf, and Flatbuffers version can be different
2. Overhaul of `onnxruntime_external_deps.cmake`
   * Make `FetchContent_Declare` to try `find_package`.  
See
https://cmake.org/cmake/help/latest/guide/using-dependencies/index.html
* Relocated `FetchContent_Declare` and `FetchContent_MakeAvailable`(or
`onnxruntime_fetchcontent_makeavailable`) to closer lines.
It was too hard to navigate the entire file to search related
sections...
* Alias `IMPORTED` targets like build targets (e.g. `ONNX::onnx` -->
`onnx`)

```cmake
# The script uses `find_package` with the changes.
# In this case, use vcpkg to search dependencies
# See https://cmake.org/cmake/help/latest/guide/using-dependencies/index.html
include(external/onnxruntime_external_deps.cmake)
```

3. Create CMakePresets.json and presets to [run vcpkg in manifest
mode](https://learn.microsoft.com/en-us/vcpkg/concepts/manifest-mode)
   * Currently, it's NOT for training build
   * Main triplets are `x64-windows` and `x64-osx`

```pwsh
Push-Location "cmake"
    cmake --preset "x64-windows-vcpkg"
    cmake --build --preset "x64-windows-vcpkg-debug"
Pop-Location
```
```bash
pushd "cmake"
    cmake --preset "x64-osx-vcpkg"
    cmake --build --preset "x64-osx-vcpkg-debug"
popd
```

4. Updated tools/ci_build/build.py
* `--use_vcpkg` option: it needs `CMAKE_TOOLCHAIN_FILE` with
[vcpkg.cmake toolchain
script](https://github.com/microsoft/vcpkg/blob/master/scripts/buildsystems/vcpkg.cmake)
* `--compile_no_warning_as_error` is recommended because library version
differences will cause unexpected compiler warnings

```bash
python ./tools/ci_build/build.py \
    --compile_no_warning_as_error \
    --use_vcpkg \
    --cmake_extra_defines "CMAKE_TOOLCHAIN_FILE:FILEPATH=${VCPKG_ROOT}/scripts/buildsystems/vcpkg.cmake" \
    --cmake_extra_defines "VCPKG_TARGET_TRIPLET=..."
```

5. Created Job `Vcpkg` for Windows and macOS
   * Show how to setup and use vcpkg.  
     Similar to the CMakePresets.json usage

### Motivation and Context

* Help #7150
* Help https://github.com/microsoft/vcpkg/pull/36850
   * https://github.com/luncliff/vcpkg-registry/pull/212
   * https://github.com/microsoft/vcpkg/pull/39881
* https://github.com/luncliff/vcpkg-registry/pull/215
   * https://github.com/luncliff/vcpkg-registry/pull/216
   * https://github.com/luncliff/vcpkg-registry/pull/227
*
https://cmake.org/cmake/help/latest/guide/using-dependencies/index.html
*
https://github.com/microsoft/vcpkg/blob/master/scripts/buildsystems/vcpkg.cmake

### Future Works?

More feature coverage with the vcpkg supported libraries

* CUDA feature support
* Training feature support
2024-09-10 16:39:27 -07:00
Ranjit Ranjan
02e3a430af
[AIX] Python binding enablement and gcc support (#21934)
### Description
Enabling python binding and gcc support for AIX.



### Motivation and Context
Code changes in this PR contains:
1. python binding enablement
2. gcc building support


Below are list of files and the description.

1. cmake/CMakeLists.txt
[gcc building support] -no-unused-function compiler flag addition for
IBMClang
2. cmake/external/eigen.cmake
[gcc building support] AIX check for applying the AIX patch
3. cmake/onnxruntime_python.cmake
[python binding ] putting NOT AIX check for -Xlinker
4. cmake/onnxruntime_unittests.cmake
[gcc building support] Fix for gtest behavior. Check the comment .
[python binding ] using -Wl,-brtl for linking
onnxruntime_providers_shared in test_execution_provider
5. cmake/patches/eigen/eigen-aix.patch
[gcc building support] In AIX gcc, we are hitting
__builtin_cpu_supports("mma") which is not supported yet. So patching
code for this method . Patched code will check for P10 Processor at
run-time and based on that routine will be set.
6. onnxruntime/python/onnxruntime_validation.py
[python binding ] Adding AIX check in check_distro_info()
7. onnxruntime/test/providers/cpu/generator/random_test.cc
[gcc building support] updating previous check for AIX , along with
clang. So in case of gcc, else block will hit.
8. onnxruntime/test/python/onnxruntime_test_python.py
[python binding ] powerpc check on platform.processor()
9. setup.py
[python binding ] Adding AIX check for list of libs.
2024-08-30 12:17:26 -07:00
Guenther Schmuelling
ba7baae994
Revert "Upgrade emsdk from 3.1.59 to 3.1.62" (#21817)
Reverts microsoft/onnxruntime#21421

Users are seeing chrome memory grow to 16GB before it crashes:
https://github.com/microsoft/onnxruntime/issues/21810

Revert for now so we have time to debug.
2024-08-22 11:21:00 -07:00
Tianlei Wu
fbc3927231
[CUDA] cuDNN Flash Attention (#21629)
### Description
- [x] Add cuDNN flash attention using cudnn frontend, and enable it in
MultiHeadAttention operator.
- [x] Support attention mask.
- [x] Support attention bias.
- [x] Update tests and benchmark script.

The cuDNN SDPA is disabled by default. To enable it, need the following:
(1) Requires cuDNN 9.3 or newer version installed.
(2) Set an environment variable `ORT_ENABLE_CUDNN_FLASH_ATTENTION=1` or
set `sdpa_kernel=8` cuda provider option to enable it.
(3) Only works on devices with compute capability >= 8.0.

Note that some combinations of parameters might be rejected due to
limited support of head dimension or sequence lengths.

Future Works:
(1) FP8 and BF16 APIs.  Currently, only API for FP16 are exposed.
(2) Add API to support ragged batching (padding removed in inputs).
(3) Support other input formats (like QKV_BS3NH).
(4) Currently, q are converted to BSNH, k/v are converted to either BSNH
or BNSH format. May do some experiment to see whether converting q to
BNSH could be better in some case.

### Example Benchmark Results on H100

The following tests are on FP16 MultiHeadAttention operator without
attention mask and attention bias.

#### Test Setting 1
batch_size | sequence_length | past_sequence_length | num_heads |
head_size
-- | -- | -- | -- | --
16 | 256 | 0 | 32 | 128

format | average_latency | tflops | kernel
-- | -- | -- | --
Q,K,V (BNSH) | 0.000075 | 229.5 | torch:flash
Q,K,V (BNSH) | 0.000119 | 144.8 | torch:efficient
Q,K,V (BNSH) | 0.000224 | 76.5 | torch:math
Q,K,V (BSNH) | 0.000075 | 227.8 | ort:cudnn
Q,K,V (BSNH) | 0.000094 | 182.8 | ort:flash
Q,K,V (BSNH) | 0.000138 | 124.7 | ort:efficient
Q,K,V (BSNH) | 0.000438 | 39.3 | ort:math
Q,KV | 0.000129 | 133.0 | ort:cudnn
Q,KV | 0.000151 | 114.1 | ort:flash
Q,KV | 0.000194 | 88.5 | ort:efficient
QKV | 0.000154 | 111.8 | ort:cudnn
QKV | 0.000175 | 98.0 | ort:flash
QKV | 0.000217 | 79.0 | ort:efficient

#### Test Setting 2

batch_size | sequence_length | past_sequence_length | num_heads |
head_size
-- | -- | -- | -- | --
16 | 512 | 0 | 16 | 64

format | average_latency | tflops | kernel
-- | -- | -- | --
Q,K,V (BNSH) | 0.000069 | 249.2 | torch:flash
Q,K,V (BNSH) | 0.000141 | 121.7 | torch:efficient
Q,K,V (BNSH) | 0.000294 | 58.5 | torch:math
Q,K,V (BSNH) | 0.000077 | 221.7 | ort:cudnn
Q,K,V (BSNH)  | 0.000087 | 196.6 | ort:flash
Q,K,V (BSNH)  | 0.000163 | 105.6 | ort:efficient
Q,K,V (BSNH)  | 0.000651 | 26.4 | ort:math
Q,KV | 0.000103 | 167.1 | ort:cudnn
Q,KV | 0.000117 | 146.3 | ort:flash
Q,KV | 0.000192 | 89.6 | ort:efficient
QKV | 0.000113 | 151.5 | ort:cudnn
QKV | 0.000128 | 134.7 | ort:flash
QKV | 0.000201 | 85.3 | ort:efficient
2024-08-20 08:50:22 -07:00
Scott McKay
c97cc5c1b0
Put all external project targets under the 'External' folder in VS (#21765)
### Description
<!-- Describe your changes. -->
Handle targets in subdirectories for external projects. All targets will
now go in a per-project folder under 'External'

e.g. gmock and gtest now get handled correctly and are under
External/googletest vs. existing setup where they ended up as top-level
projects.


![image](https://github.com/user-attachments/assets/99ec259c-47cd-44f3-954d-58569c941cc2)

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Improve developer experience.
2024-08-16 15:51:50 +10:00
Satya Kumar Jandhyala
6d8de1f7b8
Upgrade emsdk from 3.1.59 to 3.1.62 (#21421)
### Description
Upgrade EM SDK to 3.1.62.



### Motivation and Context
The changes are required to clear wasm64 errors.
2024-08-14 12:38:52 -07:00
Sumit Agarwal
c5592fdcef
[DML EP] Update DML to 1.15.1 (#21695)
### Description
Update DML runtime binary to 1.15.1



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-08-12 14:16:43 -07:00
Julius Tischbein
1391354265
Adding CUDNN Frontend and use for CUDA NN Convolution (#19470)
### Description
Added CUDNN Frontend and used it for NHWC convolutions, and optionally
fuse activation.

#### Backward compatible 
- For model existed with FusedConv, model can still run. 
- If ORT is built with cuDNN 8, cuDNN frontend will not be built into
binary. Old kernels (using cudnn backend APIs) are used.

#### Major Changes
- For cuDNN 9, we will enable cudnn frontend to fuse convolution and
bias when a provider option `fuse_conv_bias=1`.
- Remove the fusion of FusedConv from graph transformer for CUDA
provider, so there will not be FusedConv be added to graph for CUDA EP
in the future.
- Update cmake files regarding to cudnn settings. The search order of
CUDNN installation in build are like the following:
  * environment variable `CUDNN_PATH`
* `onnxruntime_CUDNN_HOME` cmake extra defines. If a build starts from
build.py/build.sh, user can pass it through `--cudnn_home` parameter, or
by environment variable `CUDNN_HOME` if `--cudnn_home` not used.
* cudnn python package installation directory like
python3.xx/site-packages/nvidia/cudnn
  * CUDA installation path

#### Potential Issues

- If ORT is built with cuDNN 8, FusedConv fusion is no longer done
automatically, so some model might have performance regression. If user
still wants FusedConv operator for performance reason, they can still
have multiple ways to walkaround: like use older version of onnxruntime;
or use older version of ORT to save optimized onnx, then run with latest
version of ORT. We believe that majority users have moved to cudnn 9
when 1.20 release (since the default in ORT and PyTorch is cudnn 9 for 3
months when 1.20 release), so the impact is small.
- cuDNN graph uses TF32 by default, and user cannot disable TF32 through
the use_tf32 cuda provider option. If user encounters accuracy issue
(like in testing), user has to set environment variable
`NVIDIA_TF32_OVERRIDE=0` to disable TF32. Need update the document of
use_tf32 later.

#### Follow ups
This is one of PRs that target to enable NHWC convolution in CUDA EP by
default if device supports it. There are other changes will follow up to
make it possible.
(1) Enable `prefer_nhwc` by default for device with sm >= 70. 
(2) Change `fuse_conv_bias=1` by default after more testing.
(3) Add other NHWC operators (like Resize or UpSample).

### Motivation and Context

The new CUDNN Frontend library provides the functionality to fuse
operations and provides new heuristics for kernel selection. Here it
fuses the convolution with the pointwise bias operation. On the [NVIDIA
ResNet50](https://pytorch.org/hub/nvidia_deeplearningexamples_resnet50/)
we get a performance boost from 49.1144 ms to 42.4643 ms per inference
on a 2560x1440 input (`onnxruntime_perf_test -e cuda -I -q -r 100-d 1 -i
'prefer_nhwc|1' resnet50.onnx`).

---------

Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Maximilian Mueller <maximilianm@nvidia.com>
2024-08-02 15:16:42 -07:00
Ranjit Ranjan
82b2955268
[AIX]test failure fix using gtest-1.15.0 for AIX (#21497)
### Description
Local CI setup for AIX reported tests failure after the gtest 1.15.0
upgrade.

### Motivation and Context
Below tests failure is observed after gtest upgrade.

The following tests FAILED:
	  1 - onnxruntime_test_all (ILLEGAL)
	  7 - onnxruntime_logging_apis_test (Subprocess aborted)

To fix this, I am enabling pthread support under gtest. This was
disabled with previous version of gtest for some reason.
Now by enabling this, above tests are getting passed with gtest 1.15.0.
2024-07-27 11:17:22 -07:00
Changming Sun
f70215d4e6
Update C++ dependencies (#21410)
1. Update google benchmark from 1.8.3 to 1.8.5
2. Update google test from commit in main branch to tag 1.15.0 
3. Update pybind11 from 2.12.0 to 2.13.1
4. Update pytorch cpuinfo to include the support for Arm Neoverse V2,
Cortex X4, A720 and A520.
5. Update re2 from 2024-05-01 to 2024-07-02
6. Update cmake to 3.30.1
7. Update Linux docker images
8. Fix a warning in test/perftest/ort_test_session.cc:826:37: error:
implicit conversion loses integer precision: 'streamoff' (aka 'long
long') to 'const std::streamsize' (aka 'const long')
[-Werror,-Wshorten-64-to-32]
2024-07-23 10:00:36 -07:00
Sheil Kumar
dd010edb37
Update DirectML from 1.14.1 to 1.15.0 (#21323)
Update DirectML from 1.14.1 to 1.15.0

---------

Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
Co-authored-by: Dwayne Robinson <dwayner@microsoft.com>
2024-07-22 16:59:03 -07:00
Ranjit Ranjan
6c7562b097
Enablement of onnxruntime for AIX and fixing issues related to big-endian platform. (#21133)
### Description
Enablement of onnxruntime for AIX and fixing issues related to
big-endian platform.

### Motivation and Context
changes in this PR contains:
1. Enablement code for building onnxruntime on AIX operating system.
2. while testing the build on AIX, we found issues related to big endian
platform . More details about few of those issues can be found in [Big
endian issue: Graph Transformation Attention Fusion tests are failing
#12921](https://github.com/microsoft/onnxruntime/issues/12921)

Below are list of files and the description about the change.
1.	cmake/CMakeLists.txt
[BUILDING on AIX issue] check for "IBMClang" is added for handling
-Wno-unused-parameter
2.	cmake/external/onnxruntime_external_deps.cmake
[BUILDING on AIX issue]Enabling gtest_disable_pthreads for AIX
3.	cmake/onnxruntime.cmake
[BUILDING on AIX issue]
o Blocking codes for AIX which generates generated_source.c and further
requires some symbol files.
o	Putting NO AIX check for non-supported linker flags like --Xlinker
o	iconv linking
4.	cmake/onnxruntime_framework.cmake
[BUILDING on AIX issue]Putting NO AIX check for -Wl,-rpath='$ORIGIN'
5.	cmake/onnxruntime_mlas.cmake
[BUILDING on AIX issue]POWER10 releated macro/function definition .
6.	cmake/onnxruntime_providers_cpu.cmake
[BUILDING on AIX issue]Putting NO AIX check for non-supported linker
flags like --Xlinker
7.	cmake/onnxruntime_unittests.cmake
[BUILDING on AIX issue]
o	Putting NO AIX check for non-supported linker flags like --Xlinker
o Adding required libraries for AIX linker under applicatiion like
onnxruntime_shared_lib_test ,onnxruntime_logging_apis etc
8.	cmake/patches/flatbuffers/flatbuffers.patch
[BUILDING on AIX issue] Handling of TypeCode in
include/flatbuffers/flatbuffers.h under AIX + clang
9.	onnxruntime/contrib_ops/cpu/murmur_hash3.cc
[Big endian issue] Byte-Conversion handlling in compute() and getblock()
routines
10.	onnxruntime/contrib_ops/cpu/quantization/matmul_nbits_impl.cc
[Big endian issue] Handling of test failures . Byte swapping for
quant_value.
11.	onnxruntime/core/framework/tensorprotoutils.cc
[Big endian issue]
Implementation of SetRawDataInTensorProto , ConvertRawDataInTensorProto
.
o SetRawDataInTensorProto : Wrapper for set_raw_data(). Calling
ConvertRawDataInTensorProto() in big-endian system
o ConvertRawDataInTensorProto : function used mainly on big-endian
system for byte-swapping of tensor raw_data
12.	onnxruntime/core/framework/tensorprotoutils.h
[Big endian issue]
Declaration of SetRawDataInTensorProto,  ConvertRawDataInTensorProto
13.	onnxruntime/core/graph/graph.cc
[Big endian issue]
 o	Call ConvertRawDataInTensorProto for SPARSE_TENSOR type
 o	Call ConvertRawDataInTensorProto for SaveToOrtFormat
14.	onnxruntime/core/mlas/lib/platform.cpp
[BUILDING on AIX issue] POWER10 released enablement for AIX
15.	onnxruntime/core/mlas/lib/power/qgemm_kernel_power10.cpp
[BUILDING on AIX issue]Handling of __vector under AIX+clang
16.	onnxruntime/core/mlas/lib/qgemm.h
[BUILDING on AIX issue] Adding _AIX flag
17.	onnxruntime/core/mlas/lib/qlmul.cpp
[BUILDING on AIX issue] Handling of __vector under AIX+clang
18.  onnxruntime/core/optimizer/attention_fusion.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
19.  onnxruntime/core/optimizer/compute_optimizer/shared_utils.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
20.  onnxruntime/core/optimizer/constant_folding.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
21.  onnxruntime/core/optimizer/embed_layer_norm_fusion.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
22.  onnxruntime/core/optimizer/nchwc_transformer.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
23.  onnxruntime/core/optimizer/qdq_transformer/avx2_weight_s8_to_u8.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
24.  onnxruntime/core/optimizer/qdq_transformer/qdq_s8_to_u8.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
25.  onnxruntime/core/optimizer/qdq_transformer/s8_to_u8.h
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
26.
onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_actions.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
27.  onnxruntime/core/optimizer/reshape_fusion.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
28.  onnxruntime/core/optimizer/stft_decomposition.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
29.
onnxruntime/core/optimizer/transpose_optimization/ort_optimizer_api_impl.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
30.	onnxruntime/core/platform/path_lib.h
[BUILDING on AIX issue] Moving to normal function call, instead of
template
31.	onnxruntime/core/platform/posix/env.cc
[BUILDING on AIX issue]Blocking syscall.h in AIX
32.	onnxruntime/core/session/inference_session.cc
[Big endian issue] Removing ORT_RETURN_IF_NOT, FLATBUFFERS_LITTLEENDIAN
33.	onnxruntime/test/flatbuffers/flatbuffer_utils_test.cc
[Big endian issue] Call ConvertRawDataInTensorProto in CreateInitializer
and ExternalWriteReadWithLoadInitializers
34.	onnxruntime/test/framework/sparse_kernels_test.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
35.	onnxruntime/test/framework/tensorutils_test.cc
[Big endian issue] Helper method ConvertEndianessForVector and call this
from required place.
36.	onnxruntime/test/framework/test_tensor_loader.cc
o.  [BUILDING on AIX issue] Handling of getcwd for AIX
o.  [Big endian issue]  Bytes Swapping in run_external_data_test
37.	onnxruntime/test/onnx/main.cc
[Big endian issue] including <thread> for AIX
38.	onnxruntime/test/onnx/tensorprotoutils.cc
[Big endian issue]  Bytes swapping in UnpackTensorWithRawData
39.	onnxruntime/test/optimizer/graph_transform_test.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
40.	onnxruntime/test/optimizer/graph_transform_test_builder.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
41.	onnxruntime/test/optimizer/graph_transform_test_builder.h
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
42.	onnxruntime/test/optimizer/initializer_test.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
43.	onnxruntime/test/optimizer/nchwc_optimizer_test.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
44.	onnxruntime/test/providers/base_tester.cc
[Big endian issue] Use util function SetRawDataInTensorProto, instead of
set_raw_data
45.	onnxruntime/test/providers/cpu/generator/random_test.cc
[BUILDING on AIX issue]  Adding AIX check in MultinomialGoodCase

---------

Co-authored-by: Vamshikrishna Thatikonda <vamshikrishna@in.ibm.com>
2024-07-17 12:37:06 -07:00
cloudhan
ddd4ce3cb7
[ROCm] Update ck to use ck_tile (#21030) 2024-06-19 14:06:10 +08:00
Tianlei Wu
b3fc9b5a0e
[CUDA] upgrade cutlass to 3.5.0 (#20940)
### Description
Upgrade cutlass to 3.5 to fix build errors using CUDA 12.4 or 12.5 in
Windows
- [x] Upgrade cutlass to 3.5.0.
- [x] Fix flash attention build error with latest cutlass header files
and APIs. This fix is provided by @wangyems.
- [x] Update efficient attention to use new cutlass fmha interface.
- [x] Patch cutlass to fix `hrsqrt` not found error for sm < 53.
- [x] Disable TF32 Staged Accumulation to fix blkq4_fp16_gemm_sm80_test
build error for cuda 11.8 to 12.3.
- [x] Disable TRT 10 deprecate warnings. 

The following are not included in this PR:
* TRT provider replaces the deprecated APIs.
* Fix blkq4_fp16_gemm_sm80_test build error for cuda 12.4 or 12.5. This
test is not built by default unless you add `--cmake_extra_defines
onnxruntime_ENABLE_CUDA_EP_INTERNAL_TESTS=ON` in build command.

To integrate to rel-1.18.1: Either bring in other changes (like onnx
1.16.1), or generate manifest and upload a new ONNX Runtime Build Time
Deps artifact based on rel-1.18.1.

### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/19891
https://github.com/microsoft/onnxruntime/issues/20924
https://github.com/microsoft/onnxruntime/issues/20953
2024-06-11 13:32:15 -07:00
Changming Sun
f8b5c2805e
Update abseil-cpp.cmake: add version check (#20962)
Some dev environments come with a preinstalled abseil. For example,
conda users often do that. If the preinstalled abseil version is
incompatible with what we have in cmake/deps.txt, it could result in a
hard-to-understand build error. This PR adds a version check to improve
that.
2024-06-06 19:42:31 -07:00
liqun Fu
51bc53580d
Update to onnx 1.16.1 (#20702) 2024-06-04 11:06:28 -07:00
Changming Sun
b522df0ae4
Update RE2 to the latest (#20775)
Update RE2 to the latest.

To keep the components up to date.
2024-05-23 14:30:15 -07:00
Yulong Wang
036fcd93d4
[js/web] optimize module export and deployment (#20165)
### Description

This PR make numbers of optimizations to onnxruntime-web's module export
and deployment.

See each section below for more details.

#### Preview

>
[onnxruntime-web@1.19.0-esmtest.20240513-a16cd2bd21](https://www.npmjs.com/package/onnxruntime-web/v/1.19.0-esmtest.20240513-a16cd2bd21)

> ~~onnxruntime-web@1.19.0-esmtest.20240430-c7edbcc63d~~

> ~~onnxruntime-web@1.18.0-esmtest.20240428-624c681c83~~

> ~~onnxruntime-web@1.18.0-esmtest.20240411-1abb64e894~~

<details>
<summary><h4>Breaking changes</h4></summary>

There is no code change required, but there are a few differences
regarding **code import**, **flags**, **bundler config** and
**deployment steps**.

#### Importing:

Import table is changed. See following for details.

<details>
<summary><h5>Current import table:</h5></summary>

| Target Name | Path for "import" or "require" | WebGL | JSEP | wasm |
Proxy | Training |
  |------|-----|-----|-----|-----|-----|-----|
  | `ort` (default) | `onnxruntime-web` | ✔️ |  | ✔️ | ✔️ |  |
  | `ort.all` | `onnxruntime-web/experimental` | ✔️ | ✔️ | ✔️ | ✔️ |  |
  | `ort.node` | `onnxruntime-web` |  |  | ✔️ |  |  |
| `ort.training` | `onnxruntime-web/training` |  |  | ✔️ |
✔️<sup>\[1]</sup> | ✔️ |
  | `ort.wasm` | `onnxruntime-web/wasm` |  |  | ✔️ | ✔️ |  |
  | `ort.wasm-core` | `onnxruntime-web/wasm-core` |  |  | ✔️ |  |  |
| `ort.webgl` | `onnxruntime-web/webgl` | ✔️ |  |  | ✔️<sup>\[2]</sup>
|  |
  | `ort.webgpu` | `onnxruntime-web/webgpu` |  | ✔️ | ✔️ | ✔️ |  |

* [1] didn't test. may not actually work.
* [2] not working. this is a mistake in build config.

</details>

<details>
<summary><h5>Proposed update:</h5></summary>

| Target Name | Path for "import" or "require" | WebGL | JSEP | wasm |
Proxy | Training |
  |------|-----|-----|-----|-----|-----|-----|
  | `ort` (default) | `onnxruntime-web` | ✔️ |  | ✔️ | ✔️ |  |
| `ort.all` |
~~`onnxruntime-web/experimental`~~<br/>`onnxruntime-web/all` | ✔️ | ✔️ |
✔️ | ✔️ |  |
  | `ort.node` | `onnxruntime-web` |  |  | ✔️ |  |  |
  | `ort.training` | `onnxruntime-web/training` |  |  | ✔️ | ✔️ | ✔️ |
  | `ort.wasm` | `onnxruntime-web/wasm` |  |  | ✔️ | ✔️ |  |
| ~~`ort.wasm-core`~~ | ~~`onnxruntime-web/wasm-core`~~ | ~~~~ | ~~~~
| ~~✔️~~ | ~~~~ | ~~~~ |
  | `ort.webgl` | `onnxruntime-web/webgl` | ✔️ |  |  | ~~✔️~~  |  |
  | `ort.webgpu` | `onnxruntime-web/webgpu` |  | ✔️ | ✔️ | ✔️ |  |

</details>

#### Flags:

The following flags are deprecated:
- `env.wasm.simd` (boolean): will be ignored. SIMD is always enabled in
build.

The following flags changed their type:
- `env.wasm.wasmPaths`: When using this flag as a string ( for the URL
prefix ), nothing is changed. When using this flag as an object ( for
per-file path override ), the type changed:
  ```diff
  -  export interface Old_WasmFilePaths{
  -    'ort-wasm.wasm'?: string;
  -    'ort-wasm-threaded.wasm'?: string;
  -    'ort-wasm-simd.wasm'?: string;
  -    'ort-training-wasm-simd.wasm'?: string;
  -    'ort-wasm-simd-threaded.wasm'?: string;
  -  };
  +  export interface New_WasmFilePaths {
  +    /**
  +     * Specify the override path for the main .wasm file.
  +     *
  +     * This path should be an absolute path.
  +     *
  +     * If not modified, the filename of the .wasm file is:
  +     * - `ort-wasm-simd-threaded.wasm` for default build
+ * - `ort-wasm-simd-threaded.jsep.wasm` for JSEP build (with WebGPU and
WebNN)
  +     * - `ort-training-wasm-simd-threaded.wasm` for training build
  +     */
  +    wasm?: URL|string;
  +    /**
  +     * Specify the override path for the main .mjs file.
  +     *
  +     * This path should be an absolute path.
  +     *
  +     * If not modified, the filename of the .mjs file is:
  +     * - `ort-wasm-simd-threaded.mjs` for default build
+ * - `ort-wasm-simd-threaded.jsep.mjs` for JSEP build (with WebGPU and
WebNN)
  +     * - `ort-training-wasm-simd-threaded.mjs` for training build
  +     */
  +    mjs?: URL|string;
  +  }
  ```

#### Bundler compatibility:

Config changes are need for bundlers. See usage example in
/js/web/test/e2e/ for Webpack, parcel and rollup.

#### Deployment:

- if consuming from a CDN, there is no breaking change.
- if consuming from a local server, need to copy all `ort-*.wasm` and
`ort-*.mjs` files (totally 6 files) in the dist folder. (previously only
need to copy `ort-*.wasm` files.)

</details>
<details>
<summary><h4>Problems</h4></summary>

There are a few problems with the current module export and deployment:

- Script URL cannot be correctly inferred when imported as ESM.
- Workers are forcefully encoded using Blob URL, which makes
onnxruntime-web not working in CSP environment and Node.js, when using
proxy or multi-threading feature.
- Generated JS code (by Emscripten) is encoded using
`function.toString()`, which is unstable and error-prone.
- When running with a different Emscripten build, always need the build
step. Making it difficult to swap artifacts in deveopment/debug.
</details>
<details>
<summary><h4>Goals</h4></summary>

- Full ESM support
- Support variances of ways to import. Including:
- import from HTML's `<script>` tag (IIFE format, exporting to global
variable `ort`)
    ```html
<script
src="https://example.com/cdn-path-to-onnxruntime-web/dist/ort.min.js"></script>
    ```
  - import from source code inside `<script type="module">` tag (ESM)
    ```html
    <script type="module">
import * as ort from
"https://example.com/cdn-path-to-onnxruntime-web/dist/ort.min.mjs";

      // using 'ort'
    </script>
    ```
- import in a CommonJS project (CJS format, resolve from package.json
"exports" field)
    ```js
    // myProject/main.js
    const ort = require('onnxruntime-web');
    ```
- import in an ESM project (ESM format, resolve from package.json
"exports" field)
    ```js
    // myProject/main.js (or main.mjs)
    import * as ort from 'onnxruntime-web';
    ```
- Support popular bundlers when importing onnxruntime-web into a CJS/ESM
project.
  - webpack (esm requires extra post-process step)
  - rollup
  - parcel (esm requires extra post-process step)
  - More bundlers **TBD**
- Multi-threading support for Node.js

NOTE: keeping single JavaScript file (the all-in-one bundle) is no
longer a goal. This is because technically there is a conflict with the
other requirements.
</details>

<details>
<summary><h4>Important Design Decisions</h4></summary>

- Drop support of single JavaScript output.
- The current onnxruntime-web distribution uses a single JavaScript file
to include all code. While there are a few benefits, it also creates
problems as mentioned above. Since ESM is being used more and more
widely, and browsers are making more restricted security checks and
requirement, the old Blob based solution is going to be replaced.
- To achieve the requirement, specifically, the CSP environment support,
we have to offer a non Blob based solution. Therefore, we have to
distribute multiple files and drop the single file solution.

- Do not run parser/postprocess on Emscripten generated JavaScript.
- Emscripten is evolving quickly so we should only depends on what's in
its documentation instead of a certain implementation details. (for
example, currently we patch on its code to deal with a special variable
`_scriptDir`)
  - Keep the generated files as-is also helps to:
    - reduce the size of ort.min.js
- make it easier to replace build artifacts when in development/debug

- Drop support for non-SIMD and non-MultiThread. This helps to reduce
the number of artifacts in distribution.
  - (fixed-sized) SIMD is supported in any mainstream JS environment.
- Multi-thread as WebAssembly feature is supported in any mainstream JS
environment. In some environment the feature is guarded with cross
origin policy, but it can still work if not trying to create any worker.

- Use ESM output for Emscripten generated JavaScript.
- There are 2 ways to dynamically import classic (umd) modules and
neither of them are recommended:
- dynamically creating a <script> tag. This changes the HTML structure
and have quite a lot of compatibility issue
- use `fetch()` and `eval()`. However `eval` is strongly suggested to be
avoid because there is a great perf hit.
- importing ESM is super easy - just use the `import()` call.
Considering ESM is widely supported in modern browsers and Node.js this
is the better option.

- Add Blob based solution as a fallback for cross-origin workers.
- There are still wide use case of importing onnxruntime-web from CDN.
In this usage, make it able create worker by using `fetch()`+`Blob` to
create a same-origin Blob URL.

</details>

<details>
<summary><h4>Distribution File Manifest</h4></summary>

The distribution folder contains the following files:

- WebAssembly artifacts. These files are the result of compiling the
ONNX Runtime C++ code to WebAssembly by Emscripten.

  | File Name | Build Flags |
  |------|-----|
| ort-wasm-simd-threaded.mjs <br/> ort-wasm-simd-threaded.wasm |
`--enable_wasm_simd` <br/> `--enable_wasm_threads` |
| ort-training-wasm-simd-threaded.mjs <br/>
ort-training-wasm-simd-threaded.wasm | `--enable_training_apis` <br/>
`--enable_wasm_simd` <br/> `--enable_wasm_threads` |
| ort-wasm-simd-threaded.jsep.mjs <br/> ort-wasm-simd-threaded.jsep.wasm
| `--enable_wasm_simd` <br/> `--enable_wasm_threads` <br/> `--use_jsep`
<br/> `--use_webnn` |

- onnxruntime-web JavaScript artifacts. These files are generated by
ESBuild as the entry point for onnxruntime-web.

  There are multiple build targets for different use cases:
  | Target Name | Path for "import" or "require" | Description |
  |------|-----|-----|
  | `ort` | `onnxruntime-web` | The default target. |
  | `ort.all` | `onnxruntime-web/all` | The target including webgl. |
  | `ort.node` | `onnxruntime-web` | The default target for Node.js. |
| `ort.training` | `onnxruntime-web/training` | The target including
training APIs |
| `ort.wasm` | `onnxruntime-web/wasm` | The target including only
WebAssembly (CPU) EP |
| `ort.webgl` | `onnxruntime-web/webgl` | The target including only
WebGL EP |


  For each target, there are multiple files generated:
  | File Name | Description |
  |------|-----|
| [target].js | The entry point for the target. IIFE and CommonJS
format. |
  | [target].mjs | The entry point for the target. ESM format. |
| [target].min.js <br/> [target].min.js.map | The entry point for the
target. Minimized with sourcemap. IIFE and CommonJS format. |
| [target].min.mjs <br/> [target].min.mjs.map | The entry point for the
target. Minimized with sourcemap. ESM format. |
| [target].proxy.mjs | (if appliable) The proxy ESM module for the
target. |
| [target].proxy.min.mjs <br/> [target].proxy.min.mjs.map | (if
appliable) The proxy ESM module for the target. Minimized with
sourcemap. |

</details>

<details>
<summary><h4>Dynamic Import Explained</h4></summary>

- Local Served | No Proxy:
  ```
  [Bundle or ort.min.js]
    |
    + import()--> [ort-wasm-simd-threaded.mjs]
                    |
+ WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
                    |
+ new Worker()--> [ort-wasm-simd-threaded.mjs (worker)]
                                        |
+ WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
  ```
- Local Served | Proxy:
  ```
  [Bundle or ort.min.js]
    |
    + import()--> [ort.proxy.min.mjs]
                    |
                    + new Worker()--> [ort.proxy.min.mjs (worker)]
                                        |
+ import()--> [ort-wasm-simd-threaded.mjs]
                                                        |
+ WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
                                                        |
+ new Worker()--> [ort-wasm-simd-threaded.mjs (worker)]
|
+ WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
  ```
- Cross Origin | No Proxy:
  ```
  [Bundle or ort.min.js]
    |
    + fetch('ort-wasm-simd-threaded.mjs')
        |
        + URL.createObjectURL(res.blob())
        |
        + import()--> [blob:... (ort-wasm-simd-threaded)]
                        |
+ WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
                        |
+ new Worker()--> [blob:... (ort-wasm-simd-threaded) (worker)]
                                            |
+ WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
  ```

- Cross Origin | Proxy
  ```
  [Bundle or ort.min.js]
    |
    + fetch('ort.proxy.min.mjs')
        |
        + URL.createObjectURL(res.blob())
        |
        + import()--> [blob:... (ort.proxy)]
                        |
+ new Worker()--> [blob:... (ort.proxy) (worker)]
                                            |
+ fetch('ort-wasm-simd-threaded.mjs')
                                                |
+ URL.createObjectURL(res.blob())
                                                |
+ import()--> [blob:... (ort-wasm-simd-threaded)]
                                                                |
+ WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
                                                                |
+ new Worker()--> [blob:... (ort-wasm-simd-threaded) (worker)]
|
+ WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm]
  ```
</details>
2024-05-20 09:51:16 -07:00
Scott McKay
8d09baf49f
Clarify when protobuf dependency builds protoc (#20542)
### Description
<!-- Describe your changes. -->
Currently figuring out if the protobuf dependency is building protoc it
is a little obtuse and inconsistent
* in some places we directly set protobuf_BUILD_PROTOC_BINARIES to OFF
to indicate the protobuf dependency is not building protoc
  * e.g. macOS/iOS/visionOS builds
* for a user provided protoc path we don't set
protobuf_BUILD_PROTOC_BINARIES, and inside protobuf_function.cmake that
determines if `protobuf::protoc` is added as a dependency or not
*
0dda8b0c44/cmake/external/protobuf_function.cmake (L40-L45)

To be more consistent/explicit, set protobuf_BUILD_PROTOC_BINARIES to
OFF when ONNX_CUSTOM_PROTOC_EXECUTABLE set and valid.

Remove outdated script that built and external protoc binary which was
used in later builds. The build setup will fetch a pre-built protoc so
there's no need for this additional build.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Make it easier to figure out if protoc is coming from the protobuf
dependency.
2024-05-08 08:30:11 +10:00
Yulong Wang
a457c1df80
upgrade emsdk to 3.1.57 (#20295)
### Description
upgrade emsdk to 3.1.57
2024-04-19 23:05:18 -07:00
Patrice Vignola
12569626cb
Update DML to 1.14.1 (#20380)
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-04-18 22:43:41 -07:00
Patrice Vignola
745b426c60
[DML] Update DML to 1.14 (#20304)
I am prefiring this change to pre-run the non-dml checks, and also to
give folks the time to review it before DML gets released. When DML 1.14
officially releases, we'll only need to run the DML pipeline to
automatically pick up the nuget package. This should save us some
valuable time.

Note that DML 1.14 is the release needed for ORT 1.17.4, and DML 1.15
will come soon after.
2024-04-18 16:22:57 -07:00
liqun Fu
cd7112f800
Integration with ONNX 1.16.0 (#19745)
### Description
update with ONNX 1.16.0 branch according to
https://github.com/microsoft/onnxruntime/blob/main/docs/How_To_Update_ONNX_Dev_Notes.md

ONNX 1.16.0 release notes:
https://github.com/onnx/onnx/releases/tag/v1.16.0

#### Updated ops for CPU EP:
- DequantizeLinear(21)
  - Added int16 and uint16 support + various optimizer tests
  - Missing int4 and uint4 support
  - Missing block dequantization support
- QuantizeLinear(21)
  - Added int16 and uint16 support + various optimizer tests
  - Missing int4 and uint4 support
  - Missing block quantization support
- Cast(21)
  - Missing int4 and uint4 support
- CastLike(21)
  - Missing int4 and uint4 support
- ConstantOfShape(21)
  - Missing int4 and uint4 support
- Identity(21)
  - Missing int4 and uint4 support
- If(21)
  - Missing int4 and uint4 support
- Loop(21)
  - Missing int4 and uint4 support
- Reshape(21)
  - Missing int4 and uint4 support
- Scan(21)
  - Missing int4 and uint4 support
- Shape(21)
  - Missing int4 and uint4 support
- Size(21)
  - Missing int4 and uint4 support
- Flatten(21)
- Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4
support
- Pad(21)
- Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4
support
- Squeeze(21)
- Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4
support
- Transpose(21)
- Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4
support
- Unsqueeze(21)
- Missing float8e4m3fnuz, float8e5m2, float8e5m2fnuz, int4, and uint4
support

#### Unimplemented opset 21 features/ops
- int4 and uint4 data type
- QLinearMatMul(21)
- GroupNormalization(21)
- ai.onnx.ml.TreeEnsemble(5)

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

### Disabled tests
#### ORT Training

orttraining/orttraining/test/python/orttraining_test_ort_apis_py_bindings.py
- test_ort_custom_ops: Potential shape inference bug for custom ops

#### Python quantization unit tests
test/onnx/python/quantization (shape inference bug)
- test_op_conv_transpose.py: test_quantize_conv_transpose_u8u8_fp16
- test_op_conv_transpose.py: test_quantize_conv_transpose_s8s8_fp16
- test_op_gemm.py: test_quantize_qop_gemm_s8s8
- test_op_gemm.py: test_quantize_qop_gemm_e4m3fn_same
 - test_op_gemm.py: test_quantize_qop_gemm_e4m3fn_p3
- test_op_matmul.py: test_quantize_matmul_u8u8_f16
- test_op_matmul.py: test_quantize_matmul_s8s8_f16
- test_op_matmul.py: test_quantize_matmul_s8s8_f16_entropy
- test_op_matmul.py: test_quantize_matmul_s8s8_f16_percentile
- test_op_matmul.py: test_quantize_matmul_s8s8_f16_distribution
- test_op_relu.py: test_quantize_qop_relu_s8s8

#### ONNX tests
- test_maxpool_2d_ceil_output_size_reduce_by_one: ONNX 1.16.0 fixed a
maxpool output size bug and added this test. Enable this test when [ORT
PR](https://github.com/microsoft/onnxruntime/pull/18377) is merged.
Refer to original [ONNX PR](https://github.com/onnx/onnx/pull/5741).
- test_ai_onnx_ml_tree_ensemble_set_membership_cpu: new unimplemented op
ai.onnx.ml.TreeEnsemble
- test_ai_onnx_ml_tree_ensemble_single_tree_cpu: same
- test_ai_onnx_ml_tree_ensemble_set_membership_cuda: same
- test_ai_onnx_ml_tree_ensemble_single_tree_cuda: same
- test_cast_INT4_to_FLOAT_cpu: ORT Cast(21) impl doesn't support int4
yet
- test_cast_INT4_to_INT8_cpu: same
- test_cast_UINT4_to_FLOAT_cpu: same
- test_cast_UINT4_to_UINT8_cpu: same
- test_cast_INT4_to_FLOAT_cuda
- test_cast_INT4_to_INT8_cuda
- test_cast_UINT4_to_FLOAT_cuda
- test_cast_UINT4_to_UINT8_cuda
- test_constantofshape_float_ones_cuda: ConstantOfShape(21) not
implemented for cuda
- test_constantofshape_int_shape_zero_cuda: same
- test_constantofshape_int_zeros_cuda: same
- test_flatten_axis0_cuda: Flatten(21) not implemented for cuda
- test_flatten_axis1_cuda: same
- test_flatten_axis2_cuda: same
- test_flatten_axis3_cuda: same
- test_flatten_default_axis_cuda: same
- test_flatten_negative_axis1_cuda: same
- test_flatten_negative_axis2_cuda: same
- test_flatten_negative_axis3_cuda: same
- test_flatten_negative_axis4_cuda: same
- test_qlinearmatmul_2D_int8_float16_cpu: QLinearMatMul(21) for onnx not
implemented in ORT yet
- test_qlinearmatmul_2D_int8_float32_cpu: same
- test_qlinearmatmul_2D_uint8_float16_cpu: same
- test_qlinearmatmul_2D_uint8_float32_cpu: same
- test_qlinearmatmul_3D_int8_float16_cpu: same
- test_qlinearmatmul_3D_int8_float32_cpu: same
- test_qlinearmatmul_3D_uint8_float16_cpu: same
- test_qlinearmatmul_3D_uint8_float32_cpu: same
- test_qlinearmatmul_2D_int8_float16_cuda: same
- test_qlinearmatmul_2D_int8_float32_cuda: same
- test_qlinearmatmul_2D_uint8_float16_cuda: same
- test_qlinearmatmul_2D_uint8_float32_cuda: same
- test_qlinearmatmul_3D_int8_float16_cuda: same
- test_qlinearmatmul_3D_int8_float32_cuda: same
- test_qlinearmatmul_3D_uint8_float16_cuda: same
- test_qlinearmatmul_3D_uint8_float32_cuda: same
- test_size_cuda: Size(21) not implemented for cuda
- test_size_example_cuda: same
- test_dequantizelinear_blocked: Missing implementation for block
dequant for DequantizeLinear(21)
- test_quantizelinear_blocked_asymmetric: Missing implementation for
block quant for QuantizeLinear(21)
- test_quantizelinear_blocked_symmetric: Missing implementation for
block quant for QuantizeLinear(21)

---------

Signed-off-by: liqunfu <liqun.fu@microsoft.com>
Signed-off-by: Ganesan Ramalingam <grama@microsoft.com>
Co-authored-by: Ganesan Ramalingam <grama@microsoft.com>
Co-authored-by: George Wu <jywu@microsoft.com>
Co-authored-by: adrianlizarraga <adlizarraga@microsoft.com>
2024-04-12 09:46:49 -07:00
Jeff Bloomfield
2f31560430
Enable generic feature level devices in DML EP (#20114)
### Description
Enable NPUs supporting DXCORE_ADAPTER_ATTRIBUTE_D3D12_GENERIC_ML and
D3D_FEATURE_LEVEL_1_0_GENERIC with DML EP. This also begins ingesting DX
headers through the DirectX-Headers repo.

Note that this includes an update to cgamanifest.json for onnx-tensorrt
which is triggered during re-generation due to a prior changes to
deps.txt.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-03-29 14:37:30 -07:00
Changming Sun
dafbef3a21
CMake: support reading dependency zip files from a local mirror (#20005)
### Description
To test this feature, run 
```bat
python cmake\deps_update_and_upload.py --root-path mirror
```
Then run build.py as usual. 

The zip files will be cached local. To avoid being downloaded again and
again.
2024-03-21 17:58:59 -07:00
Yufeng Li
15219e2e71
turn on neural_speed by default (#19627)
### Description
<!-- Describe your changes. -->
the crash caused by the neural_speed turns out to be a very corn case.
Turn it on by default.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-03-20 12:49:58 -07:00
Adam Louly
32558134a9
[On-Device-Training] Upgrade Flatbuffers to Support 2GB+ Checkpoints. (#19770)
### Description
Modifications to support 2GB+ checkpoint & Upgrading Flatbuffers


### Motivation and Context
This PR includes changes that will make ort handle 2GB+ checkpoints.
To do that we need to upgrade flatbuffers to 23.5.9 -
https://github.com/google/flatbuffers/pull/7945

- Modified the commitHash and the hash for the new version
- Removed the patch for rust generator's unused variable warning as it
is no longer producing this - [Check it out
here](d121e09d89/src/idl_gen_rust.cpp)
- Updated the VerifyField calls with alignment values that were
introduced in the new version.

---------

Co-authored-by: Sumit Agarwal <sumitagarwal@microsoft.com>
2024-03-14 16:36:24 -07:00
Changming Sun
1fb6cbddee
Add a build patch for Windows ARM64EC (#19898)
### Description
Add a patch for Windows ARM64EC


### Motivation and Context
Will need more changes in onnxruntime/core/common/cpuid_arch_definition.h and onnxruntime/core/common/cpuid_info.cc
2024-03-14 08:50:42 -07:00
Scott McKay
978c40d853
Make partitioning utils QDQ aware so it does not break up QDQ node units (#19723)
### Description
<!-- Describe your changes. -->
If the EP handles QDQ node units, we need to make sure we do not split
those into different partitions.

Update the partitioning utils to be QDQ aware. If there are node units
we process the logical nodes they represent instead of individual nodes.
This ensure we process all nodes in a QDQ node unit at the same time so
that they are always in the same partition.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fix one of the issues in #19590

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2024-03-12 10:55:49 +10:00
Changming Sun
a0521f899e
Enable CPUINFO for all Windows build (#19655)
### Description
It was disabled in PR #9065. And the reason was:
" api-ms-win-core-kernel32-legacy-*.dll wasn't available in Windows 8
and was added in Windows 10, so cpuinfo breaks our Windows 8 support.
I'm disabling it again."

We no longer support Windows 8.  Therefore we can add CPUINFO back.

### Motivation and Context
To make the code simpler. If in any case the library doesn't work as
expected, we can submit a PR to their code base and fix it.
2024-03-01 16:23:20 -08:00
Maximilian Müller
c20ced4132
Use CMake's find package for CUDA libs (#19673)
### Description
Answers issue #19640 
More details are in the issue, basically I am changing all the include
directory and link directory usage to CMake's `CUDA::*` targets
2024-02-27 11:26:48 -08:00
Changming Sun
1007d8f3d1
Revert "Revert NeuralSpeed code for x64 MatMulNBits (#19382)" (#19474)
This reverts commit 0d10c7f3c1.
2024-02-09 09:24:54 -08:00
luoyu-intel
0d10c7f3c1
Revert NeuralSpeed code for x64 MatMulNBits (#19382)
### Description
<!-- Describe your changes. -->
Revert PR#19016 https://github.com/microsoft/onnxruntime/pull/19016
Revert PR#17669 https://github.com/microsoft/onnxruntime/pull/17669
2024-02-07 13:04:37 -08:00
Scott McKay
debd1cab10
Add coremltools 7.1 as a dependency (#19389)
### Description
<!-- Describe your changes. -->
Setup usage of coremltools via dependencies instead of copying files. 
Pull in some changes from
https://github.com/microsoft/onnxruntime/pull/19347 in preparation for
supporting ML Program and enabling building the ML Model on all
platforms to make development and testing of CoreML EP code easier.

- Update to coremltools 7.1 
- Add patch for changes required for cross platform build of ML Program
related code
- Generate coreml proto files on all platforms
- mainly to test these changes work everywhere, as the proto files will
be used on all platforms when #19347 is checked in
- rename onnxruntime_coreml_proto target to coreml_proto as it contains
purely coreml protobuf code with no ORT related chagnes

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Improve setup.
2024-02-03 09:42:21 +10:00
He Li
1bdd7d9499
Update oneDNN to v3.0.1 in order to support gcc 13 (#19344)
### Description

Update the dependency of `oneDNN` to v3.0.1, which fixes a minor bug
hindering gcc 13.

### Motivation and Context


Referring to
[oneDNN-1548](https://github.com/oneapi-src/oneDNN/issues/1548).

- When building with `--use_dnnl` using gcc 13.x, it will fail due to
this upstream issue.
- This is fixed in `v3.0.1`
[tag](https://github.com/oneapi-src/oneDNN/tree/v3.0.1) by [this
commit](1d7971ce48).
2024-02-01 15:39:03 -08:00
Tianlei Wu
8b4517218b
Remove USE_CUTLASS flag (#19271)
### Description
Since Cutlass can be built with CUDA 11.4 (The minimum CUDA version for
onnxruntime CUDA build), there is no need to have a flag to disable
cutlass.

Changes:
(1) Reverted https://github.com/microsoft/onnxruntime/pull/18761
(2) remove the condition to build cutlass.
(3) Fix a few build errors or warnings during testing CUDA 11.4 build. 

Note that SM 89 and 90 (including fp8) requires CUDA 11.8 or later.
Flash attention and cutlass fused multihead attention will not be built
for CUDA < 11.6. It is recommended to use CUDA 11.8 or above to build if
you want to support latest GPUs.

It is better to include it in 1.17.0 (otherwise, the release branch
might encounter build failure with CUDA 11.4).

Tests:
(1) Build with flash attention and efficient attention off: **passed**
(2) Build with CUDA 11.4: **passed**

Example build command used in Ubuntu 20.04:
```
export CUDA_HOME=/usr/local/cuda-11.4
export CUDNN_HOME=/usr/lib/x86_64-linux-gnu/
export CUDACXX=/usr/local/cuda-11.4/bin/nvcc

sh build.sh --config Release  --build_shared_lib --parallel  --use_cuda --cuda_version 11.4 \
            --cuda_home $CUDA_HOME --cudnn_home $CUDNN_HOME --build_wheel --skip_tests \
            --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=80 \
            --disable_types float8
```

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
2024-01-25 16:57:58 -08:00
Phoebe Chen
4477f57ee3
Enable RISC-V 64-bit Cross-Compiling Support for ONNX Runtime on Linux (#19238)
### Description  
This pull request introduces the necessary changes to enable RISC-V
64-bit cross-compiling support for the ONNX Runtime on Linux. The RISC-V
architecture has gained popularity as an open standard instruction set
architecture, and this contribution aims to extend ONNX Runtime's
compatibility to include RISC-V, thereby broadening the reach of ONNX
models to a wider range of devices.

### Motivation and Context
RISC-V is a free and open-source instruction set architecture (ISA)
based on established RISC principles. It is provided under open licenses
without fees. Due to its extensibility and freedom in both software and
hardware, RISC-V is poised for widespread adoption in the future,
especially in applications related to AI, parallel computing, and data
centers.

### Example Build Command
```
./build.sh --parallel --config Debug --rv64 --riscv_toolchain_root=/path/to/toolchain/root --skip_tests
```

### Documentation Updates
Relevant sections of the documentation will be updated to reflect the
newly supported RISC-V 64-bit cross-compilation feature.
https://github.com/microsoft/onnxruntime/pull/19239

---------

Signed-off-by: Phoebe Chen <phoebe.chen@sifive.com>
2024-01-24 16:27:05 -08:00