Commit graph

1614 commits

Author SHA1 Message Date
Jeff Bloomfield
2f31560430
Enable generic feature level devices in DML EP (#20114)
### Description
Enable NPUs supporting DXCORE_ADAPTER_ATTRIBUTE_D3D12_GENERIC_ML and
D3D_FEATURE_LEVEL_1_0_GENERIC with DML EP. This also begins ingesting DX
headers through the DirectX-Headers repo.

Note that this includes an update to cgamanifest.json for onnx-tensorrt
which is triggered during re-generation due to prior changes to
deps.txt.

### Motivation and Context
2024-03-29 14:37:30 -07:00
Ye Wang
17919717b5
add QMoE (#20108)
### Description
1. Introduce the latest cutlass extension from TRTLLM, which gives us the
opportunity to upgrade cutlass (to 3.4) from the MoE side.
2. Fix Windows build issue
3. Add Int4 MoE op and unit tests



### Motivation and Context
2024-03-29 10:24:19 -07:00
Dmitri Smirnov
b95fd4e644
Enable CUDA EP unit testing on Windows (#20039)
### Description
Address build issues and source code discrepancies.
Fix cuda_test_provider gtest argument stack corruption.

### Motivation and Context
`OpTester` class that is widely used for kernel testing is not
suitable for testing internal classes for EPs that are built as shared
objects.
Currently, CUDA EP tests run only on Linux.
We want to enable testing and development on Windows,
and create a usable pattern for testing of other EPs internals.

Alternatives considered: 
Abstracting EP unit tests into separate test executable such as
`onnxruntime_test_all`.
This alternative was rejected as it would create a lot more changes in
the established patterns,
and potentially interfere with CUDA functionality with more complex
source code maintenance.
2024-03-27 13:32:36 -07:00
Dmitri Smirnov
3076b56947
Make MS Debug engine SymInitialize() called as needed. (#20036)
### Description
Initialize Symbol engine as needed with no duplicate calls.

### Motivation and Context
  Currently the abseil library may call SymInitialize more than once
  when shared libraries are involved. However, it can be called
  only once per process. Our debug_alloc may also call it
  when enabled. This change enables initialization to proceed
  only when needed, with no duplicate effort.
2024-03-22 16:17:47 -07:00
sfatimar
eab35c20fc
Ort openvino npu 1.17 master (#19966)
### Description
Add NPU to the list of supported devices.
Added changes to support OV 2024.0.
NuGet packages no longer package the OpenVINO DLL.
Bug fixes for the Python API.
Reverted Dockerfiles that are not being maintained.



### Motivation and Context
NPU Device has been introduced by Intel in latest client systems
OpenVINO 2024.0 release is out.

---------

Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: Ubuntu <ubuntu@ubuntu-118727.iind.intel.com>
Co-authored-by: hmamidix <hemax.sowjanya.mamidi@intel.com>
Co-authored-by: vthaniel <vishnudas.thaniel.s@intel.com>
Co-authored-by: saurabhkale17 <saurabh1.kale@intel.com>
2024-03-21 18:44:00 -07:00
Changming Sun
dafbef3a21
CMake: support reading dependency zip files from a local mirror (#20005)
### Description
To test this feature, run 
```bat
python cmake\deps_update_and_upload.py --root-path mirror
```
Then run build.py as usual. 

The zip files will be cached locally to avoid being downloaded again and
again.
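
The caching idea can be sketched in a few lines of Python. This is only an illustration of "download once, then reuse the local copy", not the logic used by `deps_update_and_upload.py` or the CMake scripts; the `resolve_dependency` helper, its parameters, and the injected `download` callable are all hypothetical.

```python
from pathlib import Path
from typing import Callable


def resolve_dependency(name: str, url: str, mirror_root: Path,
                       download: Callable[[str], bytes]) -> Path:
    """Return a local path for a dependency archive, downloading at most once.

    Hypothetical layout: the archive is stored as <mirror_root>/<name>.zip.
    """
    mirror_root.mkdir(parents=True, exist_ok=True)
    cached = mirror_root / f"{name}.zip"
    if not cached.exists():
        # Cache miss: fetch once and keep the bytes for later builds.
        cached.write_bytes(download(url))
    return cached
```

Passing the downloader in as a callable keeps the sketch testable without network access.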
2024-03-21 17:58:59 -07:00
Yufeng Li
15219e2e71
turn on neural_speed by default (#19627)
### Description
The crash caused by neural_speed turns out to be a rare corner case.
Turn it on by default.


### Motivation and Context
2024-03-20 12:49:58 -07:00
Rachel Guo
6b305f95e0
Support xcframework for mac catalyst builds. (#19534)
### Description



### Motivation and Context

MAUI on macOS uses mac-catalyst which requires a different native
binary.

---------

Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
2024-03-20 10:55:19 -07:00
mindest
3dfe4a5e6d
[ROCm] Remove MPI dependency and collectives to use NCCL (#19830)
### Description
* Remove MPI dependency to use NCCL AllReduce, etc.
* Exclude unsupported collectives in hipify
2024-03-19 17:35:18 -07:00
Ted Themistokleous
6bb64683f8
Use version instead of version-dev for ROCm (#19967) 2024-03-19 10:40:40 +08:00
Adam Louly
32558134a9
[On-Device-Training] Upgrade Flatbuffers to Support 2GB+ Checkpoints. (#19770)
### Description
Modifications to support 2GB+ checkpoint & Upgrading Flatbuffers


### Motivation and Context
This PR includes changes that will make ort handle 2GB+ checkpoints.
To do that we need to upgrade flatbuffers to 23.5.9 -
https://github.com/google/flatbuffers/pull/7945

- Modified the commitHash and the hash for the new version
- Removed the patch for rust generator's unused variable warning as it
is no longer producing this - [Check it out
here](d121e09d89/src/idl_gen_rust.cpp)
- Updated the VerifyField calls with alignment values that were
introduced in the new version.

---------

Co-authored-by: Sumit Agarwal <sumitagarwal@microsoft.com>
2024-03-14 16:36:24 -07:00
Changming Sun
1fb6cbddee
Add a build patch for Windows ARM64EC (#19898)
### Description
Add a patch for Windows ARM64EC


### Motivation and Context
Will need more changes in onnxruntime/core/common/cpuid_arch_definition.h and onnxruntime/core/common/cpuid_info.cc
2024-03-14 08:50:42 -07:00
Jeff Daily
9443366009
[ROCm] fix build failure when nccl is enabled (#19900)
Building onnxruntime ROCm EP with --enable_nccl --use_mpi fails due to
inclusion of MOE source files but MOE is not supported. The error
observed is

`error: contrib_ops/rocm/moe/ft_moe/moe_kernel.h: No such file or
directory`

The fix is to exclude collective/sharded_moe.* files when nccl is
requested.
2024-03-13 21:16:54 -07:00
Adrian Lizarraga
9c3242ab70
[QNN EP] Copy security catalog file for HtpV73Skel.so from QNN SDK (#19903)
### Description
Copies the `QNN_HOME/lib/hexagon-v73/unsigned/libqnnhtpv73.cat` file
from QNN SDK to the unittest build directory. This is necessary in order
to be able to load the `libQnnHtpV73Skel.so` file on Windows for modern
versions of QNN SDK.

### Motivation and Context
A [digitally-signed catalog
file](https://learn.microsoft.com/en-us/windows-hardware/drivers/install/catalog-files)
(.cat) can be used as a digital signature for an arbitrary collection of
files.
2024-03-13 20:52:59 -07:00
Jake Mathern
18ad8587a6
[CP] Fix for xfgcheck and Fix WAI ARM64 build (#19634) (#19644)
### Description
Fix WAI build by only conditionally copying linker flags



### Motivation and Context
I broke the WAI build that contains ORT on ARM64
2024-03-13 17:54:06 -07:00
Edward Chen
860eb762c2
[Apple framework] Fix minimal build with training enabled. (#19858)
Fix some linker errors that come up when integrating the onnxruntime-training-c pod into another Xcode project. The problematic configuration is a minimal build with training APIs enabled.
- training_op_defs.o had some unresolved references to ONNX functions. It should not be included at all in a minimal build.
- tree_ensemble_helper.o also had unresolved references to ONNX ParseData. The containing function is unused in a minimal build.

Added a test to cover this configuration.
2024-03-12 11:33:30 -07:00
Scott McKay
978c40d853
Make partitioning utils QDQ aware so it does not break up QDQ node units (#19723)
### Description
If the EP handles QDQ node units, we need to make sure we do not split
those into different partitions.

Update the partitioning utils to be QDQ aware. If there are node units
we process the logical nodes they represent instead of individual nodes.
This ensures we process all nodes in a QDQ node unit at the same time so
that they are always in the same partition.
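
A rough sketch of the idea, assuming nodes arrive in topological order and each node belongs to exactly one node unit. The names `partition_by_node_unit`, `unit_of`, and `is_supported` are illustrative only and are not the ORT partitioning utils API; the real utilities also account for graph dependencies, which this sketch ignores.

```python
from typing import Callable, Dict, List, Sequence


def partition_by_node_unit(
    nodes: Sequence[str],
    unit_of: Dict[str, int],
    is_supported: Callable[[int], bool],
) -> List[List[str]]:
    """Group nodes into partitions without splitting a node unit."""
    # Collect the members of each unit, preserving the incoming node order.
    members: Dict[int, List[str]] = {}
    for node in nodes:
        members.setdefault(unit_of[node], []).append(node)

    partitions: List[List[str]] = []
    current: List[str] = []
    seen = set()
    for node in nodes:
        unit = unit_of[node]
        if unit in seen:
            continue  # Already handled together with the rest of its unit.
        seen.add(unit)
        if is_supported(unit):
            # Add every node of the unit at once so they share a partition.
            current.extend(members[unit])
        elif current:
            # An unsupported unit closes the partition being built.
            partitions.append(current)
            current = []
    if current:
        partitions.append(current)
    return partitions
```

The decision is made once per unit rather than once per node, so a DQ -> op -> Q group is either taken whole or left whole.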

### Motivation and Context
Fix one of the issues in #19590

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2024-03-12 10:55:49 +10:00
Changming Sun
efad5bbc5a
Replace some old file system calls with C++17 std::filesystem APIs. (#19196)
### Description
1. Replace some old file system calls to use C++17 std::filesystem APIs.
2. Remove the tensorflow_C_PACKAGE_PATH cmake option, which was only used in
onnxruntime_perf_test and whose code is no longer maintained.
3. Excludes onnx_test_runner and onnxruntime_perf_test from iOS build
because C++17 filesystem library is not available there
2024-03-09 09:17:36 -08:00
Scott McKay
db59cec82f
Don't reduce warning level for CUDA build on Windows (#19663)
### Description
Address warnings so all the ORT projects build with /W4 on Windows.

Mainly 
- unused parameters
- variables shadowing other ones

### Motivation and Context
#19588 started on this.
2024-03-06 15:03:55 +10:00
Chi Lo
d9730c7f43
[TensorRT EP] Fix bug for DDS output handling for empty tensor (#19575)
When the DDS output is an empty tensor (i.e. any of its dimensions is 0),
the TRT EP performs neither cudaMemcpyAsync() nor cuda::Impl_Cast(), to
prevent accidentally overwriting other locations that might belong to
other tensors.

This PR also refactors the code to only allocate single bytes for all
empty tensors.
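
The emptiness rule above can be illustrated in Python. This is a sketch of the check only, not the TRT EP's C++ code; `is_empty_tensor` and `bytes_to_copy` are hypothetical helper names.

```python
from math import prod
from typing import Sequence


def is_empty_tensor(shape: Sequence[int]) -> bool:
    """A tensor is empty when any of its dimensions is 0."""
    return any(dim == 0 for dim in shape)


def bytes_to_copy(shape: Sequence[int], element_size: int) -> int:
    """Number of bytes a copy would move; 0 means the copy is skipped."""
    if is_empty_tensor(shape):
        return 0
    return prod(shape) * element_size
```

For example, a shape of `(2, 0, 3)` is empty and results in no copy, while `(2, 3)` with 4-byte elements moves 24 bytes.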

#TODO: add unit tests to cover the DDS code paths or doing more testing
with concurrent,sequential, threaded faster-rcnn using onnx_test_runner
and verifying outputs

---------

Co-authored-by: Chi Lo <lochi@microsoft.com>
2024-03-05 14:39:36 -08:00
Chen Fu
06e684c9f2
Adding cuda kernel (optimized for sm80) for block-wise 4b quantized float 16 GEMM. (#18619)
### Description
Adding CUDA kernel for block-wise 4b quantized float 16 GEMM, this is
specially optimized for Nvidia Ampere GPUs.


### Motivation and Context
Trying to improve quantized LLM inference performance on Nvidia Ampere
GPUs

### Note:
This is implemented by extending CUTLASS, so it has a hard dependency on
CUTLASS. However, in current build system, loading of CUTLASS dependency
is guarded with:

(onnxruntime_USE_FLASH_ATTENTION OR
onnxruntime_USE_MEMORY_EFFICIENT_ATTENTION)

If both of these options are turned off, then compilation will fail.

Why is the CUTLASS dependency guarded at all? It's a header-only
library that does not introduce any binary code if not instantiated. What's
the downside of removing all the guards and just including CUTLASS
unconditionally?
2024-03-05 09:37:45 -08:00
Changming Sun
a0521f899e
Enable CPUINFO for all Windows build (#19655)
### Description
It was disabled in PR #9065. And the reason was:
" api-ms-win-core-kernel32-legacy-*.dll wasn't available in Windows 8
and was added in Windows 10, so cpuinfo breaks our Windows 8 support.
I'm disabling it again."

We no longer support Windows 8.  Therefore we can add CPUINFO back.

### Motivation and Context
To make the code simpler. If in any case the library doesn't work as
expected, we can submit a PR to their code base and fix it.
2024-03-01 16:23:20 -08:00
Edward Chen
5672cdebdf
Update google benchmark to 1.8.3. (#19734)
Update google benchmark to 1.8.3.
Update deps_update_and_upload.py script to make it easier to use.
2024-03-01 11:01:58 -08:00
Scott McKay
2a857d9a86
Add ML Program support for more operators (#19527)
### Description

Add support for:
- Clip/Relu/Relu6
- Add/Mul/Div/Sub/Pow
- GlobalAveragePool/GlobalMaxPool/AveragePool/MaxPool
- Reshape
- Gemm/MatMul

Fix some build issues/warnings from changes.

Fix a couple of potential issues with the Resize op as well (noticed due
to change to reject inputs with empty data at a higher level).

### Motivation and Context
Enable mobilenetv2 with ML Program
2024-03-01 10:23:29 +10:00
Maximilian Müller
c20ced4132
Use CMake's find package for CUDA libs (#19673)
### Description
Answers issue #19640 
More details are in the issue. Basically, I am changing all the include
directory and link directory usage to CMake's `CUDA::*` targets.
2024-02-27 11:26:48 -08:00
cloudhan
1e69b61238
Make version string detection more robust (#19615)
`/opt/rocm/.info/version-dev` is only available if the `rocm-dev`
metapackage is installed. Installing it brings in a lot of packages that
users do not need, and they may opt for fine-grained control instead.
Fall back to `rocm_version.h` in case `rocm-dev` is not installed.
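
The fallback order can be sketched as follows. This is an illustration rather than the code in this commit: the `detect_rocm_version` helper is hypothetical, and the header location (`include/rocm_version.h`) and the `ROCM_VERSION_MAJOR`/`MINOR`/`PATCH` macro names are assumptions about the ROCm install layout.

```python
import re
from pathlib import Path
from typing import Optional


def detect_rocm_version(rocm_home: Path) -> Optional[str]:
    """Prefer .info/version-dev, then fall back to rocm_version.h."""
    version_file = rocm_home / ".info" / "version-dev"
    if version_file.is_file():
        return version_file.read_text().strip()

    # Assumed header path and macro names; adjust for the actual install.
    header = rocm_home / "include" / "rocm_version.h"
    if header.is_file():
        text = header.read_text()
        parts = []
        for key in ("MAJOR", "MINOR", "PATCH"):
            match = re.search(rf"#define\s+ROCM_VERSION_{key}\s+(\d+)", text)
            if match is None:
                return None
            parts.append(match.group(1))
        return ".".join(parts)
    return None
```

Returning `None` when neither source is usable leaves it to the caller to decide whether that is an error.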
2024-02-27 16:06:06 +08:00
Changming Sun
9ccdc4961a
Stop using apiset in OneCore build: use onecoreuap.lib instead of onecoreuap_apiset.lib (#19632)
### Description
Stop using apiset in OneCore build: use onecoreuap.lib instead of
onecoreuap_apiset.lib in onecore build.


### Motivation and Context
1. Now all Windows Editions come with Reverse Forwarders. We should just
use the normal onecore libs.
2. Many new Windows APIs are only available in [windows umbrella
libraries](https://learn.microsoft.com/en-us/windows/win32/apiindex/windows-umbrella-libraries).
So these libraries are not specific for Windows CoreOS or Onecore.
3. Going forward we should use "IsApiSetImplemented" to guard our API
usages:

https://learn.microsoft.com/en-us/windows/win32/apiindex/detect-api-set-availability
.

After this change, our built binaries can pass apivalidator's check.

```
C:\local\apivalidator>apivalidator.exe -BinaryPath:C:\src\onnxruntime\b\Debug\Debug\onnxruntime.dll -SupportedApiXmlFiles:onecoreuap_DDIs.xml
ApiValidation:
Summary:
        "C:\src\onnxruntime\b\Debug\Debug\onnxruntime.dll" is Universal


ApiValidation: All binaries are Universal
```
This gives us an easy way to test ONNX Runtime's compatibility with
Windows versions.
2024-02-23 22:31:57 -08:00
cao lei
f430600432
Enable streams for DML EP. This change is to revert PR 19481 since the bug 19480 is fixed by PR 19515 (#19609)
### Description
Enable streams for DML EP. This change is to revert PR 19481 since the
bug 19480 is fixed by PR 19515


### Motivation and Context
Enable streams for DML EP. This change is to revert PR 19481 since the
bug 19480 is fixed by PR 19515
2024-02-23 06:02:05 -08:00
pengwa
ae92d593c0
ONNX Gelu Op in Opset 20 (#19560)
### ONNX Gelu Op in Opset 20

Refactor code to support MSDomain Gelu and ONNX Gelu-opset20 Op

1. Move CPU-GELU implementation from
`onnxruntime/contrib_ops/cpu/activations.h/cc` to
`onnxruntime/core/providers/cpu/tensor/gelu.h/cc`, as the implementation
for approximate attribute to be 'none'.
2. Duplicate some logic from
`onnxruntime/contrib_ops/cpu/bert/bias_gelu.cc` to
`onnxruntime/core/providers/cpu/tensor/gelu.h/cc`, as the implementation
for approximate attribute to be 'tanh'.
3. Register ONNX domain Gelu CPU kernel from opset 20 in
`onnxruntime/core/providers/cpu/cpu_execution_provider.cc`.
4. Move `onnxruntime/contrib_ops/cuda/bert/fast_gelu_impl.h/cu` to
`onnxruntime/core/providers/cuda/tensor/gelu_impl.h` and
`onnxruntime/core/providers/cuda/tensor/gelu_approximate_impl.cu`
respectively, as the implementation for approximate attribute to be
'tanh'.
5. Implement the logic for approximate attribute to be 'none' in
`onnxruntime/core/providers/cuda/tensor/gelu_impl.cu`.
6. Register ONNX domain Gelu CUDA kernel from opset 20 in
`onnxruntime/core/providers/cuda/cuda_execution_provider.cc`.
7. ROCM ep related changes. 
8. Enrich the tests for ONNX domain Gelu in
`onnxruntime/test/providers/cpu/activation/activation_op_test.cc`.
2024-02-23 11:05:16 +08:00
PeixuanZuo
6226c5f62f
[ROCm] Add SkipGroupNorm for ROCm EP (#19303)
Add SkipGroupNorm for ROCm EP.

---------

Co-authored-by: Peixuan Zuo <peixuanzuo@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2024-02-21 11:08:48 +08:00
Jake Mathern
7a5860e490
Fix cmake function duplicate lib (#19547)
### Description
Fixes cmake function definition in winml.cmake to copy link flags.



### Motivation and Context
XFGCheck errors in WindowsAI because this function does not transfer
linker flags
2024-02-20 13:41:40 -08:00
pengwa
b55260d076
Minor fix for cmake (#19552)
### Minor fix for cmake

When building on Linux, CMake prints the following warning: "
CMake Warning at CMakeLists.txt:1603 (message):
  MPI and NCCL disabled on Win build.
"

This message is not correct on Linux, so this fix avoids any
misunderstanding by users.


![image](https://github.com/microsoft/onnxruntime/assets/10530022/848c2d77-a538-4e31-8e0d-4b539233e515)




### Motivation and Context
2024-02-19 10:21:19 +08:00
Scott McKay
4e5119760d
Add initial support for CoreML ML Program to the CoreML EP. (#19347)
### Description
Adds infrastructure to create an ML Package containing the Model using
ML Program. Updated coremltools files to v7.1 to bring in new protobuf
definitions along with the tools to write the weight.bin file and create
an ML Package correctly.

Enables building a CoreML Model on all platforms which means all the
operator builder code can be debugged anywhere. Execution of the
generated CoreML model is obviously limited to Apple platforms.

The Conv operator builder has been updated to be able to generate an ML
Program Operation.


### Motivation and Context
NeuralNetwork is no longer being developed and ML Program is the
replacement going forward.
2024-02-15 08:46:03 +10:00
George Wu
5e70c6b3a6
allow protobuf lite build for TRT EP (#19498)
allow protobuf-lite builds with TensorRT EP as long as it's built with
the trt built-in parser and not the oss-parser.
This is because trt built-in parser statically links protobuf so there
aren't any conflicts for protobuf-lite.
2024-02-12 22:53:04 -08:00
Patrice Vignola
1182b5509b
Disable streams for the DML EP (#19481)
There's currently a bug in the allocation planner: when buffers are reused
and more than one stream is used, it is possible (although rare) to
reach a reference count of 0 for a buffer that is still being
used. Since DML doesn't benefit from multiple streams, disabling them is
the safest option for now.

This is a high priority issue that we need to fix for 1.17.1 since it
breaks stable diffusion. Identifying the perfect fix and fixing the
underlying issue would be too risky for a patch release, especially
given the limited time that we have.

https://github.com/microsoft/onnxruntime/issues/19480
2024-02-10 00:34:34 -08:00
Changming Sun
1007d8f3d1
Revert "Revert NeuralSpeed code for x64 MatMulNBits (#19382)" (#19474)
This reverts commit 0d10c7f3c1.
2024-02-09 09:24:54 -08:00
luoyu-intel
0d10c7f3c1
Revert NeuralSpeed code for x64 MatMulNBits (#19382)
### Description
Revert PR#19016 https://github.com/microsoft/onnxruntime/pull/19016
Revert PR#17669 https://github.com/microsoft/onnxruntime/pull/17669
2024-02-07 13:04:37 -08:00
Maximilian Müller
91b2e660fe
[Build] fix: missing nvcc flags when compiling with unittests (#19308)
When configured with the following CMake options, CLion is not able to
configure the project because it checks with `nvcc ... --dryrun tmp.cu`:
```
cmake -G Ninja -Donnxruntime_USE_TENSORRT="ON" -Donnxruntime_USE_CUDA="ON" -Donnxruntime_USE_CUDA_NHWC_OPS="ON" -DCMAKE_CUDA_ARCHITECTURES="native" -Donnxruntime_NVCC_THREADS=1 -Donnxruntime_ENABLE_NVTX_PROFILE="ON" -Donnxruntime_USE_TENSORRT_BUILTIN_PARSER="ON" -DCMAKE_CUDA_COMPILER_LAUNCHER="ccache" -Donnxruntime_BUILD_UNIT_TESTS="ON" -Donnxruntime_USE_TRITON_KERNEL=OFF -Donnxruntime_USE_FLASH_ATTENTION=OFF
```
Without building the unittests everything works fine. I believe my
changes only follow the logic that is actually desired. If
`NVCC_HAS_STRICT_ALIASING` is set to false it should not be possible to
add this as a CUDA flag. Same is true for `HAS_NOERROR` as seen in
`adjust_global_compile_flags.cmake`
2024-02-06 17:01:26 -08:00
Ye Wang
aaf32fb1b1
phi2 conversion/optimization script (#19338)
### Description
This PR adds 
onnx conversion script for dynamo exported phi2,
optimization script,
and inference example script

A readme file is added as documentation.
https://github.com/microsoft/onnxruntime/tree/wangye/phi2_doc/onnxruntime/python/tools/transformers/models/phi2#readme


### Motivation and Context

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2024-02-05 10:15:16 -08:00
Scott McKay
debd1cab10
Add coremltools 7.1 as a dependency (#19389)
### Description
Setup usage of coremltools via dependencies instead of copying files. 
Pull in some changes from
https://github.com/microsoft/onnxruntime/pull/19347 in preparation for
supporting ML Program and enabling building the ML Model on all
platforms to make development and testing of CoreML EP code easier.

- Update to coremltools 7.1 
- Add patch for changes required for cross platform build of ML Program
related code
- Generate coreml proto files on all platforms
- mainly to test these changes work everywhere, as the proto files will
be used on all platforms when #19347 is checked in
- rename onnxruntime_coreml_proto target to coreml_proto as it contains
purely coreml protobuf code with no ORT related changes

### Motivation and Context
Improve setup.
2024-02-03 09:42:21 +10:00
He Li
1bdd7d9499
Update oneDNN to v3.0.1 in order to support gcc 13 (#19344)
### Description

Update the dependency of `oneDNN` to v3.0.1, which fixes a minor bug
hindering gcc 13.

### Motivation and Context


Referring to
[oneDNN-1548](https://github.com/oneapi-src/oneDNN/issues/1548).

- When building with `--use_dnnl` using gcc 13.x, it will fail due to
this upstream issue.
- This is fixed in `v3.0.1`
[tag](https://github.com/oneapi-src/oneDNN/tree/v3.0.1) by [this
commit](1d7971ce48).
2024-02-01 15:39:03 -08:00
Yueqing Zhang
1d6f13fb92
[VitisAI] Refactor the VAIEP to use MSFT's standalone API (#19058)
### Description
Refactor the VAIEP to use MSFT's standalone API


### Motivation and Context
Vitis ONNX RT VAI should switch to using the standalone API for ONNX EPs
in order to decouple the EP from onnxruntime.dll and the providers.dll.
This will help to simplify customer deployment of applications and use
cases that need to share their onnxruntime.dll with other applications.

---------

Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>
Co-authored-by: zz002 <zhenze.wang@amd.com>
2024-01-31 21:08:26 -08:00
Yi-Hong Lyu
55b60d8fe0
Turn off Neural Speed to avoid slowdowns (#19265)
Disable Neural Speed to prevent the operation following MatMulNBits from
significantly slowing down.
2024-01-31 13:40:25 -08:00
Phoebe Chen
2b361c04d6
Fix Flatbuffer build issue. (#19296)
### Description

Building on g++ 13.2.0 results in -Wstringop-overread errors on Linux.
This commit addresses the flatbuffer build issue with the following
changes:
1. Remove the Werror flag in the flatbuffer patch.
2. Add a compilation option to suppress the 'stringop-overflow' error in
the Flatbuffers within the xnnpack provider.

### Motivation and Context
https://github.com/google/flatbuffers/issues/8119
https://github.com/microsoft/onnxruntime/pull/19239

Signed-off-by: Phoebe Chen <phoebe.chen@sifive.com>
2024-01-31 10:12:43 -08:00
Changming Sun
8dad9d92f4
Move einsum's test data to constexpr variables (#19320)
### Description
emscripten's C++ compiler has difficulty compiling einsum_test.cc
because the file has too many local variables. So I moved them to
constexpr.
2024-01-30 15:59:37 -08:00
Changming Sun
a92802f940
Disable a few tests for wasm build (#19316) 2024-01-30 08:16:57 -08:00
Tianlei Wu
8b4517218b
Remove USE_CUTLASS flag (#19271)
### Description
Since Cutlass can be built with CUDA 11.4 (The minimum CUDA version for
onnxruntime CUDA build), there is no need to have a flag to disable
cutlass.

Changes:
(1) Reverted https://github.com/microsoft/onnxruntime/pull/18761
(2) remove the condition to build cutlass.
(3) Fix a few build errors or warnings during testing CUDA 11.4 build. 

Note that SM 89 and 90 (including fp8) require CUDA 11.8 or later.
Flash attention and cutlass fused multihead attention will not be built
for CUDA < 11.6. It is recommended to use CUDA 11.8 or above to build if
you want to support the latest GPUs.
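
The version thresholds stated above can be summarized in a small Python sketch. The `cuda_build_features` function and its dictionary keys are hypothetical and are not part of the build scripts; only the version numbers come from this description.

```python
from typing import Dict, Tuple


def cuda_build_features(cuda_version: Tuple[int, int]) -> Dict[str, bool]:
    """Map a (major, minor) CUDA version to what the notes say can be built."""
    return {
        # 11.4 is the minimum CUDA version for the onnxruntime CUDA build.
        "cuda_ep": cuda_version >= (11, 4),
        # Not built for CUDA < 11.6.
        "flash_attention": cuda_version >= (11, 6),
        "cutlass_fused_mha": cuda_version >= (11, 6),
        # SM 89 and 90 (including fp8) need CUDA 11.8 or later.
        "sm89_sm90_fp8": cuda_version >= (11, 8),
    }
```

With `(11, 4)` only the base CUDA EP entry is true, which matches the need for `--disable_types float8` in the example build command.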

It is better to include it in 1.17.0 (otherwise, the release branch
might encounter build failure with CUDA 11.4).

Tests:
(1) Build with flash attention and efficient attention off: **passed**
(2) Build with CUDA 11.4: **passed**

Example build command used in Ubuntu 20.04:
```
export CUDA_HOME=/usr/local/cuda-11.4
export CUDNN_HOME=/usr/lib/x86_64-linux-gnu/
export CUDACXX=/usr/local/cuda-11.4/bin/nvcc

sh build.sh --config Release  --build_shared_lib --parallel  --use_cuda --cuda_version 11.4 \
            --cuda_home $CUDA_HOME --cudnn_home $CUDNN_HOME --build_wheel --skip_tests \
            --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=80 \
            --disable_types float8
```

### Motivation and Context
2024-01-25 16:57:58 -08:00
PeixuanZuo
1c92e56dc0
[Cuda] Refactor GroupNorm (#19146)
Split the GroupNorm implementation into multiple files so that the ROCm EP can
reuse the CUDA code.

Related PR: https://github.com/microsoft/onnxruntime/pull/19158

---------

Co-authored-by: Peixuan Zuo <peixuanzuo@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
2024-01-25 22:28:47 +08:00
Phoebe Chen
4477f57ee3
Enable RISC-V 64-bit Cross-Compiling Support for ONNX Runtime on Linux (#19238)
### Description  
This pull request introduces the necessary changes to enable RISC-V
64-bit cross-compiling support for the ONNX Runtime on Linux. The RISC-V
architecture has gained popularity as an open standard instruction set
architecture, and this contribution aims to extend ONNX Runtime's
compatibility to include RISC-V, thereby broadening the reach of ONNX
models to a wider range of devices.

### Motivation and Context
RISC-V is a free and open-source instruction set architecture (ISA)
based on established RISC principles. It is provided under open licenses
without fees. Due to its extensibility and freedom in both software and
hardware, RISC-V is poised for widespread adoption in the future,
especially in applications related to AI, parallel computing, and data
centers.

### Example Build Command
```
./build.sh --parallel --config Debug --rv64 --riscv_toolchain_root=/path/to/toolchain/root --skip_tests
```

### Documentation Updates
Relevant sections of the documentation will be updated to reflect the
newly supported RISC-V 64-bit cross-compilation feature.
https://github.com/microsoft/onnxruntime/pull/19239

---------

Signed-off-by: Phoebe Chen <phoebe.chen@sifive.com>
2024-01-24 16:27:05 -08:00
Changming Sun
bc54ad3f03
Update abseil to a release tag and register neural_speed (#19255)
### Description
Update abseil to a release tag and register neural_speed to CG.


### Motivation and Context
Now we are using a non-released version of abseil. Using a tag is better.
2024-01-24 14:37:39 -08:00