### Description
Resolve#21976 .
ABSL generally does not have forward/backward compatibility. Our code is
only compatible with one fixed LTS version. So it's important to fix the
version number there when using find_package to detect an installed
version.
### Description
This change introduces the WebGPU EP into ONNX Runtime.
To make the PR as simple as possible, this PR excluded the following:
- C API changes for WebGPU EP
- actual implementation of WebGPU EP. Currently in this PR, WebGPU is a
stub implementation that does not register any kernel.
- Python IO Binding update
- Node.js IO Binding update
This PR now contains only 43 file changes (while the working branch
contains 130+) and hopefully this makes it easier to review.
There is going to be separated PRs for each mentioned above.
Current working branch: #21904
- Add Java API for appending QNN EP
- Update Java unit test setup
- Fix issues with setting system properties for tests
- Unify Windows/non-Windows setup to simplify
### Description
<!-- Describe your changes. -->
NS is not developed anymore and ORT doesn't use it for int4 inference
either. Remove it to clean up the code
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
To fix the build issues for AIX OS while using system installed
protobuf/onnx.
### Motivation and Context
Code changes in this PR contains:
1. Fix for below compilation issue.
```
collect2: fatal error: library liblibprotobuf-lite not found
compilation terminated.
```
2. Adding onnx library into dependency list for test applicaitons.
### Description
TensorRT 10.4 is GA now, update to 10.4
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Composable Kernel build fails under ROCm 6.2.
This PR patches Composable Kernel the same way as
https://github.com/ROCm/composable_kernel/pull/1346
* fix buffer resource to match "s" constraint
* add missing memory clobber
### Description
<!-- Describe your changes. -->
Specify the path of `ar`, `ld` and `libtool` when building apple
framework.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Sometimes non-system executables will comes before the system-provided
ones. This PR intends to prevent it from happening.
### Description
<!-- Describe your changes. -->
Use the latest nuget.exe for the `readme` property to be supported.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
#22137
### Description
Update XNNPack to latest version (Sep 4)
- Some op outputs are changed, channel or stride paras are moved into
reshape func.
e.g.
96962a602d
- input params of xnnpack's resize related function are changed a lot
- KleidiAI is added as a dependency in ARM64
- The latest XNNPACK includes 2 static libs microkernels-prod and
xnnpack.
Without microkernels-prod, it throws the exception of Undefined symbols.
- Add ORT_TARGET_PROCESSOR to get the real processor target in CMake
### Description
This PR makes the following updates to the Arm Compute Library execution
provider:
- Target Arm Compute Library 24.07
- Add support for the following operators:
- Conv (FP16)
- NhwcConv
- QLinearConv
- MatMul
- FusedMatMul
- MatMulIntegerToFloat
- Optimize memory usage and performance
- Expose the enable_fast_math setting
- Use the main runtime thread pool
### Motivation and Context
These updates improve performance and memory usage, and enable use of a
more recent version of Arm Compute Library.
@microsoft-github-policy-service agree company="Arm Ltd"
---------
Signed-off-by: Michael Tyler <michael.tyler@arm.com>
### Description
This PR adds two new libfuzzer in fuzzer project.
1. Binary libfuzzer
2. libprotobuf-fuzzer
To compile run below cmd on linux:
```
LLVM_PROFILE_FILE="%p.profraw" CFLAGS="-g -fsanitize=address,fuzzer-no-link -shared-libasan -fprofile-instr-generate -fcoverage-mapping" CXXFLAGS="-g -shared-libasan -fsanitize=address,fuzzer-no-link -fprofile-instr-generate -fcoverage-mapping" CC=clang CXX=clang++ ./build.sh --update --build --config Debug --compile_no_warning_as_error --build_shared_lib --skip_submodule_sync --use_full_protobuf --parallel --fuzz_testing --build_dir build/
```
Run fuzzer:
```
LD_PRELOAD=$(clang -print-file-name=libclang_rt.asan-x86_64.so) build/Debug/onnxruntime_libfuzzer_fuzz testinput -rss_limit_mb=8196 -max_total_time=472800 -fork=2 -jobs=4 -workers=4 -ignore_crashes=1 -max_len=2097152 2>&1 | grep -v "\[libprotobuf ERROR"
```
### Motivation and Context
The existing custom fuzzer is not coverage guided and it's slow and it
will work on one model mutation at a time. The new fuzzers are coverage
guided, and we can use more models' files as a corpus to increase the
coverage.
### Description
<!-- Describe your changes. -->
Change the `CMAKE_CXX_COMPILER_VERSION` greater than `11` for using
'-mavxvnni'.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
`CMakeFiles/onnxruntime_mlas.dir/root/Git.d/onnxruntime/onnxruntime/core/mlas/lib/x86_64/QgemmU8S8KernelAvx2.S.o
cc: error: unrecognized command-line option ‘-mavxvnni’; did you mean
‘-mavx512vnni’?` using `gcc (GCC) 10.3.1`.
`-mavxnni` is supported since [GCC 11
Release](https://gcc.gnu.org/gcc-11/changes.html), this PR change the
version check.
Error Codes are added to catch compilation error and signal recompile.
Remote Tensors are added to ensure direct memory access for NPU
inferencing.
UMD Bypass cache enabled with 2024.4 will eliminate need to disk caching
### Motivation and Context
The changes are needed to ensure backward compatibility
UMD Bypass caching eliminates driver caching
Remote Tensors lead to performance improvement with inferencing on NPU
---------
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: Srirammaswamy <srirammaswamy.s@intel.com>
Co-authored-by: saurabh <saurabh1.kale@intel.com>
Co-authored-by: Javier E. Martinez <javier.e.martinez@intel.com>
Co-authored-by: Eric Crawford <eric.r.crawford@intel.com>
Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>
### Description
Implementation of a new cast assembly kernel that uses AVX_NE_CONVERT
instructions to accelerate casting from FP16 to FP32. Added CPUID checks
to determine support of the ISA.
### Motivation and Context
Currently FP16 models executed on systems that lack complete FP16
operator support use single precision on every node to run the model,
this means the original FP16 weights have to be casted to FP32 in order
to run the model properly, this change aims to accelerate the casting by
using upconvert instructions and therefore improve performance.
### Description
Added some change in fuzzer project code to support linux also.
How to test on linux:
1. Make sure you have installed clang/llvm.
2. run below command to build asan instrumented project:
```
CFLAGS="-g -fsanitize=address -shared-libasan -fprofile-instr-generate -fcoverage-mapping" CXXFLAGS="-g -shared-libasan -fsanitize=address -fprofile-instr-generate -fcoverage-mapping" CC=clang CXX=clang++ ./build.sh --update --build --config Debug --compile_no_warning_as_error --build_shared_lib --skip_submodule_sync --skip_tests --use_full_protobuf --parallel --fuzz_testing --build_dir build/
```
3. run fuzzer for some time, it will generate *.profraw file:
```
LLVM_PROFILE_FILE="%p.profraw" ./build/Debug/onnxruntime_security_fuzz /t /v onnxruntime/test/testdata/bart_tiny.onnx 1 m
```
4. Get the cov by running below cmd:
```
llvm-profdata merge -sparse *.profraw -o default.profdata
llvm-cov report ./build/Debug/onnxruntime_security_fuzz -instr-profile=default.profdata
```
<img width="1566" alt="Screenshot 2024-09-05 at 4 25 08 PM"
src="https://github.com/user-attachments/assets/2aa0bb83-6634-4d33-b026-3535e97df431">
### Motivation and Context
1. Currently fuzzer only supports windows and MSVC, we can't generate
the code coverage using MSVC. With clang/llvm we can try and use clang
instrumentation and llvm tools like llvm-cov.
2. In future we can add coverage guided fuzzer (libfuzzer) in same
project. (Working on it)
Calling Split API Calls Read+Model in lieu of unified Compile Model call
for export compile flow to ensure memory optimization. Freeing up model
proto and serialized string and read model ov ir later to free up memory
for the ahead pipeline
Optimization during EpCtxt flow
All the Graph related operations require all the Node Attributes to be
set while dealing with model instances internally with them, in the
existing implementation these attributes make a copy when constructing a
Graph dynamically during runtime.
Propose to use these attributes in place without creating a copy to
avoid memory allocation / copy while calling these Graph related
functions.
Changes to ensure the bug fixes related to openvino version and epctxt
file path.
Moving Compiler version to C++20 for getting r-value mem optimizations
benefit
### Motivation and Context
This change is required because memory optimization during Compilation
flow is too high.
---------
Co-authored-by: saurabhkale17 <saurabh1.kale@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: Vishnudas Thaniel S <vishnudas.thaniel.s@intel.com>
Co-authored-by: Javier E. Martinez <javier.e.martinez@intel.com>
Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com>
Co-authored-by: ankitm3k <ankit.maheshkar@intel.com>
Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>
### Description
Implement softcap for gqa.
### Motivation and Context
Fixes certain models like Gemma-2 which need softcap to work so they
don't output nan's.
### Description
Enabling python binding and gcc support for AIX.
### Motivation and Context
Code changes in this PR contains:
1. python binding enablement
2. gcc building support
Below are list of files and the description.
1. cmake/CMakeLists.txt
[gcc building support] -no-unused-function compiler flag addition for
IBMClang
2. cmake/external/eigen.cmake
[gcc building support] AIX check for applying the AIX patch
3. cmake/onnxruntime_python.cmake
[python binding ] putting NOT AIX check for -Xlinker
4. cmake/onnxruntime_unittests.cmake
[gcc building support] Fix for gtest behavior. Check the comment .
[python binding ] using -Wl,-brtl for linking
onnxruntime_providers_shared in test_execution_provider
5. cmake/patches/eigen/eigen-aix.patch
[gcc building support] In AIX gcc, we are hitting
__builtin_cpu_supports("mma") which is not supported yet. So patching
code for this method . Patched code will check for P10 Processor at
run-time and based on that routine will be set.
6. onnxruntime/python/onnxruntime_validation.py
[python binding ] Adding AIX check in check_distro_info()
7. onnxruntime/test/providers/cpu/generator/random_test.cc
[gcc building support] updating previous check for AIX , along with
clang. So in case of gcc, else block will hit.
8. onnxruntime/test/python/onnxruntime_test_python.py
[python binding ] powerpc check on platform.processor()
9. setup.py
[python binding ] Adding AIX check for list of libs.
### Description
This change disables Abseil's symbolize functionality in Windows
non-debug builds.
### Motivation and Context
To solve #21826. Avoid having a dependency on dbghelp.dll.
### Description
* Add new ROCm CI pipeline (`Linux ROCm CI Pipeline`) focusing on
inference.
* Resolve test errors; disable flaky tests.
based on test PR #21614.
### Description
<!-- Describe your changes. -->
refer to https://github.com/microsoft/onnxruntime/pull/21867
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
---------
Co-authored-by: Your Name <you@example.com>
### Description
Extends the Drop QDQ optimization to remove DequantizeLinear and
QuantizeLinear nodes from around operators:
- Flatten
- Expand
- Tile
- Slice
- GatherElements
- ReduceMin
- ReduceMax
### Motivation and Context
To reduce floating-point conversions in quantize inference. Mainly
motivated by the Flatten case, since that will show up in graphs
exported from PyTorch to ONNX. But to make the change complete,
extending to a larger set of ops for which this optimization is valid.
https://github.com/microsoft/onnxruntime/issues/21375
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
<!-- Describe your changes. -->
Remove legacy code and wrong message.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This is required by Microsoft to remove unwanted error message. This is
required for 8.15 release.
Co-authored-by: Yueqing Zhang <yueqingz@amd.com>
### Description
Compiling onnxruntime with QNN EP on Windows x86_64 results in a
compilation error:
```shell
$ onnxruntime\test\optimizer\qdq_transformer_test.cc(1,1): error C1128: num
ber of sections exceeded object file format limit: compile with /bigobj [...onnxruntime\build\Debug\onnxruntime_test_all.vcxproj]
```
This PR adds the `/bigobj` compilation flag for the
`qdq_transformer_test.cc` file.
### Description
- TensorRT 10.2.0.19 -> 10.3.0.26
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
<!-- Describe your changes. -->
Handle targets in subdirectories for external projects. All targets will
now go in a per-project folder under 'External'
e.g. gmock and gtest now get handled correctly and are under
External/googletest vs. existing setup where they ended up as top-level
projects.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Improve developer experience.
### Description
Update DML runtime binary to 1.15.1
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Several tests result in segfaults during the minimal cuda build.
Although test failures are expected due to the limitation of the minimal
cuda EP, failing gracefully would be much preferred.
### Motivation and Context
To reproduce:
1. Build ORT with:
```bash
./build.sh --build_shared_lib --use_full_protobuf --cuda_home /usr/local/cuda --cudnn_home /usr/lib/x86_64-linux-gnu/ --tensorrt_home /TensorRT-10.0.1.6 --parallel --skip_tests --skip_submodule_sync --allow_running_as_root --use_tensorrt --cmake_extra_defines onnxruntime_CUDA_MINIMAL=1
```
2. Run `onnxruntime_test_all`
```bash
...
[----------] 1 test from AllocationPlannerTest
[ RUN ] AllocationPlannerTest.ReusedInputCrossDifferentStreams
Segmentation fault (core dumped)
```