onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-06-06 00:03:22 +00:00

Author	SHA1	Message	Date
Dmitri Smirnov	224f0651d0	[C#] Expose Multi-Lora support in C# (#22281 ) ### Description ### Motivation and Context https://github.com/microsoft/onnxruntime/pull/22046	2024-10-02 10:00:43 -07:00
Edward Chen	c24e55b1f1	[Java] Add API for appending QNN EP (#22208 ) - Add Java API for appending QNN EP - Update Java unit test setup - Fix issues with setting system properties for tests - Unify Windows/non-Windows setup to simplify	2024-10-01 10:18:04 -07:00
Yufeng Li	96e9c99dce	remove neural-speed (#22236 ) ### Description NS is not developed anymore and ORT doesn't use it for int4 inference either. Remove it to clean up the code ### Motivation and Context	2024-10-01 09:50:44 -07:00
Dmitri Smirnov	d9de054eb5	Multi-Lora support (#22046 ) ### Description ### Motivation and Context	2024-09-30 15:59:07 -07:00
Ranjit Ranjan	812075731c	[AIX] Build fix for using system installed protobuf/onnx (#22272 ) ### Description Fixes the build issues on AIX when using system-installed protobuf/onnx. ### Motivation and Context Code changes in this PR contain: 1. Fix for the compilation issue below. ``` collect2: fatal error: library liblibprotobuf-lite not found compilation terminated. ``` 2. Adding the onnx library to the dependency list for test applications.	2024-09-30 12:36:21 -07:00
Sumit Agarwal	529835cc46	[DML EP] Update DML to 1.15.2 (#22247 ) ### Description Update DML binary to the current latest redist version [1.15.2](https://www.nuget.org/packages/Microsoft.AI.DirectML/1.15.2).	2024-09-27 13:20:29 -07:00
Jing Fang	1942e40e05	[ARM64] MatMulNBits: use neon intrinsics to convert between fp16 and fp32 (#22195 ) ### Description For fp16 Atype, the fallback operation is to convert the data to fp32 and calculate. Added a neon intrinsics version to speed up the conversion. Store address alignment and loop unrolling have insignificant impact on latency so they are omitted.

| Benchmark | Time | CPU |
|---|---|---|
| M_ConvertF16ToF32/baseline/real_time | 1076961 ns | 1083398 ns |
| M_ConvertF16ToF32/aligned:0/real_time | 46785 ns | 46516 ns |
| M_ConvertF16ToF32/aligned:1/real_time | 46631 ns | 46391 ns |
| M_ConvertF16ToF32_unroll2/aligned:0/real_time | 44074 ns | 44392 ns |
| M_ConvertF16ToF32_unroll2/aligned:1/real_time | 44726 ns | 45226 ns |
| M_ConvertF32ToF16/baseline/real_time | 520109 ns | 527329 ns |
| M_ConvertF32ToF16/aligned:0/real_time | 73610 ns | 74015 ns |
| M_ConvertF32ToF16/aligned:1/real_time | 71557 ns | 71525 ns |
| M_ConvertF32ToF16_unroll2/aligned:0/real_time | 64227 ns | 63374 ns |
| M_ConvertF32ToF16_unroll2/aligned:1/real_time | 67428 ns | 67989 ns |

### Motivation and Context Speed up the fallback implementation of Fp16 MatMulNBits	2024-09-26 13:55:40 -07:00
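The fp16 to fp32 conversion that commit 1942e40e05 accelerates can be illustrated with a scalar reference. This is only a sketch of what a per-element conversion has to do, written in Python for readability; it is not the MLAS baseline or the neon code, and the function names `f16_bits_to_f32` and `convert_f16_to_f32` are hypothetical.

```python
def f16_bits_to_f32(h: int) -> float:
    """Decode one IEEE 754 binary16 bit pattern (0..0xFFFF) into a Python float.

    Hypothetical helper for illustration; every binary16 value is exactly
    representable in fp32, so the returned value is also the fp32 result.
    """
    sign = -1.0 if (h >> 15) & 1 else 1.0
    exponent = (h >> 10) & 0x1F   # 5 exponent bits, bias 15
    fraction = h & 0x3FF          # 10 fraction bits
    if exponent == 0:
        # Zero and subnormals: no implicit leading 1, fixed scale 2^-24.
        return sign * fraction * 2.0 ** -24
    if exponent == 0x1F:
        # All-ones exponent encodes infinity (fraction 0) or NaN.
        return sign * float("inf") if fraction == 0 else float("nan")
    # Normal numbers: implicit leading 1.
    return sign * (1.0 + fraction / 1024.0) * 2.0 ** (exponent - 15)


def convert_f16_to_f32(bits):
    """Element-by-element conversion of a buffer of binary16 bit patterns."""
    return [f16_bits_to_f32(h) for h in bits]


print(convert_f16_to_f32([0x3C00, 0xC000, 0x7BFF]))
```

A vectorized version performs the same decoding for several lanes per instruction, which is where the speedup in the benchmark table comes from.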
jingyanwangms	d0b0ecfdb9	[Running CI] Update TensorRT to 10.4 (#22049 ) ### Description TensorRT 10.4 is GA now, update to 10.4 ### Motivation and Context	2024-09-26 11:10:52 -07:00
Edward Chen	209ff86d52	Get build working on Xcode 16 (#22168 )	2024-09-24 08:33:03 -07:00
Hann Wang	7a782b7213	[ROCm] fix rocm-6.2 build issues (#21993 ) Composable Kernel build fails under ROCm 6.2. This PR patches Composable Kernel the same way as https://github.com/ROCm/composable_kernel/pull/1346 * fix buffer resource to match "s" constraint * add missing memory clobber	2024-09-23 14:01:54 -07:00
Chester Liu	9b37b3ea44	Specify the paths of system tools when building Apple framework (#22056 ) ### Description Specify the paths of `ar`, `ld` and `libtool` when building the Apple framework. ### Motivation and Context Sometimes non-system executables come before the system-provided ones. This PR prevents that from happening.	2024-09-23 17:19:30 +08:00
Yi Zhang	8d2d40781c	set CMAKE_SYSTEM_PROCESSOR in xnnpack.cmake (#22155 ) ### Description ### Motivation and Context By default, CMAKE_SYSTEM_PROCESSOR is the same as CMAKE_HOST_SYSTEM_PROCESSOR: https://cmake.org/cmake/help/latest/variable/CMAKE_SYSTEM_PROCESSOR.html KleidiAI uses CMAKE_SYSTEM_PROCESSOR to determine whether to include some arm64 ukernels: https://gitlab.arm.com/kleidi/kleidiai/-/blob/main/CMakeLists.txt#L134 The iOS packaging pipeline uses a Mac with an Intel CPU to cross-compile for Macs with ARM, so CMAKE_SYSTEM_PROCESSOR needs to be set to the same value as ORT_TARGET_PROCESSOR.	2024-09-20 15:19:26 -07:00
Scott McKay	bd60add8ce	Update nuget.exe used in WindowsAI nuget packaging so `readme` property is supported. (#22141 ) ### Description Use the latest nuget.exe for the `readme` property to be supported. ### Motivation and Context #22137	2024-09-19 19:06:47 +10:00
Scott McKay	99ee6eeca2	Enable Android 16 KB page size support (#22076 ) ### Description Add linker flags to support a 16 KB page size on Android. See https://source.android.com/docs/core/architecture/16kb-page-size/16kb#build-lib-16kb-alignment ### Motivation and Context #21837	2024-09-19 18:53:57 +10:00
George Wu	944d87381d	[QNN EP] set up py packaging pipeline for Linux x64 (#22132 ) set up a pipeline to produce nightly Linux x64 whls for onnxruntime-qnn this can be used for offline context binary generation.	2024-09-18 23:24:32 -07:00
Tianlei Wu	a9740d6f96	Add onnx export script for segment anything v2 (#22119 ) ### Description Add ONNX export script for segment anything v2 (SAM2). ### Limitations * Does not support video. Only support image right now. * The decoder does not support batch inference. ### Credits The demo that is based on [SAM2 notebook](https://github.com/facebookresearch/segment-anything-2/blob/main/notebooks/image_predictor_example.ipynb), and modified to run with ORT. The export of decoder is inspired by https://github.com/vietanhdev/samexporter. ### Demo Example output of demo: ![sam2_demo](https://github.com/user-attachments/assets/9a9fa360-8c20-482e-9935-a7aba9cf15de) ### Motivation and Context For support optimization of SAM2 image segmentation.	2024-09-18 14:31:59 -07:00
Yi Zhang	b94ba09e4f	Upgrade XNNPACK to latest version (#22012 ) ### Description Update XNNPack to latest version (Sep 4) - Some op outputs are changed, channel or stride paras are moved into reshape func. e.g. `96962a602d` - input params of xnnpack's resize related function are changed a lot - KleidiAI is added as a dependency in ARM64 - The latest XNNPACK includes 2 static libs microkernels-prod and xnnpack. Without microkernels-prod, it throws the exception of Undefined symbols. - Add ORT_TARGET_PROCESSOR to get the real processor target in CMake	2024-09-17 10:12:16 -07:00
liqun Fu	a89bddd5c2	Matmul_nbits kernel for mlas sqnbits to support Fp16 inputs (#21807 )	2024-09-13 14:55:08 -07:00
Michael Tyler	904b850b44	Update Arm Compute Library Execution Provider (#22032 ) ### Description This PR makes the following updates to the Arm Compute Library execution provider: - Target Arm Compute Library 24.07 - Add support for the following operators: - Conv (FP16) - NhwcConv - QLinearConv - MatMul - FusedMatMul - MatMulIntegerToFloat - Optimize memory usage and performance - Expose the enable_fast_math setting - Use the main runtime thread pool ### Motivation and Context These updates improve performance and memory usage, and enable use of a more recent version of Arm Compute Library. @microsoft-github-policy-service agree company="Arm Ltd" --------- Signed-off-by: Michael Tyler <michael.tyler@arm.com>	2024-09-12 20:51:59 -07:00
0xdr3dd	5c361106e6	[Fuzzer] Add two new ORT libfuzzer (Linux clang support for now) (#22055 ) ### Description This PR adds two new libfuzzer in fuzzer project. 1. Binary libfuzzer 2. libprotobuf-fuzzer To compile run below cmd on linux: ``` LLVM_PROFILE_FILE="%p.profraw" CFLAGS="-g -fsanitize=address,fuzzer-no-link -shared-libasan -fprofile-instr-generate -fcoverage-mapping" CXXFLAGS="-g -shared-libasan -fsanitize=address,fuzzer-no-link -fprofile-instr-generate -fcoverage-mapping" CC=clang CXX=clang++ ./build.sh --update --build --config Debug --compile_no_warning_as_error --build_shared_lib --skip_submodule_sync --use_full_protobuf --parallel --fuzz_testing --build_dir build/ ``` Run fuzzer: ``` LD_PRELOAD=$(clang -print-file-name=libclang_rt.asan-x86_64.so) build/Debug/onnxruntime_libfuzzer_fuzz testinput -rss_limit_mb=8196 -max_total_time=472800 -fork=2 -jobs=4 -workers=4 -ignore_crashes=1 -max_len=2097152 2>&1 \| grep -v "\[libprotobuf ERROR" ``` ### Motivation and Context The existing custom fuzzer is not coverage guided and it's slow and it will work on one model mutation at a time. The new fuzzers are coverage guided, and we can use more models' files as a corpus to increase the coverage.	2024-09-12 11:50:34 -07:00
wangshuai09	d539c27de8	Fix version check for using -mavxvnni (#21616 ) ### Description Change the `CMAKE_CXX_COMPILER_VERSION` check to require a version greater than `11` before using `-mavxvnni`. ### Motivation and Context Building with `gcc (GCC) 10.3.1` fails with: `CMakeFiles/onnxruntime_mlas.dir/root/Git.d/onnxruntime/onnxruntime/core/mlas/lib/x86_64/QgemmU8S8KernelAvx2.S.o cc: error: unrecognized command-line option ‘-mavxvnni’; did you mean ‘-mavx512vnni’?` `-mavxvnni` has been supported since the [GCC 11 Release](https://gcc.gnu.org/gcc-11/changes.html), so this PR changes the version check.	2024-09-12 11:42:17 -07:00
sfatimar	0309c5f02f	Ovep release lnl 1.2.1 (#22027 ) Error Codes are added to catch compilation error and signal recompile. Remote Tensors are added to ensure direct memory access for NPU inferencing. UMD Bypass cache enabled with 2024.4 will eliminate need to disk caching ### Motivation and Context The changes are needed to ensure backward compatibility UMD Bypass caching eliminates driver caching Remote Tensors lead to performance improvement with inferencing on NPU --------- Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com> Co-authored-by: Srirammaswamy <srirammaswamy.s@intel.com> Co-authored-by: saurabh <saurabh1.kale@intel.com> Co-authored-by: Javier E. Martinez <javier.e.martinez@intel.com> Co-authored-by: Eric Crawford <eric.r.crawford@intel.com> Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>	2024-09-11 14:55:40 -07:00
PARK DongHa	f633caa0b1	Create CMake option `onnxruntime_USE_VCPKG` (#21348 ) ### Changes 1. CMake option `onnxruntime_USE_VCPKG`. It will be used in the vcpkg port * Unit test may fail because this option leads to a mixture of unexpected external library versions. Especially ONNX, Protobuf, and Flatbuffers version can be different 2. Overhaul of `onnxruntime_external_deps.cmake` * Make `FetchContent_Declare` to try `find_package`. See https://cmake.org/cmake/help/latest/guide/using-dependencies/index.html * Relocated `FetchContent_Declare` and `FetchContent_MakeAvailable`(or `onnxruntime_fetchcontent_makeavailable`) to closer lines. It was too hard to navigate the entire file to search related sections... * Alias `IMPORTED` targets like build targets (e.g. `ONNX::onnx` --> `onnx`) ```cmake # The script uses `find_package` with the changes. # In this case, use vcpkg to search dependencies # See https://cmake.org/cmake/help/latest/guide/using-dependencies/index.html include(external/onnxruntime_external_deps.cmake) ``` 3. Create CMakePresets.json and presets to [run vcpkg in manifest mode](https://learn.microsoft.com/en-us/vcpkg/concepts/manifest-mode) * Currently, it's NOT for training build * Main triplets are `x64-windows` and `x64-osx` ```pwsh Push-Location "cmake" cmake --preset "x64-windows-vcpkg" cmake --build --preset "x64-windows-vcpkg-debug" Pop-Location ``` ```bash pushd "cmake" cmake --preset "x64-osx-vcpkg" cmake --build --preset "x64-osx-vcpkg-debug" popd ``` 4. 
Updated tools/ci_build/build.py * `--use_vcpkg` option: it needs `CMAKE_TOOLCHAIN_FILE` with [vcpkg.cmake toolchain script](https://github.com/microsoft/vcpkg/blob/master/scripts/buildsystems/vcpkg.cmake) * `--compile_no_warning_as_error` is recommended because library version differences will cause unexpected compiler warnings ```bash python ./tools/ci_build/build.py \ --compile_no_warning_as_error \ --use_vcpkg \ --cmake_extra_defines "CMAKE_TOOLCHAIN_FILE:FILEPATH=${VCPKG_ROOT}/scripts/buildsystems/vcpkg.cmake" \ --cmake_extra_defines "VCPKG_TARGET_TRIPLET=..." ``` 5. Created Job `Vcpkg` for Windows and macOS * Show how to setup and use vcpkg. Similar to the CMakePresets.json usage ### Motivation and Context * Help #7150 * Help https://github.com/microsoft/vcpkg/pull/36850 * https://github.com/luncliff/vcpkg-registry/pull/212 * https://github.com/microsoft/vcpkg/pull/39881 * https://github.com/luncliff/vcpkg-registry/pull/215 * https://github.com/luncliff/vcpkg-registry/pull/216 * https://github.com/luncliff/vcpkg-registry/pull/227 * https://cmake.org/cmake/help/latest/guide/using-dependencies/index.html * https://github.com/microsoft/vcpkg/blob/master/scripts/buildsystems/vcpkg.cmake ### Future Works? More feature coverage with the vcpkg supported libraries * CUDA feature support * Training feature support	2024-09-10 16:39:27 -07:00
Erick Muñoz	7489bfee53	Enable AVX NE CONVERT for FP16 to FP32 cast (#21183 ) ### Description Implementation of a new cast assembly kernel that uses AVX_NE_CONVERT instructions to accelerate casting from FP16 to FP32. Added CPUID checks to determine support of the ISA. ### Motivation and Context Currently FP16 models executed on systems that lack complete FP16 operator support use single precision on every node to run the model, this means the original FP16 weights have to be casted to FP32 in order to run the model properly, this change aims to accelerate the casting by using upconvert instructions and therefore improve performance.	2024-09-09 21:19:31 -07:00
0xdr3dd	2dae8aaced	[Fuzzer] Add fuzzer support for linux (#21996 ) ### Description Added some change in fuzzer project code to support linux also. How to test on linux: 1. Make sure you have installed clang/llvm. 2. run below command to build asan instrumented project: ``` CFLAGS="-g -fsanitize=address -shared-libasan -fprofile-instr-generate -fcoverage-mapping" CXXFLAGS="-g -shared-libasan -fsanitize=address -fprofile-instr-generate -fcoverage-mapping" CC=clang CXX=clang++ ./build.sh --update --build --config Debug --compile_no_warning_as_error --build_shared_lib --skip_submodule_sync --skip_tests --use_full_protobuf --parallel --fuzz_testing --build_dir build/ ``` 3. run fuzzer for some time, it will generate .profraw file: ``` LLVM_PROFILE_FILE="%p.profraw" ./build/Debug/onnxruntime_security_fuzz /t /v onnxruntime/test/testdata/bart_tiny.onnx 1 m ``` 4. Get the cov by running below cmd: ``` llvm-profdata merge -sparse .profraw -o default.profdata llvm-cov report ./build/Debug/onnxruntime_security_fuzz -instr-profile=default.profdata ``` <img width="1566" alt="Screenshot 2024-09-05 at 4 25 08 PM" src="https://github.com/user-attachments/assets/2aa0bb83-6634-4d33-b026-3535e97df431"> ### Motivation and Context 1. Currently fuzzer only supports windows and MSVC, we can't generate the code coverage using MSVC. With clang/llvm we can try and use clang instrumentation and llvm tools like llvm-cov. 2. In future we can add coverage guided fuzzer (libfuzzer) in same project. (Working on it)	2024-09-05 11:52:15 -07:00
Hector Li	190588bb64	Enable QNN weight sharing (#21077 ) ### Description Enable QNN weight sharing across graphs in single context Create tool to generate QNN context cache model with weight sharing enabled.	2024-09-04 11:20:33 -07:00
sfatimar	8dba8e3e24	Memory Optimization for Compilation in OVEP (#21872 ) Calling Split API Calls Read+Model in lieu of unified Compile Model call for export compile flow to ensure memory optimization. Freeing up model proto and serialized string and read model ov ir later to free up memory for the ahead pipeline Optimization during EpCtxt flow All the Graph related operations require all the Node Attributes to be set while dealing with model instances internally with them, in the existing implementation these attributes make a copy when constructing a Graph dynamically during runtime. Propose to use these attributes in place without creating a copy to avoid memory allocation / copy while calling these Graph related functions. Changes to ensure the bug fixes related to openvino version and epctxt file path. Moving Compiler version to C++20 for getting r-value mem optimizations benefit ### Motivation and Context This change is required because memory optimization during Compilation flow is too high. --------- Co-authored-by: saurabhkale17 <saurabh1.kale@intel.com> Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com> Co-authored-by: Vishnudas Thaniel S <vishnudas.thaniel.s@intel.com> Co-authored-by: Javier E. Martinez <javier.e.martinez@intel.com> Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com> Co-authored-by: ankitm3k <ankit.maheshkar@intel.com> Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>	2024-09-03 13:52:31 -07:00
Yulong Wang	bad00a3657	Add dependency dawn into deps.txt (#21910 ) ### Description Add dependency dawn into deps.txt. This is a preparation for introducing WebGPU EP.	2024-09-02 04:24:28 -07:00
aciddelgado	509cb54d6f	softcap gqa (#21683 ) ### Description Implement softcap for GQA. ### Motivation and Context Fixes certain models like Gemma-2, which need softcap so that they don't output NaNs.	2024-08-30 19:11:04 -07:00
Ranjit Ranjan	02e3a430af	[AIX] Python binding enablement and gcc support (#21934 ) ### Description Enabling python binding and gcc support for AIX. ### Motivation and Context Code changes in this PR contains: 1. python binding enablement 2. gcc building support Below are list of files and the description. 1. cmake/CMakeLists.txt [gcc building support] -no-unused-function compiler flag addition for IBMClang 2. cmake/external/eigen.cmake [gcc building support] AIX check for applying the AIX patch 3. cmake/onnxruntime_python.cmake [python binding ] putting NOT AIX check for -Xlinker 4. cmake/onnxruntime_unittests.cmake [gcc building support] Fix for gtest behavior. Check the comment . [python binding ] using -Wl,-brtl for linking onnxruntime_providers_shared in test_execution_provider 5. cmake/patches/eigen/eigen-aix.patch [gcc building support] In AIX gcc, we are hitting __builtin_cpu_supports("mma") which is not supported yet. So patching code for this method . Patched code will check for P10 Processor at run-time and based on that routine will be set. 6. onnxruntime/python/onnxruntime_validation.py [python binding ] Adding AIX check in check_distro_info() 7. onnxruntime/test/providers/cpu/generator/random_test.cc [gcc building support] updating previous check for AIX , along with clang. So in case of gcc, else block will hit. 8. onnxruntime/test/python/onnxruntime_test_python.py [python binding ] powerpc check on platform.processor() 9. setup.py [python binding ] Adding AIX check for list of libs.	2024-08-30 12:17:26 -07:00
Changming Sun	1f879c3282	Disable absl symbolize in Windows Release build (#21923 ) ### Description This change disables Abseil's symbolize functionality in Windows non-debug builds. ### Motivation and Context To solve #21826. Avoid having a dependency on dbghelp.dll.	2024-08-30 12:03:17 -07:00
mindest	bfa4da4f65	Add Linux ROCm CI Pipeline (#21798 ) ### Description * Add new ROCm CI pipeline (`Linux ROCm CI Pipeline`) focusing on inference. * Resolve test errors; disable flaky tests. based on test PR #21614.	2024-08-30 14:50:32 +08:00
Ye Wang	bf8855ba3c	Support Smooth Softmax in fmha (#21885 ) ### Description refer to https://github.com/microsoft/onnxruntime/pull/21867 ### Motivation and Context --------- Co-authored-by: Your Name <you@example.com>	2024-08-28 09:29:33 -07:00
mcollinswisc	5d54dc1462	Drop QDQ around more nodes (#21376 ) ### Description Extends the Drop QDQ optimization to remove DequantizeLinear and QuantizeLinear nodes from around operators: - Flatten - Expand - Tile - Slice - GatherElements - ReduceMin - ReduceMax ### Motivation and Context To reduce floating-point conversions in quantize inference. Mainly motivated by the Flatten case, since that will show up in graphs exported from PyTorch to ONNX. But to make the change complete, extending to a larger set of ops for which this optimization is valid. https://github.com/microsoft/onnxruntime/issues/21375 --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2024-08-27 16:54:37 +10:00
Guenther Schmuelling	ba7baae994	Revert "Upgrade emsdk from 3.1.59 to 3.1.62" (#21817 ) Reverts microsoft/onnxruntime#21421 Users are seeing chrome memory grow to 16GB before it crashes: https://github.com/microsoft/onnxruntime/issues/21810 Revert for now so we have time to debug.	2024-08-22 11:21:00 -07:00
Yueqing Zhang	3ff8ca29e5	[VitisAI] remove wrong error msg, required by Microsoft (#21715 ) ### Description Remove legacy code and wrong message. ### Motivation and Context This is required by Microsoft to remove unwanted error message. This is required for 8.15 release. Co-authored-by: Yueqing Zhang <yueqingz@amd.com>	2024-08-21 21:10:28 -07:00
Adrian Lizarraga	28c252c77e	[QNN EP] Fix compile error for QNN EP on Windows x64 due to missing /bigobj flag (#21795 ) ### Description Compiling onnxruntime with QNN EP on Windows x86_64 results in a compilation error: ```shell $ onnxruntime\test\optimizer\qdq_transformer_test.cc(1,1): error C1128: number of sections exceeded object file format limit: compile with /bigobj [...onnxruntime\build\Debug\onnxruntime_test_all.vcxproj] ``` This PR adds the `/bigobj` compilation flag for the `qdq_transformer_test.cc` file.	2024-08-20 10:11:43 -07:00
Tianlei Wu	fbc3927231	[CUDA] cuDNN Flash Attention (#21629 ) ### Description - [x] Add cuDNN flash attention using cudnn frontend, and enable it in MultiHeadAttention operator. - [x] Support attention mask. - [x] Support attention bias. - [x] Update tests and benchmark script. The cuDNN SDPA is disabled by default. To enable it, need the following: (1) Requires cuDNN 9.3 or newer version installed. (2) Set an environment variable `ORT_ENABLE_CUDNN_FLASH_ATTENTION=1` or set `sdpa_kernel=8` cuda provider option to enable it. (3) Only works on devices with compute capability >= 8.0. Note that some combinations of parameters might be rejected due to limited support of head dimension or sequence lengths. Future Works: (1) FP8 and BF16 APIs. Currently, only API for FP16 are exposed. (2) Add API to support ragged batching (padding removed in inputs). (3) Support other input formats (like QKV_BS3NH). (4) Currently, q are converted to BSNH, k/v are converted to either BSNH or BNSH format. May do some experiment to see whether converting q to BNSH could be better in some case. ### Example Benchmark Results on H100 The following tests are on FP16 MultiHeadAttention operator without attention mask and attention bias. 
#### Test Setting 1

| batch_size | sequence_length | past_sequence_length | num_heads | head_size |
|---|---|---|---|---|
| 16 | 256 | 0 | 32 | 128 |

| format | average_latency | tflops | kernel |
|---|---|---|---|
| Q,K,V (BNSH) | 0.000075 | 229.5 | torch:flash |
| Q,K,V (BNSH) | 0.000119 | 144.8 | torch:efficient |
| Q,K,V (BNSH) | 0.000224 | 76.5 | torch:math |
| Q,K,V (BSNH) | 0.000075 | 227.8 | ort:cudnn |
| Q,K,V (BSNH) | 0.000094 | 182.8 | ort:flash |
| Q,K,V (BSNH) | 0.000138 | 124.7 | ort:efficient |
| Q,K,V (BSNH) | 0.000438 | 39.3 | ort:math |
| Q,KV | 0.000129 | 133.0 | ort:cudnn |
| Q,KV | 0.000151 | 114.1 | ort:flash |
| Q,KV | 0.000194 | 88.5 | ort:efficient |
| QKV | 0.000154 | 111.8 | ort:cudnn |
| QKV | 0.000175 | 98.0 | ort:flash |
| QKV | 0.000217 | 79.0 | ort:efficient |

#### Test Setting 2

| batch_size | sequence_length | past_sequence_length | num_heads | head_size |
|---|---|---|---|---|
| 16 | 512 | 0 | 16 | 64 |

| format | average_latency | tflops | kernel |
|---|---|---|---|
| Q,K,V (BNSH) | 0.000069 | 249.2 | torch:flash |
| Q,K,V (BNSH) | 0.000141 | 121.7 | torch:efficient |
| Q,K,V (BNSH) | 0.000294 | 58.5 | torch:math |
| Q,K,V (BSNH) | 0.000077 | 221.7 | ort:cudnn |
| Q,K,V (BSNH) | 0.000087 | 196.6 | ort:flash |
| Q,K,V (BSNH) | 0.000163 | 105.6 | ort:efficient |
| Q,K,V (BSNH) | 0.000651 | 26.4 | ort:math |
| Q,KV | 0.000103 | 167.1 | ort:cudnn |
| Q,KV | 0.000117 | 146.3 | ort:flash |
| Q,KV | 0.000192 | 89.6 | ort:efficient |
| QKV | 0.000113 | 151.5 | ort:cudnn |
| QKV | 0.000128 | 134.7 | ort:flash |
| QKV | 0.000201 | 85.3 | ort:efficient |
	2024-08-20 08:50:22 -07:00
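The two ways of enabling cuDNN flash attention named in commit fbc3927231 (the `ORT_ENABLE_CUDNN_FLASH_ATTENTION=1` environment variable, or the `sdpa_kernel=8` CUDA provider option) might be wired up from Python roughly as follows. This is a sketch under assumptions: the helper name `cudnn_sdpa_config` is hypothetical, the option value is passed as a string, and the session creation is left as a comment because it needs an onnxruntime GPU build, cuDNN 9.3 or newer, and a device with compute capability >= 8.0.

```python
import os


def cudnn_sdpa_config(use_env_var=False):
    """Build a providers list and environment settings that request cuDNN SDPA.

    Hypothetical helper: only the variable name and the option name/value
    come from the commit message; the rest is illustrative.
    """
    env = {}
    cuda_options = {}
    if use_env_var:
        # Option (a): process-wide environment variable.
        env["ORT_ENABLE_CUDNN_FLASH_ATTENTION"] = "1"
    else:
        # Option (b): per-session CUDA provider option.
        cuda_options["sdpa_kernel"] = "8"
    providers = [("CUDAExecutionProvider", cuda_options), "CPUExecutionProvider"]
    return providers, env


providers, env = cudnn_sdpa_config()
os.environ.update(env)

# With a suitable onnxruntime GPU build installed, the list would be used as:
#   import onnxruntime as ort
#   session = ort.InferenceSession("model.onnx", providers=providers)
```

As the commit notes, some parameter combinations may still be rejected, so a session configured this way is not guaranteed to use the cuDNN kernel.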
jingyanwangms	c018ba43ef	[Running CI] [TensorRT EP] support TensorRT 10.3-GA (#21742 ) ### Description - TensorRT 10.2.0.19 -> 10.3.0.26 ### Motivation and Context	2024-08-18 13:26:41 -07:00
Scott McKay	c97cc5c1b0	Put all external project targets under the 'External' folder in VS (#21765 ) ### Description Handle targets in subdirectories for external projects. All targets will now go in a per-project folder under 'External' e.g. gmock and gtest now get handled correctly and are under External/googletest vs. existing setup where they ended up as top-level projects. ![image](https://github.com/user-attachments/assets/99ec259c-47cd-44f3-954d-58569c941cc2) ### Motivation and Context Improve developer experience.	2024-08-16 15:51:50 +10:00
Satya Kumar Jandhyala	6d8de1f7b8	Upgrade emsdk from 3.1.59 to 3.1.62 (#21421 ) ### Description Upgrade EM SDK to 3.1.62. ### Motivation and Context The changes are required to clear wasm64 errors.	2024-08-14 12:38:52 -07:00
Sumit Agarwal	c5592fdcef	[DML EP] Update DML to 1.15.1 (#21695 ) ### Description Update DML runtime binary to 1.15.1 ### Motivation and Context	2024-08-12 14:16:43 -07:00
Edward Chen	a5ce65d87a	Clean up some mobile package related files and their usages. (#21606 ) The mobile packages have been removed.	2024-08-05 16:38:20 -07:00
Po-Wei (Vincent)	2653226ed0	Fail tests gracefully for the minimal cuda build (#21391 ) ### Description Several tests result in segfaults during the minimal cuda build. Although test failures are expected due to the limitation of the minimal cuda EP, failing gracefully would be much preferred. ### Motivation and Context To reproduce: 1. Build ORT with: ```bash ./build.sh --build_shared_lib --use_full_protobuf --cuda_home /usr/local/cuda --cudnn_home /usr/lib/x86_64-linux-gnu/ --tensorrt_home /TensorRT-10.0.1.6 --parallel --skip_tests --skip_submodule_sync --allow_running_as_root --use_tensorrt --cmake_extra_defines onnxruntime_CUDA_MINIMAL=1 ``` 2. Run `onnxruntime_test_all` ```bash ... [----------] 1 test from AllocationPlannerTest [ RUN ] AllocationPlannerTest.ReusedInputCrossDifferentStreams Segmentation fault (core dumped) ```	2024-08-02 18:27:36 -07:00
Julius Tischbein	1391354265	Adding CUDNN Frontend and use for CUDA NN Convolution (#19470 ) ### Description Added CUDNN Frontend and used it for NHWC convolutions, and optionally fuse activation. #### Backward compatible - For model existed with FusedConv, model can still run. - If ORT is built with cuDNN 8, cuDNN frontend will not be built into binary. Old kernels (using cudnn backend APIs) are used. #### Major Changes - For cuDNN 9, we will enable cudnn frontend to fuse convolution and bias when a provider option `fuse_conv_bias=1`. - Remove the fusion of FusedConv from graph transformer for CUDA provider, so there will not be FusedConv be added to graph for CUDA EP in the future. - Update cmake files regarding to cudnn settings. The search order of CUDNN installation in build are like the following: * environment variable `CUDNN_PATH` * `onnxruntime_CUDNN_HOME` cmake extra defines. If a build starts from build.py/build.sh, user can pass it through `--cudnn_home` parameter, or by environment variable `CUDNN_HOME` if `--cudnn_home` not used. * cudnn python package installation directory like python3.xx/site-packages/nvidia/cudnn * CUDA installation path #### Potential Issues - If ORT is built with cuDNN 8, FusedConv fusion is no longer done automatically, so some model might have performance regression. If user still wants FusedConv operator for performance reason, they can still have multiple ways to walkaround: like use older version of onnxruntime; or use older version of ORT to save optimized onnx, then run with latest version of ORT. We believe that majority users have moved to cudnn 9 when 1.20 release (since the default in ORT and PyTorch is cudnn 9 for 3 months when 1.20 release), so the impact is small. - cuDNN graph uses TF32 by default, and user cannot disable TF32 through the use_tf32 cuda provider option. If user encounters accuracy issue (like in testing), user has to set environment variable `NVIDIA_TF32_OVERRIDE=0` to disable TF32. 
Need update the document of use_tf32 later. #### Follow ups This is one of PRs that target to enable NHWC convolution in CUDA EP by default if device supports it. There are other changes will follow up to make it possible. (1) Enable `prefer_nhwc` by default for device with sm >= 70. (2) Change `fuse_conv_bias=1` by default after more testing. (3) Add other NHWC operators (like Resize or UpSample). ### Motivation and Context The new CUDNN Frontend library provides the functionality to fuse operations and provides new heuristics for kernel selection. Here it fuses the convolution with the pointwise bias operation. On the [NVIDIA ResNet50](https://pytorch.org/hub/nvidia_deeplearningexamples_resnet50/) we get a performance boost from 49.1144 ms to 42.4643 ms per inference on a 2560x1440 input (`onnxruntime_perf_test -e cuda -I -q -r 100-d 1 -i 'prefer_nhwc\|1' resnet50.onnx`). --------- Co-authored-by: Tianlei Wu <tlwu@microsoft.com> Co-authored-by: Maximilian Mueller <maximilianm@nvidia.com>	2024-08-02 15:16:42 -07:00
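The cuDNN search order listed in commit 1391354265 can be pictured as a first-match lookup. The sketch below only mirrors the order stated in the message; it is not the logic of the actual cmake files, the function `resolve_cudnn_home` and its parameters are hypothetical, and it assumes that a candidate counts as found as soon as it is set (the real build may also check that the directory exists).

```python
def resolve_cudnn_home(env, cudnn_home_arg=None, python_pkg_dir=None, cuda_home=None):
    """Return (source, path) for the first cuDNN location that is set, else None.

    env             -- mapping of environment variables
    cudnn_home_arg  -- value of --cudnn_home, if given to build.py/build.sh
    python_pkg_dir  -- cudnn python package directory, if installed
    cuda_home       -- CUDA installation path
    """
    candidates = [
        # 1. Environment variable CUDNN_PATH.
        ("CUDNN_PATH", env.get("CUDNN_PATH")),
        # 2. onnxruntime_CUDNN_HOME, fed by --cudnn_home, or by the
        #    CUDNN_HOME environment variable when --cudnn_home is not used.
        ("onnxruntime_CUDNN_HOME", cudnn_home_arg or env.get("CUDNN_HOME")),
        # 3. cudnn python package installation directory.
        ("python package", python_pkg_dir),
        # 4. CUDA installation path.
        ("CUDA installation", cuda_home),
    ]
    for source, path in candidates:
        if path:
            return source, path
    return None


print(resolve_cudnn_home({"CUDNN_HOME": "/opt/cudnn"}, cuda_home="/usr/local/cuda"))
```

Keeping the candidates in one ordered list makes the precedence explicit: an earlier entry always shadows a later one.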
liqun Fu	b87e8edb98	Mlas int4 int8 with avx2/512 (#20687 ) ### Description model: phi-3-mini-4k-instruct avx2 symmetric blklen\|updated prompt tps \| baseline prompt tps \| prompt tps change%\|updated token gen tps \| baseline token gen tps \| token gen change% -\|-\|-\|-\|-\|-\|- 16 \|49.5\|70.0\|-29.2%\|9.6\|10.8\|-34.2% 32 \|76.8\|52.4\|9.7%\|15.2\|14.6\|4.1% 64 \|78.2\|71.4\|9.5%\|16.6\|16.3\|1.8% 128 \|72.9\|70.6\|3.2%\|17.1\|16.8\|1.7% 256 \|83.7\|63.6\|31.6%\|18.1\|17.4\|4% avx2 asymmetric blklen\|updated prompt tps \| baseline prompt tps \| prompt tps change%\|updated token gen tps \| baseline token gen tps \| token gen change% -\|-\|-\|-\|-\|-\|- 16 \|50.7\|61.5\|-17.5%\|9.6\|9.2\|4.3% 32 \|77.4\|52.4\|47.7%\|14.6\|13.9\|5.0% 64 \|78.7\|63.0\|24.9%\|16.2\|15.9\|1.8% 128 \|80.0\|61.9\|29.2%\|17.2\|16.9\|1.7% 256 \|81.5\|63.3\|28.7%\|17.9\|17.3\|3.4% avx2vnni symmetric blklen\|updated prompt tps \| baseline prompt tps \| prompt tps change%\|updated token gen tps \| baseline token gen tps \| token gen change% -\|-\|-\|-\|-\|-\|- 16 \|82.9\|117.0\|-29.0%\|15.9\|19.3\|-17.6% 32 \|133.0\|100.4\|32.4%\|26.1\|24.5\|6.5% 64 \|166.9\|118.8\|40.4%\|28.3\|27.1\|4.4% 128 \|165.9\|119.6\|38.7%\|29.3\|28.5\|2.8% 256 \|165.2\|119.6\|38.1%\|30.2\|29.0\|4.1% avx2vnni asymmetric blklen\|updated prompt tps \| baseline prompt tps \| prompt tps change%\|updated token gen tps \| baseline token gen tps \| token gen change% -\|-\|-\|-\|-\|-\|- 16 \|80.2\|118.9\|-32.5%\|15.1\|16.7\|-9.5% 32 \|130.7\|99.7\|31.0%\|25.0\|23.8\|5.0% 64 \|168.7\|124.9\|35.0%\|27.3\|26.8\|1.8% 128 \|169.6\|123.8\|36.9%\|29.2\|27.9\|4.6% 256 \|175.0\|125.7\|39.0%\|30.0\|29.7\|1.0% avx512 symmetric blklen\|updated prompt tps \| baseline prompt tps \| prompt tps change%\|updated token gen tps \| baseline token gen tps \| token gen change% -\|-\|-\|-\|-\|-\|- 16 \|135.2\|156.5\|-13.6%\|25.5\|23.8\|7.1% 32 \|150.0\|159.5\|-5.9%\|34.9\|29.6\|17.9% 64 \|167.5\|157.5\|6.3%\|39.7\|34.4\|15.4% 128 \|177.8\|158.0\|12.5%\|40.3\|35.4\|13.8% 256 \|182.6\|157.3\|16.0%\|41.7\|37.7\|10.6% avx512 asymmetric blklen\|updated prompt tps \| baseline prompt tps \| prompt tps change%\|updated token gen tps \| baseline token gen tps \| token gen change% -\|-\|-\|-\|-\|-\|- 16 \|136.1\|151.4\|-10.1%\|26.1\|19.9\|31.1% 32 \|150.0\|157.8\|-4.9%\|34.3\|29.3\|17.0% 64 \|165.7\|156.6\|5.8%\|38.7\|30.7\|26.0% 128 \|180.4\|156.6\|15.1%\|40.2\|34.7\|15.8% 256 \|181.3\|158.0\|14.7%\|41.6\|36.6\|13.6% avx512vnni symmetric blklen\|updated prompt tps \| baseline prompt tps \| prompt tps change%\|updated token gen tps \| baseline token gen tps \| token gen change% -\|-\|-\|-\|-\|-\|- 16 \|143.4\|155.4\|-7.7%\|25.6\|23.3\|9.8% 32 \|159.2\|157.0\|1.4%\|34.1\|29.8\|14.4% 64 \|182.0\|159.5\|14.1%\|38.4\|34.8\|10.3% 128 \|221.2\|160.8\|37.5%\|41.0\|36.4\|12.6% 256 \|250.5\|162.4\|54.2%\|41.6\|37.7\|10.3% avx512vnni asymmetric blklen\|updated prompt tps \| baseline prompt tps \| prompt tps change%\|updated token gen tps \| baseline token gen tps \| token gen change% -\|-\|-\|-\|-\|-\|- 16 \|142.5\|152.3\|-6.4%\|26.3\|19.7\|33.5% 32 \|158.2\|155.0\|2.0%\|34.3\|29.2\|17.4% 64 \|184.1\|156.6\|17.5%\|38.3\|30.9\|23.9% 128 \|215.8\|156.1\|17.5%\|41.3\|35.0\|17.9% 256 \|249.2\|155.9\|59.8%\|41.1\|36.3\|13.2% 4-bit GEMM implementation with AVX using tiles. 1. The tile size is 2 blks by 4. When the remaining size is less than a tile, it reduces to 1 blk by 4, 2 blks by 1, and lastly 1 blk by 1. Within the internal kernel, weights and activations are loaded based on SIMD register width and blk length: avx2 (256-bit register), 64 weights and activations are loaded. blklen16: 4 blks are computed by the internal kernel blklen32: 2 blks are computed by the internal kernel blklen64: 1 blk is computed by the internal kernel blklen128: 1 blk takes 2 passes of the internal kernel blklen256: 1 blk takes 4 passes of the internal kernel avx512 (512-bit register), 128 weights and activations are loaded. blklen16: 8 blks are computed by the internal kernel blklen32: 4 blks are computed by the internal kernel blklen64: 2 blks are computed by the internal kernel blklen128: 1 blk is computed by the internal kernel blklen256: 1 blk takes 2 passes of the internal kernel 2. blksum is precomputed during prepacking. The computation is reformulated as: Sum1(scale_a * scale_b * Sum_blk(a_i * b_i)) + Sum2(blksum_a * blksum_b), where Sum_blk is over one blk, Sum1 is over all blks for one output, and Sum2 is over all blks for one output. Sum2 is computed with sgemm in the current implementation. Further improvement is possible. --------- Signed-off-by: Liqun Fu <liqfu@microsoft.com> Signed-off-by: liqunfu <liqun.fu@microsoft.com> Signed-off-by: Liqun Fu <liqun_fu@hotmail.com>	2024-08-02 10:20:22 -07:00
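The blksum reformulation in the commit message above can be illustrated with a small pure-Python sketch. It is not the MLAS kernel. It assumes symmetric int8 quantization of activations (`a ≈ scale_a * q_a`) and asymmetric 4-bit quantization of weights (`b ≈ scale_b * (q_b - zp_b)`), and it assumes `blksum_a = scale_a * sum(q_a)` and `blksum_b = -scale_b * zp_b`. These definitions and all function names are illustrative assumptions, not taken from the commit. Under them, the block-wise dot product splits into the two sums named in the message.

```python
import math
import random


def quantize_activation_block(x):
    """Symmetric int8 quantization of one block: x ~= scale * q (assumed scheme)."""
    amax = max(abs(v) for v in x)
    scale = amax / 127.0 if amax > 0 else 1.0
    q = [round(v / scale) for v in x]
    return scale, q


def quantize_weight_block(w):
    """Asymmetric 4-bit quantization of one block: w ~= scale * (q - zp) (assumed scheme)."""
    lo, hi = min(min(w), 0.0), max(max(w), 0.0)
    scale = (hi - lo) / 15.0 if hi > lo else 1.0
    zp = round(-lo / scale)
    q = [min(15, max(0, round(v / scale) + zp)) for v in w]
    return scale, zp, q


def blocked_dot(a, b, blklen):
    """Return (reformulated, direct) dot products of the quantized vectors."""
    sum1 = 0.0    # Sum1(scale_a * scale_b * Sum_blk(a_i * b_i))
    sum2 = 0.0    # Sum2(blksum_a * blksum_b)
    direct = 0.0  # reference: dequantize every element, then multiply
    for start in range(0, len(a), blklen):
        scale_a, qa = quantize_activation_block(a[start:start + blklen])
        scale_b, zp_b, qb = quantize_weight_block(b[start:start + blklen])

        direct += sum((scale_a * x) * (scale_b * (y - zp_b)) for x, y in zip(qa, qb))

        # Integer-only inner product over one block, scaled once per block.
        sum1 += scale_a * scale_b * sum(x * y for x, y in zip(qa, qb))

        # Per-block sums; the weight-side term depends only on the weights,
        # so it could be computed once ahead of time (assumed definitions).
        blksum_a = scale_a * sum(qa)
        blksum_b = -scale_b * zp_b
        sum2 += blksum_a * blksum_b
    return sum1 + sum2, direct


if __name__ == "__main__":
    rng = random.Random(0)
    a = [rng.uniform(-1, 1) for _ in range(128)]
    b = [rng.uniform(-1, 1) for _ in range(128)]
    reformed, direct = blocked_dot(a, b, blklen=32)
    print(reformed, direct)
```

In this sketch the zero-point correction is moved out of the inner loop: the inner product works on raw quantized integers, and the correction becomes one multiply per block. Summed over all blocks it is a plain float dot product, which is why a GEMM routine can handle that term.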
Changming Sun	25722bb9e3	Add CUDA custom op header files to Linux tarball (#21551 ) ### Description The header files were added in PR #16454. PR #21464 recently changed how the Linux tarballs are packed, and the new tarball is missing the custom op header files. This change adds them back.	2024-08-01 04:23:02 -07:00
Yifan Li	5d78b9a17b	[TensorRT EP] Update TRT OSS Parser to 10.2 (#21552 ) ### Description Update TRT OSS Parser to the [latest 10.2-GA branch](`f161f95883`)	2024-07-29 17:27:38 -07:00
Yulong Wang	b03c9496aa	[js/web] allow load WebAssembly binary from buffer (#21534 ) ### Description This PR adds a new option `ort.env.wasm.wasmBinary`, which allows users to supply a buffer containing the preloaded .wasm file content. This PR should resolve the problem raised in the latest discussion in #20876.	2024-07-29 13:39:38 -07:00
liqun Fu	a4d3a1ce0c	pick changes from https://github.com/onnx/onnx/pull/6195 to fix heap-buffer-overflow in onnx::convPoolShapeInference (#21507 ) ### Description ONNX 1.16.2 will not be available before the ORT 1.19.0 code freeze, so the needed change is picked as a patch.	2024-07-27 15:58:36 -07:00

1750 commits