onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-05-29 23:06:41 +00:00

Author	SHA1	Message	Date
0xdr3dd	2dae8aaced	[Fuzzer] Add fuzzer support for linux (#21996 ) ### Description Changed the fuzzer project code so that it also supports Linux. How to test on Linux: 1. Make sure clang/llvm is installed. 2. Run the command below to build the ASan-instrumented project: ``` CFLAGS="-g -fsanitize=address -shared-libasan -fprofile-instr-generate -fcoverage-mapping" CXXFLAGS="-g -shared-libasan -fsanitize=address -fprofile-instr-generate -fcoverage-mapping" CC=clang CXX=clang++ ./build.sh --update --build --config Debug --compile_no_warning_as_error --build_shared_lib --skip_submodule_sync --skip_tests --use_full_protobuf --parallel --fuzz_testing --build_dir build/ ``` 3. Run the fuzzer for some time; it will generate a .profraw file: ``` LLVM_PROFILE_FILE="%p.profraw" ./build/Debug/onnxruntime_security_fuzz /t /v onnxruntime/test/testdata/bart_tiny.onnx 1 m ``` 4. Get the coverage by running the commands below: ``` llvm-profdata merge -sparse *.profraw -o default.profdata llvm-cov report ./build/Debug/onnxruntime_security_fuzz -instr-profile=default.profdata ``` ### Motivation and Context 1. Currently the fuzzer only supports Windows and MSVC, and code coverage cannot be generated with MSVC. With clang/llvm we can use clang instrumentation and LLVM tools such as llvm-cov. 2. In the future we can add a coverage-guided fuzzer (libFuzzer) to the same project. (Working on it)	2024-09-05 11:52:15 -07:00
Hector Li	190588bb64	Enable QNN weight sharing (#21077 ) ### Description Enable QNN weight sharing across graphs in single context Create tool to generate QNN context cache model with weight sharing enabled.	2024-09-04 11:20:33 -07:00
sfatimar	8dba8e3e24	Memory Optimization for Compilation in OVEP (#21872 ) Call the split API (Read+Model) in lieu of the unified Compile Model call in the export compile flow to reduce memory usage. Free the model proto and serialized string, and read the OV IR model later, to free up memory for the rest of the pipeline. Optimization during the EpCtxt flow: all Graph-related operations require all the Node Attributes to be set when dealing with model instances internally; in the existing implementation these attributes are copied when a Graph is constructed dynamically at runtime. This change uses the attributes in place, without creating a copy, to avoid memory allocation and copying when calling these Graph-related functions. Includes bug fixes related to the OpenVINO version and the EpCtxt file path. Moves the compiler version to C++20 to benefit from r-value memory optimizations. ### Motivation and Context This change is required because memory consumption during the compilation flow is too high. --------- Co-authored-by: saurabhkale17 <saurabh1.kale@intel.com> Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com> Co-authored-by: Vishnudas Thaniel S <vishnudas.thaniel.s@intel.com> Co-authored-by: Javier E. Martinez <javier.e.martinez@intel.com> Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com> Co-authored-by: ankitm3k <ankit.maheshkar@intel.com> Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>	2024-09-03 13:52:31 -07:00
Yulong Wang	bad00a3657	Add dependency dawn into deps.txt (#21910 ) ### Description Add dependency dawn into deps.txt. This is a preparation for introducing WebGPU EP.	2024-09-02 04:24:28 -07:00
aciddelgado	509cb54d6f	softcap gqa (#21683 ) ### Description Implement softcap for gqa. ### Motivation and Context Fixes certain models, such as Gemma-2, that need softcap so that they don't output NaNs.	2024-08-30 19:11:04 -07:00
Ranjit Ranjan	02e3a430af	[AIX] Python binding enablement and gcc support (#21934 ) ### Description Enabling python binding and gcc support for AIX. ### Motivation and Context Code changes in this PR contain: 1. python binding enablement 2. gcc building support Below is the list of files with a description of each change. 1. cmake/CMakeLists.txt [gcc building support] -no-unused-function compiler flag addition for IBMClang 2. cmake/external/eigen.cmake [gcc building support] AIX check for applying the AIX patch 3. cmake/onnxruntime_python.cmake [python binding ] putting NOT AIX check for -Xlinker 4. cmake/onnxruntime_unittests.cmake [gcc building support] Fix for gtest behavior. Check the comment. [python binding ] using -Wl,-brtl for linking onnxruntime_providers_shared in test_execution_provider 5. cmake/patches/eigen/eigen-aix.patch [gcc building support] In AIX gcc, we are hitting __builtin_cpu_supports("mma"), which is not supported yet, so the code for this method is patched. The patched code checks for a P10 processor at run time and sets the routine based on that. 6. onnxruntime/python/onnxruntime_validation.py [python binding ] Adding AIX check in check_distro_info() 7. onnxruntime/test/providers/cpu/generator/random_test.cc [gcc building support] updating the previous check for AIX to also check for clang, so in the case of gcc the else block is hit. 8. onnxruntime/test/python/onnxruntime_test_python.py [python binding ] powerpc check on platform.processor() 9. setup.py [python binding ] Adding AIX check for list of libs.	2024-08-30 12:17:26 -07:00
Changming Sun	1f879c3282	Disable absl symbolize in Windows Release build (#21923 ) ### Description This change disables Abseil's symbolize functionality in Windows non-debug builds. ### Motivation and Context To solve #21826. Avoid having a dependency on dbghelp.dll.	2024-08-30 12:03:17 -07:00
mindest	bfa4da4f65	Add Linux ROCm CI Pipeline (#21798 ) ### Description * Add new ROCm CI pipeline (`Linux ROCm CI Pipeline`) focusing on inference. * Resolve test errors; disable flaky tests. Based on test PR #21614.	2024-08-30 14:50:32 +08:00
Ye Wang	bf8855ba3c	Support Smooth Softmax in fmha (#21885 ) ### Description refer to https://github.com/microsoft/onnxruntime/pull/21867 --------- Co-authored-by: Your Name <you@example.com>	2024-08-28 09:29:33 -07:00
mcollinswisc	5d54dc1462	Drop QDQ around more nodes (#21376 ) ### Description Extends the Drop QDQ optimization to remove DequantizeLinear and QuantizeLinear nodes from around operators: - Flatten - Expand - Tile - Slice - GatherElements - ReduceMin - ReduceMax ### Motivation and Context To reduce floating-point conversions in quantize inference. Mainly motivated by the Flatten case, since that will show up in graphs exported from PyTorch to ONNX. But to make the change complete, extending to a larger set of ops for which this optimization is valid. https://github.com/microsoft/onnxruntime/issues/21375 --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2024-08-27 16:54:37 +10:00
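Why removing the DequantizeLinear/QuantizeLinear pair is valid around these operators can be illustrated with a small pure-Python sketch. It assumes a simple per-tensor affine quantization with the same positive scale and zero point on both sides of the operator; the helper names (`quantize`, `dequantize`) and all values are made up for illustration and are not ONNX Runtime APIs.

```python
# Illustrative only: helper names and values are hypothetical, not ONNX Runtime APIs.
SCALE = 0.5          # assumed per-tensor scale (must be positive)
ZERO_POINT = 3       # assumed per-tensor zero point
QMIN, QMAX = 0, 255  # uint8 range


def quantize(values):
    """QuantizeLinear-style mapping: float -> clamped integer."""
    return [min(QMAX, max(QMIN, round(v / SCALE) + ZERO_POINT)) for v in values]


def dequantize(values):
    """DequantizeLinear-style mapping: integer -> float."""
    return [(q - ZERO_POINT) * SCALE for q in values]


q_input = [3, 10, 0, 255, 42, 7]

# Slice only selects elements, so DQ -> Slice -> Q gives the same integers as Slice alone.
with_qdq = quantize(dequantize(q_input)[1:4])
without_qdq = q_input[1:4]

# ReduceMax also commutes because dequantization is monotonic when SCALE > 0.
max_with_qdq = quantize([max(dequantize(q_input))])[0]
max_without_qdq = max(q_input)
```

In both cases the integer result is identical with and without the floating-point round trip, which is the conversion the optimization avoids.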
Guenther Schmuelling	ba7baae994	Revert "Upgrade emsdk from 3.1.59 to 3.1.62" (#21817 ) Reverts microsoft/onnxruntime#21421 Users are seeing chrome memory grow to 16GB before it crashes: https://github.com/microsoft/onnxruntime/issues/21810 Revert for now so we have time to debug.	2024-08-22 11:21:00 -07:00
Yueqing Zhang	3ff8ca29e5	[VitisAI] remove wrong error msg, required by Microsoft (#21715 ) ### Description Remove legacy code and a wrong message. ### Motivation and Context This is required by Microsoft to remove an unwanted error message. This is required for the 8.15 release. Co-authored-by: Yueqing Zhang <yueqingz@amd.com>	2024-08-21 21:10:28 -07:00
Adrian Lizarraga	28c252c77e	[QNN EP] Fix compile error for QNN EP on Windows x64 due to missing /bigobj flag (#21795 ) ### Description Compiling onnxruntime with QNN EP on Windows x86_64 results in a compilation error: ```shell $ onnxruntime\test\optimizer\qdq_transformer_test.cc(1,1): error C1128: number of sections exceeded object file format limit: compile with /bigobj [...onnxruntime\build\Debug\onnxruntime_test_all.vcxproj] ``` This PR adds the `/bigobj` compilation flag for the `qdq_transformer_test.cc` file.	2024-08-20 10:11:43 -07:00
Tianlei Wu	fbc3927231	[CUDA] cuDNN Flash Attention (#21629 ) ### Description - [x] Add cuDNN flash attention using cudnn frontend, and enable it in MultiHeadAttention operator. - [x] Support attention mask. - [x] Support attention bias. - [x] Update tests and benchmark script. The cuDNN SDPA is disabled by default. Enabling it requires the following: (1) cuDNN 9.3 or newer is installed. (2) The environment variable `ORT_ENABLE_CUDNN_FLASH_ATTENTION=1` or the cuda provider option `sdpa_kernel=8` is set. (3) The device has compute capability >= 8.0. Note that some combinations of parameters might be rejected due to limited support of head dimension or sequence lengths. Future Work: (1) FP8 and BF16 APIs. Currently, only the API for FP16 is exposed. (2) Add an API to support ragged batching (padding removed in inputs). (3) Support other input formats (like QKV_BS3NH). (4) Currently, q is converted to BSNH, and k/v are converted to either BSNH or BNSH format. Some experiments may show whether converting q to BNSH could be better in some cases. ### Example Benchmark Results on H100 The following tests are on the FP16 MultiHeadAttention operator without attention mask and attention bias. 
#### Test Setting 1 batch_size \| sequence_length \| past_sequence_length \| num_heads \| head_size -- \| -- \| -- \| -- \| -- 16 \| 256 \| 0 \| 32 \| 128 format \| average_latency \| tflops \| kernel -- \| -- \| -- \| -- Q,K,V (BNSH) \| 0.000075 \| 229.5 \| torch:flash Q,K,V (BNSH) \| 0.000119 \| 144.8 \| torch:efficient Q,K,V (BNSH) \| 0.000224 \| 76.5 \| torch:math Q,K,V (BSNH) \| 0.000075 \| 227.8 \| ort:cudnn Q,K,V (BSNH) \| 0.000094 \| 182.8 \| ort:flash Q,K,V (BSNH) \| 0.000138 \| 124.7 \| ort:efficient Q,K,V (BSNH) \| 0.000438 \| 39.3 \| ort:math Q,KV \| 0.000129 \| 133.0 \| ort:cudnn Q,KV \| 0.000151 \| 114.1 \| ort:flash Q,KV \| 0.000194 \| 88.5 \| ort:efficient QKV \| 0.000154 \| 111.8 \| ort:cudnn QKV \| 0.000175 \| 98.0 \| ort:flash QKV \| 0.000217 \| 79.0 \| ort:efficient #### Test Setting 2 batch_size \| sequence_length \| past_sequence_length \| num_heads \| head_size -- \| -- \| -- \| -- \| -- 16 \| 512 \| 0 \| 16 \| 64 format \| average_latency \| tflops \| kernel -- \| -- \| -- \| -- Q,K,V (BNSH) \| 0.000069 \| 249.2 \| torch:flash Q,K,V (BNSH) \| 0.000141 \| 121.7 \| torch:efficient Q,K,V (BNSH) \| 0.000294 \| 58.5 \| torch:math Q,K,V (BSNH) \| 0.000077 \| 221.7 \| ort:cudnn Q,K,V (BSNH) \| 0.000087 \| 196.6 \| ort:flash Q,K,V (BSNH) \| 0.000163 \| 105.6 \| ort:efficient Q,K,V (BSNH) \| 0.000651 \| 26.4 \| ort:math Q,KV \| 0.000103 \| 167.1 \| ort:cudnn Q,KV \| 0.000117 \| 146.3 \| ort:flash Q,KV \| 0.000192 \| 89.6 \| ort:efficient QKV \| 0.000113 \| 151.5 \| ort:cudnn QKV \| 0.000128 \| 134.7 \| ort:flash QKV \| 0.000201 \| 85.3 \| ort:efficient	2024-08-20 08:50:22 -07:00
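The tflops column in the tables above appears consistent with counting 4 * batch_size * sequence_length^2 * num_heads * head_size floating-point operations per run (two matrix multiplications at 2 flops per multiply-add, with past_sequence_length of 0). That formula is inferred from the reported numbers and is an assumption, not something the commit states; the function name below is hypothetical.

```python
def attention_tflops(batch_size, sequence_length, num_heads, head_size, latency_seconds):
    """Estimate TFLOPS, assuming two matmuls (QK^T and PV) at 2 flops per multiply-add."""
    flops = 4 * batch_size * sequence_length * sequence_length * num_heads * head_size
    return flops / latency_seconds / 1e12


# ort:cudnn rows for Q,K,V (BSNH); latencies are the rounded values from the tables.
setting_1 = attention_tflops(16, 256, 32, 128, 0.000075)
setting_2 = attention_tflops(16, 512, 16, 64, 0.000077)
print(f"setting 1: {setting_1:.1f} TFLOPS, setting 2: {setting_2:.1f} TFLOPS")
```

Because the latencies in the tables are rounded to six decimals, the estimates land within about one percent of the reported 227.8 and 221.7 rather than matching exactly.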
jingyanwangms	c018ba43ef	[Running CI] [TensorRT EP] support TensorRT 10.3-GA (#21742 ) ### Description - TensorRT 10.2.0.19 -> 10.3.0.26	2024-08-18 13:26:41 -07:00
Scott McKay	c97cc5c1b0	Put all external project targets under the 'External' folder in VS (#21765 ) ### Description Handle targets in subdirectories for external projects. All targets will now go in a per-project folder under 'External' e.g. gmock and gtest now get handled correctly and are under External/googletest vs. existing setup where they ended up as top-level projects. ### Motivation and Context Improve developer experience.	2024-08-16 15:51:50 +10:00
Satya Kumar Jandhyala	6d8de1f7b8	Upgrade emsdk from 3.1.59 to 3.1.62 (#21421 ) ### Description Upgrade EM SDK to 3.1.62. ### Motivation and Context The changes are required to clear wasm64 errors.	2024-08-14 12:38:52 -07:00
Sumit Agarwal	c5592fdcef	[DML EP] Update DML to 1.15.1 (#21695 ) ### Description Update DML runtime binary to 1.15.1	2024-08-12 14:16:43 -07:00
Edward Chen	a5ce65d87a	Clean up some mobile package related files and their usages. (#21606 ) The mobile packages have been removed.	2024-08-05 16:38:20 -07:00
Po-Wei (Vincent)	2653226ed0	Fail tests gracefully for the minimal cuda build (#21391 ) ### Description Several tests result in segfaults during the minimal cuda build. Although test failures are expected due to the limitation of the minimal cuda EP, failing gracefully would be much preferred. ### Motivation and Context To reproduce: 1. Build ORT with: ```bash ./build.sh --build_shared_lib --use_full_protobuf --cuda_home /usr/local/cuda --cudnn_home /usr/lib/x86_64-linux-gnu/ --tensorrt_home /TensorRT-10.0.1.6 --parallel --skip_tests --skip_submodule_sync --allow_running_as_root --use_tensorrt --cmake_extra_defines onnxruntime_CUDA_MINIMAL=1 ``` 2. Run `onnxruntime_test_all` ```bash ... [----------] 1 test from AllocationPlannerTest [ RUN ] AllocationPlannerTest.ReusedInputCrossDifferentStreams Segmentation fault (core dumped) ```	2024-08-02 18:27:36 -07:00
Julius Tischbein	1391354265	Adding CUDNN Frontend and use for CUDA NN Convolution (#19470 ) ### Description Added CUDNN Frontend and used it for NHWC convolutions, optionally fusing the activation. #### Backward compatible - Existing models that contain FusedConv can still run. - If ORT is built with cuDNN 8, the cuDNN frontend will not be built into the binary. Old kernels (using cudnn backend APIs) are used. #### Major Changes - For cuDNN 9, we will enable the cudnn frontend to fuse convolution and bias when the provider option `fuse_conv_bias=1` is set. - Remove the FusedConv fusion from the graph transformer for the CUDA provider, so FusedConv will no longer be added to the graph for CUDA EP in the future. - Update cmake files regarding cudnn settings. The search order for the CUDNN installation in the build is as follows: * environment variable `CUDNN_PATH` * `onnxruntime_CUDNN_HOME` cmake extra defines. If a build starts from build.py/build.sh, the user can pass it through the `--cudnn_home` parameter, or through the environment variable `CUDNN_HOME` if `--cudnn_home` is not used. * cudnn python package installation directory like python3.xx/site-packages/nvidia/cudnn * CUDA installation path #### Potential Issues - If ORT is built with cuDNN 8, FusedConv fusion is no longer done automatically, so some models might have a performance regression. If the user still wants the FusedConv operator for performance reasons, there are several ways to work around this: use an older version of onnxruntime, or use an older version of ORT to save the optimized onnx model and then run it with the latest version of ORT. We believe that the majority of users will have moved to cudnn 9 by the 1.20 release (since cudnn 9 will have been the default in ORT and PyTorch for 3 months by then), so the impact is small. - cuDNN graph uses TF32 by default, and the user cannot disable TF32 through the use_tf32 cuda provider option. If the user encounters an accuracy issue (like in testing), the user has to set the environment variable `NVIDIA_TF32_OVERRIDE=0` to disable TF32. 
The documentation of use_tf32 needs to be updated later. #### Follow ups This is one of the PRs that aim to enable NHWC convolution in CUDA EP by default if the device supports it. Other changes will follow to make that possible. (1) Enable `prefer_nhwc` by default for devices with sm >= 70. (2) Change to `fuse_conv_bias=1` by default after more testing. (3) Add other NHWC operators (like Resize or UpSample). ### Motivation and Context The new CUDNN Frontend library provides the functionality to fuse operations and provides new heuristics for kernel selection. Here it fuses the convolution with the pointwise bias operation. On the [NVIDIA ResNet50](https://pytorch.org/hub/nvidia_deeplearningexamples_resnet50/) we get a performance boost from 49.1144 ms to 42.4643 ms per inference on a 2560x1440 input (`onnxruntime_perf_test -e cuda -I -q -r 100 -d 1 -i 'prefer_nhwc|1' resnet50.onnx`). --------- Co-authored-by: Tianlei Wu <tlwu@microsoft.com> Co-authored-by: Maximilian Mueller <maximilianm@nvidia.com>	2024-08-02 15:16:42 -07:00
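The cuDNN search order listed in this commit can be sketched as a small lookup function. This is only an illustration of the described priority, not the actual cmake logic; the function name, its parameters, and the injectable `exists` check are hypothetical.

```python
import os


def find_cudnn_home(env, cudnn_home_arg=None, site_packages=None, cuda_home=None,
                    exists=os.path.isdir):
    """Return the first candidate cuDNN directory that exists, in the described order."""
    candidates = [
        # 1. Environment variable CUDNN_PATH.
        env.get("CUDNN_PATH"),
        # 2. onnxruntime_CUDNN_HOME: --cudnn_home if given, else the CUDNN_HOME variable.
        cudnn_home_arg or env.get("CUDNN_HOME"),
        # 3. cudnn python package installation directory.
        os.path.join(site_packages, "nvidia", "cudnn") if site_packages else None,
        # 4. CUDA installation path.
        cuda_home,
    ]
    for path in candidates:
        if path and exists(path):
            return path
    return None


# Example with a fake existence check so it runs on any machine.
chosen = find_cudnn_home({"CUDNN_HOME": "/opt/cudnn"}, cuda_home="/usr/local/cuda",
                         exists=lambda path: True)
```

Passing `exists` as a parameter keeps the sketch testable without a real cuDNN installation.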
liqun Fu	b87e8edb98	Mlas int4 int8 with avx2/512 (#20687 ) ### Description model: phi-3-mini-4k-instruct avx2 symmetric blklen\|updated prompt tps \| baseline prompt tps \| prompt tps change%\|updated token gen tps \| baseline token gen tps \| token gen change% -\|-\|-\|-\|-\|-\|- 16 \|49.5\|70.0\|-29.2%\|9.6\|10.8\|-34.2% 32 \|76.8\|52.4\|9.7%\|15.2\|14.6\|4.1% 64 \|78.2\|71.4\|9.5%\|16.6\|16.3\|1.8% 128 \|72.9\|70.6\|3.2%\|17.1\|16.8\|1.7% 256 \|83.7\|63.6\|31.6%\|18.1\|17.4\|4% avx2 asymmetric blklen\|updated prompt tps \| baseline prompt tps \| prompt tps change%\|updated token gen tps \| baseline token gen tps \| token gen change% -\|-\|-\|-\|-\|-\|- 16 \|50.7\|61.5\|-17.5%\|9.6\|9.2\|4.3% 32 \|77.4\|52.4\|47.7%\|14.6\|13.9\|5.0% 64 \|78.7\|63.0\|24.9%\|16.2\|15.9\|1.8% 128 \|80.0\|61.9\|29.2%\|17.2\|16.9\|1.7% 256 \|81.5\|63.3\|28.7%\|17.9\|17.3\|3.4% avx2vnni symmetric blklen\|updated prompt tps \| baseline prompt tps \| prompt tps change%\|updated token gen tps \| baseline token gen tps \| token gen change% -\|-\|-\|-\|-\|-\|- 16 \|82.9\|117.0\|-29.0%\|15.9\|19.3\|-17.6% 32 \|133.0\|100.4\|32.4%\|26.1\|24.5\|6.5% 64 \|166.9\|118.8\|40.4%\|28.3\|27.1\|4.4% 128 \|165.9\|119.6\|38.7%\|29.3\|28.5\|2.8% 256 \|165.2\|119.6\|38.1%\|30.2\|29.0\|4.1% avx2vnni asymmetric blklen\|updated prompt tps \| baseline prompt tps \| prompt tps change%\|updated token gen tps \| baseline token gen tps \| token gen change% -\|-\|-\|-\|-\|-\|- 16 \|80.2\|118.9\|-32.5%\|15.1\|16.7\|-9.5% 32 \|130.7\|99.7\|31.0%\|25.0\|23.8\|5.0% 64 \|168.7\|124.9\|35.0%\|27.3\|26.8\|1.8% 128 \|169.6\|123.8\|36.9%\|29.2\|27.9\|4.6% 256 \|175.0\|125.7\|39.0%\|30.0\|29.7\|1.0% avx512 symmetric blklen\|updated prompt tps \| baseline prompt tps \| prompt tps change%\|updated token gen tps \| baseline token gen tps \| token gen change% -\|-\|-\|-\|-\|-\|- 16 \|135.2\|156.5\|-13.6\|25.5\|23.8\|7.1 32 \|150.0\|159.5\|-5.9\|34.9\|29.6\|17.9 64 \|167.5\|157.5\|6.3\|39.7\|34.4\|15.4 128 
\|177.8\|158.0\|12.5\|40.3\|35.4\|13.8 256 \|182.6\|157.3\|16.0\|41.7\|37.7\|10.6 avx512 asymmetric blklen\|updated prompt tps \| baseline prompt tps \| prompt tps change%\|updated token gen tps \| baseline token gen tps \| token gen change% -\|-\|-\|-\|-\|-\|- 16 \|136.1\|151.4\|-10.1%\|26.1\|19.9\|31.1% 32 \|150.0\|157.8\|-4.9%\|34.3\|29.3\|17.0% 64 \|165.7\|156.6\|5.8%\|38.7\|30.7\|26.0% 128 \|180.4\|156.6\|15.1%\|40.2\|34.7\|15.8% 256 \|181.3\|158.0\|14.7%\|41.6\|36.6\|13.6% avx512vnni symmetric blklen\|updated prompt tps \| baseline prompt tps \| prompt tps change%\|updated token gen tps \| baseline token gen tps \| token gen change% -\|-\|-\|-\|-\|-\|- 16 \|143.4\|155.4\|-7.7%\|25.6\|23.3\|9.8% 32 \|159.2\|157.0\|1.4%\|34.1\|29.8\|14.4% 64 \|182.0\|159.5\|14.1%\|38.4\|34.8\|10.3% 128 \|221.2\|160.8\|37.5%\|41.0\|36.4\|12.6% 256 \|250.5\|162.4\|54.2%\|41.6\|37.7\|10.3% avx512vnni asymmetric blklen\|updated prompt tps \| baseline prompt tps \| prompt tps change%\|updated token gen tps \| baseline token gen tps \| token gen change% -\|-\|-\|-\|-\|-\|- 16 \|142.5\|152.3\|-6.4%\|26.3\|19.7\|33.5% 32 \|158.2\|155.0\|2.0%\|34.3\|29.2\|17.4% 64 \|184.1\|156.6\|17.5%\|38.3\|30.9\|23.9% 128 \|215.8\|156.1\|17.5%\|41.3\|35.0\|17.9% 256 \|249.2\|155.9\|59.8%\|41.1\|36.3\|13.2% 4bit gemm implementation with avx using tile. 1. The tile size is 2blk by 4. When the remaining size is less than a tile, it reduces to 1blk by 4, 2blk by 1 and lastly 1blk by 1. In the internal kernel, weights and activations are loaded based on SIMD register width and blk length: avx2 256bit register, 64 weights and activations are loaded. blklen16: 4 blks are computed by the internal kernel blklen32: 2 blks are computed by the internal kernel blklen64: 1 blk is computed by the internal kernel blklen128: 1 blk is computed 2 times by the internal kernel blklen256: 1 blk is computed 4 times by the internal kernel avx512 512bit register, 128 weights and activations are loaded. 
blklen16: 8 blks are computed by the internal kernel blklen32: 4 blks are computed by the internal kernel blklen64: 2 blks are computed by the internal kernel blklen128: 1 blk is computed by the internal kernel blklen256: 1 blk is computed 2 times by the internal kernel 2. blksum is precomputed during prepacking. The computation is reformulated as: Sum1(scale_a * scale_b * Sum_blk(a_i * b_i)) + Sum2(blksum_a * blksum_b) Sum_blk is over one blk Sum1 is over all blks for one output Sum2 is over all blks for one output Sum is computed with sgemm in the current implementation. Further improvement is possible. --------- Signed-off-by: Liqun Fu <liqfu@microsoft.com> Signed-off-by: liqunfu <liqun.fu@microsoft.com> Signed-off-by: Liqun Fu <liqun_fu@hotmail.com>	2024-08-02 10:20:22 -07:00
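The reformulated sum can be checked against a plain dequantize-then-multiply reference in a few lines of Python. The commit does not spell out how blksum_a and blksum_b are defined, so this sketch assumes one plausible reading: activations are symmetric (integer codes times a scale), weights carry a zero point, blksum_a is scale_a times the sum of the activation codes, and blksum_b is minus scale_b times the weight zero point. The data layout and function names are hypothetical.

```python
# Assumed layout (hypothetical): activation block = (codes, scale_a),
# weight block = (codes, scale_b, zero_point_b).
def dequant_dot(a_blocks, b_blocks):
    """Reference: dequantize every element, then take the dot product."""
    total = 0.0
    for (qa, scale_a), (qb, scale_b, zp_b) in zip(a_blocks, b_blocks):
        total += sum((scale_a * x) * (scale_b * (y - zp_b)) for x, y in zip(qa, qb))
    return total


def blksum_dot(a_blocks, b_blocks):
    """Reformulated: integer block dot products (Sum1) plus a block-sum correction (Sum2)."""
    sum1 = 0.0
    sum2 = 0.0
    for (qa, scale_a), (qb, scale_b, zp_b) in zip(a_blocks, b_blocks):
        sum1 += scale_a * scale_b * sum(x * y for x, y in zip(qa, qb))
        blksum_a = scale_a * sum(qa)   # available once the activation block is quantized
        blksum_b = -scale_b * zp_b     # depends only on weights, so it can be prepacked
        sum2 += blksum_a * blksum_b
    return sum1 + sum2


a_blocks = [([1, -2, 3, 4], 0.5), ([0, 5, -6, 7], 0.25)]
b_blocks = [([3, 15, 0, 8], 0.125, 8), ([9, 1, 12, 7], 0.5, 8)]
reference = dequant_dot(a_blocks, b_blocks)
reformed = blksum_dot(a_blocks, b_blocks)
```

Under these assumptions the inner loop of Sum1 is a pure integer dot product, and the zero-point handling is pushed into the separate Sum2 term.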
Changming Sun	25722bb9e3	Add CUDA custom op header files to Linux tarball (#21551 ) ### Description The header files were added in PR #16454. Then, recently I made a PR #21464 that changed how we packed Linux tarballs. The new tarball misses the custom op header files. Therefore I need to make this change.	2024-08-01 04:23:02 -07:00
Yifan Li	5d78b9a17b	[TensorRT EP] Update TRT OSS Parser to 10.2 (#21552 ) ### Description Update TRT OSS Parser to [latest 10.2-GA branch](`f161f95883`)	2024-07-29 17:27:38 -07:00
Yulong Wang	b03c9496aa	[js/web] allow load WebAssembly binary from buffer (#21534 ) ### Description This PR adds a new option `ort.env.wasm.wasmBinary`, which lets the user supply a buffer containing preloaded .wasm file content. This PR should resolve the problem from the latest discussion in #20876.	2024-07-29 13:39:38 -07:00
liqun Fu	a4d3a1ce0c	pick changes from https://github.com/onnx/onnx/pull/6195 to fix heap-buffer-overflow in onnx::convPoolShapeInference (#21507 ) ### Description onnx 1.16.2 is not available before ort 1.19.0 code freeze. Thus pick the needed change as patch	2024-07-27 15:58:36 -07:00
Ranjit Ranjan	82b2955268	[AIX]test failure fix using gtest-1.15.0 for AIX (#21497 ) ### Description The local CI setup for AIX reported test failures after the gtest 1.15.0 upgrade. ### Motivation and Context The test failures below were observed after the gtest upgrade. The following tests FAILED: 1 - onnxruntime_test_all (ILLEGAL) 7 - onnxruntime_logging_apis_test (Subprocess aborted) To fix this, I am enabling pthread support under gtest. This was disabled with the previous version of gtest for some reason. With it enabled, the above tests pass with gtest 1.15.0.	2024-07-27 11:17:22 -07:00
Preetha Veeramalai	ca47f0fdd3	OVEP - PR 1.19 (#21443 ) ### Description Add OVEP features for 1.19 The PR has, - Added support for EpCtx with ORT Session options for optimized performance. - Added bug fixes - Support for OV 2024.3 --------- Co-authored-by: ubuntu <ubuntu@ubuntu-mtlp-118727.iind.intel.com> Co-authored-by: vthaniel <vishnudas.thaniel.s@intel.com> Co-authored-by: sfatimar <sahar.fatima@intel.com> Co-authored-by: saurabhkale17 <saurabh1.kale@intel.com> Co-authored-by: Maheshkar <ankit.maheshkar@intel.com>	2024-07-24 23:45:31 -07:00
Changming Sun	b04adcc381	Update copy_strip_binary.sh: use "make install" instead (#21464 ) ### Description Before this change, copy_strip_binary.sh manually copied each file from onnx runtime's build folder to an artifact folder. That can be hard when dealing with symbolic links for shared libraries. This PR changes the packaging pipelines to run "make install" first, before packaging shared libs. ### Motivation and Context Recently, because of feature request #21281, we changed libonnxruntime.so's SONAME. Now every package that contains this shared library must also contain libonnxruntime.so.1. Therefore we need to change the packaging scripts to include this file. Instead of manually constructing the symlink layout, using `make install` is much easier and will make things more consistent because it is a standard way of making packages. Breaking change: After this change, our inference tarballs that are published to our Github release pages will not contain ORT training headers.	2024-07-24 10:02:00 -07:00
Scott McKay	2580d935cb	CoreML: Add ML Program ConvTranspose (#21416 ) ### Description Add ML Program ConvTranspose - some limitations to simplify the implementation for now - some limitations due to flaky CoreML output Added support for non-contiguous MLMultiArray output as we see that with some unit tests when the CPU-only flag is not set (e.g. innermost dim has min size of 16 but test output only has 8 values). - support only one non-contiguous dim to keep it simple - manually tested as we don't have a setup that can test objective-c code - test code is in model.mm and can be enabled via ifdef if we need to validate any future changes ### Motivation and Context Address operator gaps in high priority model. --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2024-07-24 16:08:20 +10:00
Tianlei Wu	2b7e2a5bd0	[CUDA] Fix cuda provider fallback inconsistency (#21425 ) * Fix fallback setting (cuda still falls back to cuda). * Fix cuda provider fallback inconsistent with/without CUDA_PATH environment variable. * Add cuda and cudnn major version requirement in error message. Example result in Windows: ``` >>> import onnxruntime >>> ort_session = onnxruntime.InferenceSession("model.onnx", providers=['CUDAExecutionProvider', 'CPUExecutionProvider']) 2024-07-19 17:43:44.2260019 [E:onnxruntime:Default, provider_bridge_ort.cc:1972 onnxruntime::TryGetProviderInfo_CUDA] D:\onnxruntime\onnxruntime\core\session\provider_bridge_ort.cc:1636 onnxruntime::ProviderLibrary::Get [ONNXRuntimeError] : 1 : FAIL : LoadLibrary failed with error 126 "" when trying to load "C:\Users\.conda\envs\py310\lib\site-packages\onnxruntime\capi\onnxruntime_providers_cuda.dll" 2024-07-19 17:43:44.2312351 [W:onnxruntime:Default, onnxruntime_pybind_state.cc:970 onnxruntime::python::CreateExecutionProviderInstance] Failed to create CUDAExecutionProvider. Require cuDNN 9.* and CUDA 12.*, and the latest MSVC runtime. Please install all dependencies as mentioned in the GPU requirements page (https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements), make sure they're in the PATH, and that your GPU is supported. 
>>> ort_session <onnxruntime.capi.onnxruntime_inference_collection.InferenceSession object at 0x0000016BB2DF7D60> >>> ort_session.get_providers() ['CPUExecutionProvider'] ``` Example result in Linux: ``` >>> import onnxruntime >>> ort_session = onnxruntime.InferenceSession("resnet50-v2-7.onnx", providers=['CUDAExecutionProvider', 'CPUExecutionProvider']) 2024-07-20 20:33:26.486974543 [E:onnxruntime:Default, provider_bridge_ort.cc:1972 TryGetProviderInfo_CUDA] /work/onnxruntime/onnxruntime/core/session/provider_bridge_ort.cc:1636 onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_cuda.so with error: libcublasLt.so.12: cannot open shared object file: No such file or directory 2024-07-20 20:33:26.487034646 [W:onnxruntime:Default, onnxruntime_pybind_state.cc:961 CreateExecutionProviderInstance] Failed to create CUDAExecutionProvider. Require cuDNN 9.* and CUDA 12.*. Please install all dependencies as mentioned in the GPU requirements page (https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements), make sure they're in the PATH, and that your GPU is supported. >>> ort_session.get_providers() ['CPUExecutionProvider'] ``` ### Motivation and Context https://github.com/microsoft/onnxruntime/issues/21424	2024-07-23 11:58:04 -07:00
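As the example sessions in this commit show, a session can be created successfully while silently running on the CPU provider only. A caller that wants to detect this could compare the requested providers with the list returned by `get_providers()`. The helper below is a hypothetical sketch, not part of onnxruntime; it works on plain lists so it runs without a GPU or the onnxruntime package.

```python
def missing_providers(requested, active):
    """Return the requested provider names that are absent from the active list."""
    active_set = set(active)
    # A requested entry may be a name or a (name, options) tuple.
    names = [entry if isinstance(entry, str) else entry[0] for entry in requested]
    return [name for name in names if name not in active_set]


requested = ["CUDAExecutionProvider", "CPUExecutionProvider"]
# Stand-in for ort_session.get_providers() from the example output in this commit.
active = ["CPUExecutionProvider"]

fallen_back = missing_providers(requested, active)
if fallen_back:
    print("Requested but not active:", ", ".join(fallen_back))
```

With a real session, `active` would be the result of calling `get_providers()` on it right after construction.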
Changming Sun	f70215d4e6	Update C++ dependencies (#21410 ) 1. Update google benchmark from 1.8.3 to 1.8.5 2. Update google test from commit in main branch to tag 1.15.0 3. Update pybind11 from 2.12.0 to 2.13.1 4. Update pytorch cpuinfo to include the support for Arm Neoverse V2, Cortex X4, A720 and A520. 5. Update re2 from 2024-05-01 to 2024-07-02 6. Update cmake to 3.30.1 7. Update Linux docker images 8. Fix a warning in test/perftest/ort_test_session.cc:826:37: error: implicit conversion loses integer precision: 'streamoff' (aka 'long long') to 'const std::streamsize' (aka 'const long') [-Werror,-Wshorten-64-to-32]	2024-07-23 10:00:36 -07:00
Sheil Kumar	dd010edb37	Update DirectML from 1.14.1 to 1.15.0 (#21323 ) Update DirectML from 1.14.1 to 1.15.0 --------- Co-authored-by: Sheil Kumar <sheilk@microsoft.com> Co-authored-by: Dwayne Robinson <dwayner@microsoft.com>	2024-07-22 16:59:03 -07:00
mindest	5b9369e93c	Fix typos according to reviewdog report. (#21335 ) ### Description Fix typos based on reviewdog report but with some exceptions/corrections.	2024-07-22 13:37:32 -07:00
Tianlei Wu	6ffaaebb60	[CUDA] Attention kernel provider option (#21344 ) ### Description * Add a cuda provider option `sdpa_kernel` to choose which attention kernel to run, for testing purposes. * Allow dumping which attention kernel is used per node. * Reserve a flag for cudnn flash attention, which will be added soon. #### CUDA provider option sdpa_kernel As an alternative to setting an environment variable, the kernel can also be selected through a provider option. Note that the setting is global per session. That could help performance testing of each kernel. #### Attention Kernel Debug Info Set the environment variable `ORT_ENABLE_ATTENTION_KERNEL_DEBUG_INFO=1`, and ORT will print the sdpa kernel used in each node. For example: ``` ORT_ENABLE_ATTENTION_KERNEL_DEBUG_INFO=1 ./onnxruntime_test_all --gtest_filter=MultiHeadAttentionTest* ``` It will show debug information about the kernel used in testing: ``` [ RUN ] MultiHeadAttentionTest.SelfAttention_Batch2_HeadSize32_NoBias_NoMask_PackedQKV AttentionKernelOptions: FLASH_ATTENTION=0 EFFICIENT_ATTENTION=0 TRT_FUSED_ATTENTION=1 CUDNN_FLASH_ATTENTION=0 TRT_FLASH_ATTENTION=1 TRT_CROSS_ATTENTION=0 TRT_CAUSAL_ATTENTION=0 MATH=1 Operator=MultiHeadAttention Node=node1 DataType=fp16 TRT_FUSED_ATTENTION=1 AttentionKernelOptions: FLASH_ATTENTION=0 EFFICIENT_ATTENTION=1 TRT_FUSED_ATTENTION=0 CUDNN_FLASH_ATTENTION=0 TRT_FLASH_ATTENTION=0 TRT_CROSS_ATTENTION=0 TRT_CAUSAL_ATTENTION=0 MATH=1 Operator=MultiHeadAttention Node=node1 DataType=fp16 EFFICIENT_ATTENTION=1 ``` In this test case, the debug info shows that one session uses trt fused attention and another session uses efficient attention.	2024-07-19 13:58:54 -07:00
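From Python, the `sdpa_kernel` provider option and the debug environment variable from this commit might be combined roughly as follows. This is a sketch under assumptions: the helper `cuda_providers` is hypothetical, the value 8 is the one the cuDNN flash attention commit (#21629) in this log gives for enabling that kernel, and "model.onnx" is a placeholder. Session creation is left commented out so the snippet runs without onnxruntime-gpu installed.

```python
import os


def cuda_providers(sdpa_kernel=None):
    """Build a providers list for onnxruntime.InferenceSession with an optional sdpa_kernel."""
    cuda_options = {}
    if sdpa_kernel is not None:
        # Provider option values are passed as strings here to be safe.
        cuda_options["sdpa_kernel"] = str(sdpa_kernel)
    return [("CUDAExecutionProvider", cuda_options), "CPUExecutionProvider"]


# Ask ORT to print which attention kernel each node uses (set before creating the session).
os.environ["ORT_ENABLE_ATTENTION_KERNEL_DEBUG_INFO"] = "1"

providers = cuda_providers(sdpa_kernel=8)

# With onnxruntime-gpu installed, the list would be used like this:
# import onnxruntime
# session = onnxruntime.InferenceSession("model.onnx", providers=providers)
```

Since the option is global per session, comparing kernels means creating one session per `sdpa_kernel` value.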
Ranjit Ranjan	6c7562b097	Enablement of onnxruntime for AIX and fixing issues related to big-endian platform. (#21133 ) ### Description Enablement of onnxruntime for AIX and fixing issues related to big-endian platform. ### Motivation and Context Changes in this PR contain: 1. Enablement code for building onnxruntime on the AIX operating system. 2. While testing the build on AIX, we found issues related to the big-endian platform. More details about a few of those issues can be found in [Big endian issue: Graph Transformation Attention Fusion tests are failing #12921](https://github.com/microsoft/onnxruntime/issues/12921) Below is the list of files with a description of each change. 1. cmake/CMakeLists.txt [BUILDING on AIX issue] check for "IBMClang" is added for handling -Wno-unused-parameter 2. cmake/external/onnxruntime_external_deps.cmake [BUILDING on AIX issue] Enabling gtest_disable_pthreads for AIX 3. cmake/onnxruntime.cmake [BUILDING on AIX issue] o Blocking code for AIX which generates generated_source.c and further requires some symbol files. o Putting NOT AIX check for non-supported linker flags like --Xlinker o iconv linking 4. cmake/onnxruntime_framework.cmake [BUILDING on AIX issue] Putting NOT AIX check for -Wl,-rpath='$ORIGIN' 5. cmake/onnxruntime_mlas.cmake [BUILDING on AIX issue] POWER10 related macro/function definition. 6. cmake/onnxruntime_providers_cpu.cmake [BUILDING on AIX issue] Putting NOT AIX check for non-supported linker flags like --Xlinker 7. cmake/onnxruntime_unittests.cmake [BUILDING on AIX issue] o Putting NOT AIX check for non-supported linker flags like --Xlinker o Adding required libraries for the AIX linker under applications like onnxruntime_shared_lib_test, onnxruntime_logging_apis etc 8. cmake/patches/flatbuffers/flatbuffers.patch [BUILDING on AIX issue] Handling of TypeCode in include/flatbuffers/flatbuffers.h under AIX + clang 9. 
onnxruntime/contrib_ops/cpu/murmur_hash3.cc [Big endian issue] Byte-Conversion handlling in compute() and getblock() routines 10. onnxruntime/contrib_ops/cpu/quantization/matmul_nbits_impl.cc [Big endian issue] Handling of test failures . Byte swapping for quant_value. 11. onnxruntime/core/framework/tensorprotoutils.cc [Big endian issue] Implementation of SetRawDataInTensorProto , ConvertRawDataInTensorProto . o SetRawDataInTensorProto : Wrapper for set_raw_data(). Calling ConvertRawDataInTensorProto() in big-endian system o ConvertRawDataInTensorProto : function used mainly on big-endian system for byte-swapping of tensor raw_data 12. onnxruntime/core/framework/tensorprotoutils.h [Big endian issue] Declaration of SetRawDataInTensorProto, ConvertRawDataInTensorProto 13. onnxruntime/core/graph/graph.cc [Big endian issue] o Call ConvertRawDataInTensorProto for SPARSE_TENSOR type o Call ConvertRawDataInTensorProto for SaveToOrtFormat 14. onnxruntime/core/mlas/lib/platform.cpp [BUILDING on AIX issue] POWER10 released enablement for AIX 15. onnxruntime/core/mlas/lib/power/qgemm_kernel_power10.cpp [BUILDING on AIX issue]Handling of __vector under AIX+clang 16. onnxruntime/core/mlas/lib/qgemm.h [BUILDING on AIX issue] Adding _AIX flag 17. onnxruntime/core/mlas/lib/qlmul.cpp [BUILDING on AIX issue] Handling of __vector under AIX+clang 18. onnxruntime/core/optimizer/attention_fusion.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 19. onnxruntime/core/optimizer/compute_optimizer/shared_utils.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 20. onnxruntime/core/optimizer/constant_folding.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 21. onnxruntime/core/optimizer/embed_layer_norm_fusion.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 22. 
onnxruntime/core/optimizer/nchwc_transformer.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 23. onnxruntime/core/optimizer/qdq_transformer/avx2_weight_s8_to_u8.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 24. onnxruntime/core/optimizer/qdq_transformer/qdq_s8_to_u8.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 25. onnxruntime/core/optimizer/qdq_transformer/s8_to_u8.h [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 26. onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_actions.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 27. onnxruntime/core/optimizer/reshape_fusion.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 28. onnxruntime/core/optimizer/stft_decomposition.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 29. onnxruntime/core/optimizer/transpose_optimization/ort_optimizer_api_impl.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 30. onnxruntime/core/platform/path_lib.h [BUILDING on AIX issue] Moving to normal function call, instead of template 31. onnxruntime/core/platform/posix/env.cc [BUILDING on AIX issue]Blocking syscall.h in AIX 32. onnxruntime/core/session/inference_session.cc [Big endian issue] Removing ORT_RETURN_IF_NOT, FLATBUFFERS_LITTLEENDIAN 33. onnxruntime/test/flatbuffers/flatbuffer_utils_test.cc [Big endian issue] Call ConvertRawDataInTensorProto in CreateInitializer and ExternalWriteReadWithLoadInitializers 34. onnxruntime/test/framework/sparse_kernels_test.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 35. onnxruntime/test/framework/tensorutils_test.cc [Big endian issue] Helper method ConvertEndianessForVector and call this from required place. 36. 
onnxruntime/test/framework/test_tensor_loader.cc o. [BUILDING on AIX issue] Handling of getcwd for AIX o. [Big endian issue] Bytes Swapping in run_external_data_test 37. onnxruntime/test/onnx/main.cc [Big endian issue] including <thread> for AIX 38. onnxruntime/test/onnx/tensorprotoutils.cc [Big endian issue] Bytes swapping in UnpackTensorWithRawData 39. onnxruntime/test/optimizer/graph_transform_test.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 40. onnxruntime/test/optimizer/graph_transform_test_builder.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 41. onnxruntime/test/optimizer/graph_transform_test_builder.h [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 42. onnxruntime/test/optimizer/initializer_test.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 43. onnxruntime/test/optimizer/nchwc_optimizer_test.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 44. onnxruntime/test/providers/base_tester.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 45. onnxruntime/test/providers/cpu/generator/random_test.cc [BUILDING on AIX issue] Adding AIX check in MultinomialGoodCase --------- Co-authored-by: Vamshikrishna Thatikonda <vamshikrishna@in.ibm.com>	2024-07-17 12:37:06 -07:00
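The byte swapping of tensor `raw_data` described in the commit above can be illustrated with a small sketch. This is not the C++ code from the PR: `swap_element_bytes` and `to_little_endian_raw` are hypothetical names, and the sketch assumes only that raw tensor data is stored little-endian with a fixed element width, so a big-endian host must reverse the bytes of each element.

```python
import sys


def swap_element_bytes(raw: bytes, element_size: int) -> bytes:
    """Reverse the byte order of every fixed-width element in a raw buffer."""
    if element_size <= 0 or len(raw) % element_size != 0:
        raise ValueError("buffer length must be a multiple of element_size")
    out = bytearray(len(raw))
    for start in range(0, len(raw), element_size):
        end = start + element_size
        # Reverse the bytes of one element; element order is unchanged.
        out[start:end] = raw[start:end][::-1]
    return bytes(out)


def to_little_endian_raw(native_raw: bytes, element_size: int,
                         byteorder: str = sys.byteorder) -> bytes:
    """Return little-endian bytes for data laid out in the host's byte order.

    On a little-endian host this is a no-op; on a big-endian host every
    element is byte-swapped.
    """
    if byteorder == "big":
        return swap_element_bytes(native_raw, element_size)
    return native_raw


# Two 16-bit elements: each pair of bytes is reversed independently.
print(swap_element_bytes(b"\x01\x02\x03\x04", 2))
```

Because the swap is its own inverse, the same routine serves both when writing raw data out and when reading it back on a big-endian host.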
Changming Sun	e5f18ba2c1	Change libonnxruntime.so's SONAME: remove the minor and patch version. (#21339 ) ### Description Resolve #21281 and #10589 . 1. Change libonnxruntime.so's SONAME: remove the minor and patch version. By default when creating an ELF shared object, the linker will set the file's internal DT_SONAME field to the specified name, which is the file name plus SOVERSION. For example, the file name for our library is libonnxruntime.so. And by default SOVERSION is the lib's VERSION number, which is something like 1.18.0. So the DT_SONAME field in libonnxruntime.so is something like libonnxruntime.so.1.18.0. You can use the readelf tool to examine it. ``` readelf -d libonnxruntime.so \| grep SONAME 0x000000000000000e (SONAME) Library soname: [libonnxruntime.so.1.18.0] ``` When an executable is linked with a shared object which has a DT_SONAME field, then when the executable is run the dynamic linker will attempt to load the shared object specified by the DT_SONAME field rather than using the file name (which is libonnxruntime.so) given to the linker. After this change, the SONAME will be shortened to "libonnxruntime.so.1" instead. 2. Set default version strings for Windows DLLs, to resolve #10589	2024-07-15 14:21:34 -07:00
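The naming rule described in the commit above (file name plus SOVERSION, before and after the change) can be written out as a short sketch. The function names are hypothetical and only mirror the string arithmetic in the description; they do not reproduce the CMake logic of the PR.

```python
def full_version_soname(file_name: str, version: str) -> str:
    """SONAME when SOVERSION defaults to the full library version."""
    return f"{file_name}.{version}"


def major_only_soname(file_name: str, version: str) -> str:
    """SONAME when only the major version is kept, as after the change."""
    major = version.split(".")[0]
    return f"{file_name}.{major}"


print(full_version_soname("libonnxruntime.so", "1.18.0"))
print(major_only_soname("libonnxruntime.so", "1.18.0"))
```

With a major-only SONAME, an executable linked against one 1.x release asks the dynamic linker for `libonnxruntime.so.1`, so a later 1.x library can satisfy the same request.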
Edward Chen	9c2b85ad58	Fix Android build on Windows (#21304 ) - Pass a list of files instead of path separator-delimited string to project.files(). See this issue: https://github.com/gradle/gradle/issues/19817 - Check for host (instead of target) being Windows when using fallback patch program.	2024-07-15 12:29:02 -07:00
Ted Themistokleous	4ac4cd2668	Migraphx ep windows build (#21284 ) ### Description Repeat of #21084 with removal of policy CMP0144 to suppress warnings which uses CMake 3.27.0. ### Motivation and Context Already approved PR: https://github.com/microsoft/onnxruntime/pull/21084 Removed the added policy from CMake 3.27.0.	2024-07-11 21:21:38 -07:00
Qingnan Duan	80b56feb41	Implement FlashAttention for CPU (#20805 ) ### Description Implement [FlashAttention](https://arxiv.org/pdf/2205.14135) and [FlashAttention-2](https://arxiv.org/pdf/2307.08691) for MultiHeadAttention on CPU. ### Motivation and Context Accelerate the execution of MultiHeadAttention. Current performance: 10ms vs 16ms (com.microsoft.MultiHeadAttention) on my Linux machine and 10ms vs 38ms (com.microsoft.MultiHeadAttention) on my Windows machine. May need further optimizations. --------- Co-authored-by: Tianlei Wu <tlwu@microsoft.com> Co-authored-by: Qingnan Duan <qiduan@microsoft.com>	2024-07-11 14:19:59 -07:00
Edward Chen	20cd3394fc	[MLAS] AArch64 SQNBitGemm CompInt8 initial multi-row implementation (#21193 ) Update AArch64 SQNBitGemm CompInt8 kernels to process matrix in tiles. E.g., computing the output in 2x2 tiles allows us to compute four elements of the output with one read of two rows of A and two columns of B. Also moved some code around as it was getting big for a single file.	2024-07-10 15:39:26 -07:00
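The 2x2 tiling idea in the commit above can be shown with a plain matrix multiply. This is a sketch in pure Python, not the AArch64 int8 kernel: `matmul_2x2_tiles` is a hypothetical name, and the matrix dimensions are assumed to be even to keep the example short.

```python
def matmul_2x2_tiles(a, b):
    """Multiply a (M x K) by b (K x N) computing the output in 2x2 tiles.

    For each tile, every step along K loads two values from A and two from B
    and uses them for four multiply-accumulates.
    """
    m, k, n = len(a), len(a[0]), len(b[0])
    if m % 2 or n % 2:
        raise ValueError("this sketch assumes even M and N")
    c = [[0] * n for _ in range(m)]
    for i in range(0, m, 2):
        for j in range(0, n, 2):
            c00 = c01 = c10 = c11 = 0
            for p in range(k):
                a0, a1 = a[i][p], a[i + 1][p]   # one read of two rows of A
                b0, b1 = b[p][j], b[p][j + 1]   # one read of two columns of B
                c00 += a0 * b0
                c01 += a0 * b1
                c10 += a1 * b0
                c11 += a1 * b1
            c[i][j], c[i][j + 1] = c00, c01
            c[i + 1][j], c[i + 1][j + 1] = c10, c11
    return c


print(matmul_2x2_tiles([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Compared with computing each output element separately, the tile reuses every loaded value twice, which is the motivation the commit gives for processing the matrix in tiles.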
Changming Sun	8749fa381e	Update absl (#21300 ) ### Description Our macOS pipelines are failing because of a build error in absl. However, the bug fix we need is not available in the latest ABSL release. Here is the issue: https://github.com/abseil/abseil-cpp/pull/1536 And here is the fix: `779a3565ac` GTest uses ABSL. But this ABSL target also depends on GTest. So, it is a circular dependency. We should be able to avoid that by not building tests for ABSL. However, the version we are using has a problem with that: it has a cmake target that still depends on GTest even when testing is disabled. It's strange that we suddenly hit this problem and it only happens on macOS.	2024-07-10 11:14:15 -07:00
Ștefan Talpalaru	1b19045afa	[build] allow MPI on Unix when NCCL is disabled (#21175 ) ### Description CMake logic fixed to allow enabling MPI while NCCL is disabled. ### Motivation and Context MPI is also used on the CPU backend, not only with CUDA, so it makes sense to decouple it properly from NCCL (which is for dealing with multiple Nvidia GPUs).	2024-07-09 21:21:40 -07:00
Hann Wang	d28c26a919	[ROCm] fix: obtain AMD GPU memory info through rocm_smi library (#21190 ) ### Description Previously ROCMExecutionProvider used `hipMemGetInfo` to obtain the sizes of total memory and available memory. However, this API has been broken since ROCm 5.7. In this PR, we use the `rocm_smi` library instead of `hipMemGetInfo`. ### Motivation and Context The `hipMemGetInfo` API has been broken since ROCm 5.7, and inference with ROCMExecutionProvider leads to the following error: ``` HIP failure 1: invalid argument ; GPU=0 ; hostname=4cc4900475fe ; file=/onnxruntime/onnxruntime/core/providers/rocm/rocm_execution_provider.cc ; line=229 ; expr=hipMemGetInfo(&free, &total); ``` MIOpen has a brute-force fix for this (`911e671895/src/hip/handlehip.cpp (L72)`). Instead of hard-coding available memory to 16GB, I suppose we could obtain memory info through the `rocm_smi` library as in this PR.	2024-07-09 20:35:26 -07:00
Chen Feiyue	fffd430091	[VSINPU]Code improvement && Slice/Dropout OP support (#21217 ) ### Description - Refactor code to meet the line length limit and guard against missing warnings - Add slice/dropout op support - Move vsinpu ep's cmake settings from onnxruntime_providers.cmake to a separate file - Modify APIs that take an onnxruntime::Path parameter, because this type was replaced by std::filesystem::path in #20920	2024-07-09 20:14:46 -07:00
Maximilian Müller	cc0de0d526	[Build] Propagate build option for CUDA minimal to TRT (#20695 ) ### Description Extend the cuda minimal option to the TRT provider, as with TRT 10 linking to cuDNN is no longer required. Besides that, with the new engine dump feature it is also possible to embed an engine into an ONNX model and not ship a builder lib. In addition, this has roughly the same deserialization time/session setup time as using TRT standalone. ### Motivation and Context ``` exe_builder_lib\onnxruntime_perf_test.exe -I -e tensorrt -r 5 -i 'trt_engine_cache_enable\|1 trt_timing_cache_enable\|1 trt_dump_ep_context_model\|1 trt_weightless_engine_enable\|1' model.onnx exe_no_builder_lib\onnxruntime_perf_test.exe -I -e tensorrt -r 5 -i 'trt_engine_cache_enable\|1 trt_timing_cache_enable\|1 trt_dump_ep_context_model\|1 trt_weightless_engine_enable\|1' model_ctx.onnx ```	2024-07-09 14:40:04 -07:00
cloudhan	f39ee14b46	Add GQA support for ROCm (#21032 )	2024-07-03 14:55:31 +08:00
Baiju Meswani	116398c1a4	onnxruntime shared lib inside python package (#21223 )	2024-07-02 15:37:50 -07:00
Chen Feiyue	56b36a58ba	Initial PR for VSINPU execution provider (#20903 ) ### Description - This is an initial PR for the VSINPU execution provider. ### Motivation and Context - To support VeriSilicon hardware. - TIM-VX (Tensor Interface Module) (https://github.com/VeriSilicon/TIM-VX) is an integrated software solution by VeriSilicon for our hardware (A311D/i.MX 8M Plus etc.) design. This VSINPU execution provider makes it easy to use VeriSilicon's hardware by connecting onnxruntime to the TIM-VX API.	2024-06-28 21:48:34 -07:00
Changming Sun	3a83f8b317	Update the functions in tensorprotoutils.h to use std::filesystem::path instead (#20920 ) ### Description 1. Update the functions in tensorprotoutils.h to use std::filesystem::path instead of onnxruntime::Path. Eventually we can remove the whole onnxruntime::Path class, but to keep this PR small I am not doing that. 2. Remove the _SILENCE_EXPERIMENTAL_FILESYSTEM_DEPRECATION_WARNING macro def when TensorRT EP is enabled.	2024-06-28 20:03:57 -07:00
Preetha Veeramalai	6baaaf5165	OVEP options to disable CPU fallback at compile time (#21166 ) ### Description Provide user level options to control the fallback on CPU for models not supported on Intel's NPU hardware. ### Motivation and Context - Current workflow of OVEP allows safe fallback from OV NPU to OV CPU on compilation failures. Also supports MLAS CPU fallback in presence of unsupported custom ops. - The PR provides a build-time option to disable fallback from OV NPU to OV CPU. - The session Option "kOrtSessionOptionsDisableCPUEPFallback" disables OV CPU and MLAS CPU fallback. - Also has bug fix for proto creation. --------- Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com> Co-authored-by: ankitm3k <ankit.maheshkar@intel.com>	2024-06-28 08:31:02 -07:00
Changming Sun	d1ab94c2b0	Add compatibility for NumPy 2.0 (#21085 ) ### Description As suggested by SciPy's doc, we will `Build against NumPy 2.0.0, then it will work for all NumPy versions with the same major version number (NumPy does maintain backwards ABI compatibility), and as far back as NumPy 1.19 series at the time of writing` I think it works because in [numpyconfig.h#L64](https://github.com/numpy/numpy/blob/main/numpy/_core/include/numpy/numpyconfig.h#L64) there is a macro NPY_FEATURE_VERSION. By default it is set to NPY_1_19_API_VERSION. And the NPY_FEATURE_VERSION macro controls ABI. This PR only upgrades the build-time dependency; when a user installs ONNX Runtime, they can still use numpy 1.x. ### Motivation and Context Recently numpy published a new version, 2.0.0, which is incompatible with the latest ONNX Runtime release.	2024-06-27 13:50:53 -07:00
mindest	eecc11afc7	[ROCm] Disable ck_tile in Debug build (#21178 ) ### Description tmp fix: disable ck_tile for Debug build. ### Motivation and Context Release build works fine for ck_tile, while Debug build fails. <details> <summary> Typical error log to revisit </summary> ``` [880/1797] Building HIP object CMakeFiles/onnxruntime_composable_kernel_fmha.dir/_deps/composable_kernel-build/fmha_fwd_d32_fp16_batch_b128x64x16x32x32x32_r2x1x1_w32x32x16_qr_async_vc_psddv.cpp.o FAILED: CMakeFiles/onnxruntime_composable_kernel_fmha.dir/_deps/composable_kernel-build/fmha_fwd_d32_fp16_batch_b128x64x16x32x32x32_r2x1x1_w32x32x16_qr_async_vc_psddv.cpp.o /opt/rocm/llvm/bin/clang++ -DEIGEN_MPL2_ONLY -DENABLE_ROCM_PROFILING -DENABLE_STRIDED_TENSORS -DENABLE_TRAINING -DENABLE_TRAINING_APIS -DENABLE_TRAINING_CORE -DENABLE_TRAINING_OPS -DENABLE_TRAINING_TORCH_INTEROP -DMIOPEN_VERSION=30100 -DORT_ENABLE_STREAM -DROCM_VERSION=60100 -DUSE_ROCM=1 -D_GNU_SOURCE -D__HIP_ROCclr__=1 -D__bf16__ -D__fp16__ -D__fp32__ -I/build/Debug/_deps/utf8_range-src -I/ws/onnxruntime/include/onnxruntime -I/ws/onnxruntime/include/onnxruntime/core/session -I/ws/onnxruntime/orttraining/orttraining/training_api/include -I/build/Debug/_deps/composable_kernel-src/example/ck_tile/01_fmha -I/build/Debug/_deps/composable_kernel-src/include -I/build/Debug/_deps/composable_kernel-build/include -I/build/Debug/_deps/composable_kernel-src/library/include -isystem /opt/rocm-6.1.0/include -g -O -std=gnu++17 --offload-arch=gfx90a -fPIC -x hip -mllvm=-amdgpu-early-inline-all=true -mllvm=-amdgpu-function-calls=false -MD -MT CMakeFiles/onnxruntime_composable_kernel_fmha.dir/_deps/composable_kernel-build/fmha_fwd_d32_fp16_batch_b128x64x16x32x32x32_r2x1x1_w32x32x16_qr_async_vc_psddv.cpp.o -MF CMakeFiles/onnxruntime_composable_kernel_fmha.dir/_deps/composable_kernel-build/fmha_fwd_d32_fp16_batch_b128x64x16x32x32x32_r2x1x1_w32x32x16_qr_async_vc_psddv.cpp.o.d -o 
CMakeFiles/onnxruntime_composable_kernel_fmha.dir/_deps/composable_kernel-build/fmha_fwd_d32_fp16_batch_b128x64x16x32x32x32_r2x1x1_w32x32x16_qr_async_vc_psddv.cpp.o -x hip -c /build/Debug/_deps/composable_kernel-build/fmha_fwd_d32_fp16_batch_b128x64x16x32x32x32_r2x1x1_w32x32x16_qr_async_vc_psddv.cpp In file included from /build/Debug/_deps/composable_kernel-build/fmha_fwd_d32_fp16_batch_b128x64x16x32x32x32_r2x1x1_w32x32x16_qr_async_vc_psddv.cpp:5: In file included from /build/Debug/_deps/composable_kernel-src/example/ck_tile/01_fmha/fmha_fwd.hpp:6: In file included from /build/Debug/_deps/composable_kernel-src/include/ck_tile/core.hpp:11: /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression 27 \| asm volatile("s_add_u32 m0, %0, m0" : : "n"(v) : "memory"); \| ^ /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 
'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression /build/Debug/_deps/composable_kernel-src/include/ck_tile/core/arch/utility.hpp:27:18: error: constraint 'n' expects an integer constant expression fatal error: too many errors emitted, stopping now [-ferror-limit=] 20 errors generated when compiling for gfx90a. ... ``` </details>	2024-06-27 12:04:17 +08:00
aciddelgado	ebd0368bb0	Make Flash Attention work on Windows (#21015 ) ### Description Previously, Flash Attention only worked on Linux systems. This PR will make it work and enable it to be built and run on Windows. Limitations of Flash Attention in Windows: Requires CUDA 12. ### Motivation and Context This will significantly increase the performance of Windows-based LLM's with hardware sm>=80. To illustrate the improvement of Flash Attention over Memory Efficient Attention, here are some average benchmark numbers for the GQA operator, run with configurations based on several recent models (Llama, Mixtral, Phi-3). The benchmarks were obtained on RTX4090 GPU using the test script located at (onnxruntime/test/python/transformers/benchmark_gqa_windows.py). * Clarifying Note: These benchmarks are just for the GQA operator, not the entire model. ### Memory Efficient Attention Kernel Benchmarks: \| Model Name \| Max Sequence Length \| Inference Interval (ms) \| Throughput (samples/second) \| \|----------------------------------------\|---------------------\|-------------------------\|-----------------------------\| \| Llama3-8B (Average Prompt) \| 8192 \| 0.19790525 \| 13105.63425 \| \| Llama3-8B (Average Token) \| 8192 \| 0.207775538 \| 12025.10172 \| \| Llama3-70B (Average Prompt) \| 8192 \| 0.216049167 \| 11563.31185 \| \| Llama3-70B (Average Token) \| 8192 \| 0.209730731 \| 12284.38149 \| \| Mixtral-8x22B-v0.1 (Average Prompt) \| 32768 \| 0.371928785 \| 7031.440056 \| \| Mixtral-8x22B-v0.1 (Average Token) \| 32768 \| 0.2996659 \| 7607.947159 \| \| Phi-3-mini-128k (Average Prompt) \| 131072 \| 0.183195867 \| 15542.0852 \| \| Phi-3-mini-128k (Average Token) \| 131072 \| 0.198215688 \| 12874.53494 \| \| Phi-3-small-128k (Average Prompt) \| 65536 \| 2.9884929 \| 2332.584142 \| \| Phi-3-small-128k (Average Token) \| 65536 \| 0.845072406 \| 2877.85822 \| \| Phi-3-medium-128K (Average Prompt) \| 32768 \| 0.324974429 \| 8094.909517 \| \| Phi-3-medium-128K (Average Token) \| 
32768 \| 0.263662567 \| 8978.463687 \| ### Flash Attention Kernel Benchmarks: \| Model Name \| Max Sequence Length \| Inference Interval (ms) \| Throughput (samples/second) \| \|--------------------------------------\|---------------------\|-------------------------\|-----------------------------\| \| Llama3-8B (Average Prompt) \| 8192 \| 0.163566292 \| 16213.69057 \| \| Llama3-8B (Average Token) \| 8192 \| 0.161643692 \| 16196.14715 \| \| Llama3-70B (Average Prompt) \| 8192 \| 0.160510375 \| 17448.67753 \| \| Llama3-70B (Average Token) \| 8192 \| 0.169427308 \| 14702.62043 \| \| Mixtral-8x22B-v0.1 (Average Prompt) \| 32768 \| 0.164121964 \| 15618.51301 \| \| Mixtral-8x22B-v0.1 (Average Token) \| 32768 \| 0.1715865 \| 14524.32273 \| \| Phi-3-mini-128k (Average Prompt) \| 131072 \| 0.167527167 \| 14576.725 \| \| Phi-3-mini-128k (Average Token) \| 131072 \| 0.175940594 \| 15762.051 \| \| Phi-3-small-128k (Average Prompt) \| 65536 \| 0.162719733 \| 17824.494 \| \| Phi-3-small-128k (Average Token) \| 65536 \| 0.14977525 \| 16749.19858 \| \| Phi-3-medium-128K (Average Prompt) \| 32768 \| 0.156490786 \| 17679.2513 \| \| Phi-3-medium-128K (Average Token) \| 32768 \| 0.165333833 \| 14932.26079 \| Flash Attention is consistently faster for every configuration we benchmarked, with improvements in our trials ranging from ~20% to ~650%. In addition to these improvements in performance, Flash Attention has better memory usage. For example, Memory Efficient Attention cannot handle a max sequence length higher than 32,768, but Flash Attention can handle max sequence lengths at least as high as 131,072. --------- Co-authored-by: Tianlei Wu <tlwu@microsoft.com>	2024-06-24 09:43:49 -07:00
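To relate individual rows of the two tables above to a percentage improvement, the arithmetic can be sketched as below. The throughput figures are copied from the tables; the function name is hypothetical, and treating the quoted percentage range as a throughput ratio is an assumption, since the PR text does not say which column it was derived from.

```python
def percent_improvement(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, in percent."""
    return (improved / baseline - 1.0) * 100.0


# Throughput (samples/second): Memory Efficient Attention vs Flash Attention.
llama3_8b_prompt = percent_improvement(13105.63425, 16213.69057)
phi3_small_prompt = percent_improvement(2332.584142, 17824.494)

print(f"Llama3-8B (Average Prompt): {llama3_8b_prompt:.1f}%")
print(f"Phi-3-small-128k (Average Prompt): {phi3_small_prompt:.1f}%")
```

The same function can be applied to any other pair of rows, or to the latency column by swapping the arguments, since a lower inference interval is better.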
Changming Sun	f5625b8858	Revert "[MIGraphX EP] enable compilation and execution on Windows (21084)" (#21132 ) ### Description This reverts commit `1d7bf56947` because it broke the AMD GPU CI pipeline. Sorry, when I reviewed the PR I forgot to run the AMD GPU CI pipeline. I will revert the PR first, then ask the author to fix the issue.	2024-06-21 01:01:07 -07:00
Jake Mathern	b9eb1dc21e	Update protobuf_cmake.patch to allow extra disablements configurable by projects that build ORT (#20875 ) ### Description Update protobuf_cmake.patch to allow extra disablements. The ORT repo already patches protobuf to not disable warning 4996. ### Motivation and Context To meet SDL requirements, Microsoft repos have to fail the build if there is a warning 4996. Binskim also gives errors if warning 4996 is disabled. We can suppress the Binskim issues, but we need a way to disable the warnings for the minimal set of code that has them. Right now, WindowsAI disables 4996 for the entirety of ORT, but it should only be disabled for protobuf.	2024-06-20 16:28:15 -07:00
Ted Themistokleous	1d7bf56947	[MIGraphX EP] enable compilation and execution on Windows (#36 ) (#21084 )	2024-06-20 16:21:11 -07:00
Changming Sun	be423747b1	Delete pyop (#21094 ) ### Description Remove the "--enable_language_interop_ops" build flag, because the code is incompatible with the latest numpy, and the build flag is not used anywhere except a macOS CI pipeline. It does not seem to have a ship plan. ### Motivation and Context The build error was: ``` onnxruntime/core/language_interop_ops/pyop/pyop.cc:122:85: error: no member named 'elsize' in '_PyArray_Descr' static_cast<int64_t>(PyArray_DescrFromType(type)->elsize), ~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^ ```	2024-06-19 16:21:33 -07:00
Clément Péron	8ab8e649a7	tools: build: fix typo (#21052 ) ### Description Typo in the python build script	2024-06-19 16:14:58 -07:00
Tianlei Wu	01279d8896	[ROCM] Exclude flash attention from hipify (#21091 ) Exclude flash attention sub-directory from hipify.	2024-06-19 08:59:10 -07:00
cloudhan	ddd4ce3cb7	[ROCm] Update ck to use ck_tile (#21030 )	2024-06-19 14:06:10 +08:00
Changming Sun	9ef4f1b789	Update pybind11 (#21072 ) ### Description Upgrade pybind11 to the latest as suggested by @gnought in #21063 ### Motivation and Context Recently numpy released a new version, which caused compatibility issue between the latest numpy version and the latest ONNX Runtime version.	2024-06-17 19:50:57 -07:00
Scott McKay	159fe9d4f3	Update to mobile model usability checker (#19843 ) ### Description - Add check for CoreML MLProgram supported ops - Only check usability with ORT Mobile package if requested - this package will be deprecated so info is a) of minimal value and b) can be confusing. - Output more things at INFO level - a lot of meaningful info was only output at DEBUG level. The default INFO level is more useful - dump full partition info at DEBUG level - Check subgraphs fully - CoreML can handle a subgraph - TBD if we want to add support for adding a subgraph to the parent graph for Loop and If nodes - most likely will be required for simple If nodes to be performant - Check 5D CoreML limitation ### Motivation and Context Improve helper tools --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2024-06-18 07:50:33 +10:00
Adrian Lizarraga	8f0e896c95	Fix Reduced Op build with empty FP16 kernel function tables (#21038 ) ### Description - Fixes compilation error for "reduced operator" builds with no FP16 kernels and `MLAS_F16VEC_INTRINSICS_SUPPORTED` enabled. - Fixes linker error for "reduced operator" builds with QNN EP by excluding QNN EP unit tests. QNN EP unit tests require CPU EP operator implementations to evaluate accuracy. ### Motivation and Context Need to be able to build a reduced operator build with QNN EP. See https://github.com/microsoft/onnxruntime/blob/main/docs/Reduced_Operator_Kernel_build.md The following example operator config file causes a compilation error when either `MLAS_F16VEC_INTRINSICS_SUPPORTED` is defined or QNN EP is enabled. ``` # reduced_op_config.txt ai.onnx;12;Add ``` ```shell python tools\ci_build\build.py --include_ops_by_config reduced_op_config.txt --config Debug --build_wheel --build_shared_lib --skip_tests --build_dir build --parallel --use_qnn --qnn_home '<QNN_ROOT_DIR>' ```	2024-06-14 14:23:12 -07:00
Ye Wang	f35dd1407f	custom allreduce cuda kernel (#20703 ) ### Description Conditionally route to custom AllReduce kernel when buffer size and gpu numbers meet certain requirements. Otherwise, keep using NCCL's AllReduce. ### Motivation and Context --------- Co-authored-by: Ye Wang <wangye@microsoft.com@h100vm-ort.kxelwkzfzxguje5bxvwxxs135a.gvxx.internal.cloudapp.net> Co-authored-by: Your Name <you@example.com>	2024-06-13 11:09:49 -07:00
Tianlei Wu	b3fc9b5a0e	[CUDA] upgrade cutlass to 3.5.0 (#20940 ) ### Description Upgrade cutlass to 3.5 to fix build errors using CUDA 12.4 or 12.5 on Windows - [x] Upgrade cutlass to 3.5.0. - [x] Fix flash attention build error with latest cutlass header files and APIs. This fix is provided by @wangyems. - [x] Update efficient attention to use new cutlass fmha interface. - [x] Patch cutlass to fix `hrsqrt` not found error for sm < 53. - [x] Disable TF32 Staged Accumulation to fix blkq4_fp16_gemm_sm80_test build error for cuda 11.8 to 12.3. - [x] Disable TRT 10 deprecation warnings. The following are not included in this PR: * TRT provider replaces the deprecated APIs. * Fix blkq4_fp16_gemm_sm80_test build error for cuda 12.4 or 12.5. This test is not built by default unless you add `--cmake_extra_defines onnxruntime_ENABLE_CUDA_EP_INTERNAL_TESTS=ON` to the build command. To integrate into rel-1.18.1: Either bring in other changes (like onnx 1.16.1), or generate a manifest and upload a new ONNX Runtime Build Time Deps artifact based on rel-1.18.1. ### Motivation and Context https://github.com/microsoft/onnxruntime/issues/19891 https://github.com/microsoft/onnxruntime/issues/20924 https://github.com/microsoft/onnxruntime/issues/20953	2024-06-11 13:32:15 -07:00
Changming Sun	92ae60b01f	Revert a cmake change in protobuf_cmake.patch (#20964 ) Avoid patching external projects unless absolutely necessary #20875	2024-06-10 11:20:33 -07:00
Edward Chen	981893c318	Remove deprecated "mobile" packages (#20941 ) # Description This PR removes the building of the ORT "mobile" packages and much of the associated infrastructure which is no longer needed. Not removed yet - tools/ci_build/github/android/mobile_package.required_operators.config and the helper scripts that depend on it. # Motivation and Context The mobile packages were deprecated in 1.18. Users should use the full packages (Android - onnxruntime-android, iOS - onnxruntime-c/onnxruntime-objc) instead or do a custom build.	2024-06-07 16:20:32 -05:00
Changming Sun	f8b5c2805e	Update abseil-cpp.cmake: add version check (#20962 ) Some dev environments come with a preinstalled abseil. For example, conda users often do that. If the preinstalled abseil version is incompatible with what we have in cmake/deps.txt, it could result in a hard-to-understand build error. This PR adds a version check to improve that.	2024-06-06 19:42:31 -07:00
liqun Fu	51bc53580d	Update to onnx 1.16.1 (#20702 )	2024-06-04 11:06:28 -07:00
Changming Sun	d13cabf7f9	Upgrade GCC and remove the dependency on GCC8's experimental std::filesystem implementation (#20893 ) ### Description This PR upgrades CUDA 11 build pipelines' GCC version from 8 to 11. ### Motivation and Context GCC8 has an experimental std::filesystem implementation which is not ABI compatible with the formal one in later GCC releases. It didn't cause trouble for us; however, the ONNX community has encountered this issue frequently. For example, https://github.com/onnx/onnx/issues/6047 . So this PR increases the minimum supported GCC version from 8 to 9, and removes the references to GCC's "stdc++fs" library. Please note we compile our code on RHEL8 and RHEL8's libstdc++ doesn't have the fs library, which means the binaries in ONNX Runtime's official packages always statically link to the fs library. It is just a matter of which version of the library, an experimental one or a more mature one. And it is an implementation detail that is not visible from outside. Anyway, a newer GCC is better. It will give us the chance to use many C++20 features. #### Why were we using GCC 8? It is because all our Linux packages were built on RHEL8 or its equivalents. The default GCC version in RHEL8 is 8. RHEL also provides additional GCC versions from RH devtoolset. UBI8 is the abbreviation of Red Hat Universal Base Image 8, which is the containerized RHEL8. UBI8 is free, which means it doesn't require a subscription (while RHEL does). The only devtoolset that UBI8 provides is GCC 12, which is too new to be used with CUDA 11.8. And our CUDA 11.8 build env is a docker image from Nvidia that is based on UBI8. #### How the problem is solved Almalinux is an alternative to RHEL. Almalinux 8 provides GCC 11. And the CUDA 11.8 docker image from Nvidia is open source, which means we can rebuild the image based on Almalinux 8 to get GCC 11. I've done this, but I cannot republish the new image due to various complicated license restrictions. 
Therefore I put them at an internal location in onnxruntimebuildcache.azurecr.io.	2024-06-03 10:14:08 -07:00
Changming Sun	65ef270e06	Update Aten pipeline's docker file to use UBI8 (#20856 ) ### Description The pipeline currently uses CentOS 7, which is EOL. This PR updates it to UBI8. ### Motivation and Context To deprecate CentOS 7.	2024-05-30 07:38:15 -07:00
Carson M	5bfca1dc57	[Build] Change `onnxruntime_NVCC_THREADS` from option to cache entry (#20768 ) ### Description Changes the `onnxruntime_NVCC_THREADS` CMake variable from an [`option`](https://cmake.org/cmake/help/latest/command/option.html) to a [cache entry](https://cmake.org/cmake/help/latest/command/set.html#set-cache-entry). ### Motivation and Context Fixes #19833. `option` in CMake (confusingly, IMHO) always defines a boolean option. The original definition of `onnxruntime_NVCC_THREADS` specified a default of `1`, which I presume is coerced to `ON`. Thus, if the option is not overridden with a value of another type, NVCC will receive a malformed option `--threads ON` (rather than the expected `--threads 1`), which causes the error reported in #19833. This error only occurred if compiling ONNX Runtime via CMake with `-Donnxruntime_USE_CUDA=ON`; the CI build script always overrode `onnxruntime_NVCC_THREADS` with a string value: `f1fef19b6e/tools/ci_build/build.py (L1152-L1154)`	2024-05-29 12:28:33 -07:00
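The failure mode described in the commit above can be sketched in a few lines of Python. This is a simplified model, not CMake itself: `cmake_option_value` and `nvcc_threads_flag` are hypothetical helpers, and the coercion of a default of `1` to `ON` follows what the commit message presumes about `option`.

```python
# Simplified model (assumption): option() stores a boolean ON/OFF,
# while a cache entry keeps the string it was given.
TRUTHY = {"1", "ON", "YES", "TRUE", "Y"}


def cmake_option_value(default: str) -> str:
    """Model of option(): the default is coerced to ON or OFF."""
    return "ON" if default.upper() in TRUTHY else "OFF"


def nvcc_threads_flag(value: str) -> str:
    """Build the NVCC argument from the stored variable value."""
    return f"--threads {value}"


# Declared as an option: the default "1" becomes "ON" and NVCC gets a malformed flag.
as_option = nvcc_threads_flag(cmake_option_value("1"))

# Declared as a cache entry: the string "1" is passed through unchanged.
as_cache_entry = nvcc_threads_flag("1")

print(as_option)
print(as_cache_entry)
```

Under this model the first call yields the malformed `--threads ON` reported in #19833, and the second yields the expected `--threads 1`.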
Chi Lo	a7bc49a565	[TensorRT EP] Use latest commit of onnx-tensorrt parser (#20758 ) The 10 GA branch updated with several issues fixed. https://github.com/onnx/onnx-tensorrt/commits/10.0-GA/	2024-05-24 16:44:16 -07:00
Changming Sun	b522df0ae4	Update RE2 to the latest (#20775 ) Update RE2 to the latest. To keep the components up to date.	2024-05-23 14:30:15 -07:00
Edward Chen	a39f8862fd	SQNBitGemm - move workspace size calculation functions to hardware-specific implementations (#20757 ) The workspace usage may be hardware-specific. Moving away from a common workspace size calculation allows more flexibility in the hardware-specific implementations.	2024-05-22 15:12:17 -07:00
pengwa	8a98874e7e	Flash attention recompute (#20603 ) ### Flash attn recompute 1. Allow PythonOp(FlashAttn) to be recomputed correctly. `45879ff5c2` 2. Use JSON to pass the selected-to-recompute subgraphs. `3c374da678` #### Better Memory Efficiency The customer model can run with both PyTorch SDPA and Flash Attn; this PR makes it possible for the Flash Attn path to work with ORTModule layerwise recompute. The peak drops from 45.xGB to 32.xGB if we compare only the layers (not including other pieces; a few more optimizations targeting those pieces will follow later). #### Better Perf Using Flash Attn brings an additional 16% end-to-end time reduction, with a closely aligned loss curve. ![image](https://github.com/microsoft/onnxruntime/assets/10530022/bb63894a-f281-49bc-a8e6-ff818439be38) #### Use JSON File to pass Recompute Plans This overcomes the maximum length limit of the strings defined in session options. ### Motivation and Context	2024-05-21 13:38:19 +08:00
Yulong Wang	036fcd93d4	[js/web] optimize module export and deployment (#20165 ) ### Description This PR make numbers of optimizations to onnxruntime-web's module export and deployment. See each section below for more details. #### Preview > [onnxruntime-web@1.19.0-esmtest.20240513-a16cd2bd21](https://www.npmjs.com/package/onnxruntime-web/v/1.19.0-esmtest.20240513-a16cd2bd21) > ~~onnxruntime-web@1.19.0-esmtest.20240430-c7edbcc63d~~ > ~~onnxruntime-web@1.18.0-esmtest.20240428-624c681c83~~ > ~~onnxruntime-web@1.18.0-esmtest.20240411-1abb64e894~~ <details> <summary><h4>Breaking changes</h4></summary> There is no code change required, but there are a few differences regarding code import, flags, bundler config and deployment steps. #### Importing: Import table is changed. See following for details. <details> <summary><h5>Current import table:</h5></summary> \| Target Name \| Path for "import" or "require" \| WebGL \| JSEP \| wasm \| Proxy \| Training \| \|------\|-----\|-----\|-----\|-----\|-----\|-----\| \| `ort` (default) \| `onnxruntime-web` \| ✔️ \| ❌ \| ✔️ \| ✔️ \| ❌ \| \| `ort.all` \| `onnxruntime-web/experimental` \| ✔️ \| ✔️ \| ✔️ \| ✔️ \| ❌ \| \| `ort.node` \| `onnxruntime-web` \| ❌ \| ❌ \| ✔️ \| ❌ \| ❌ \| \| `ort.training` \| `onnxruntime-web/training` \| ❌ \| ❌ \| ✔️ \| ✔️<sup>\[1]</sup> \| ✔️ \| \| `ort.wasm` \| `onnxruntime-web/wasm` \| ❌ \| ❌ \| ✔️ \| ✔️ \| ❌ \| \| `ort.wasm-core` \| `onnxruntime-web/wasm-core` \| ❌ \| ❌ \| ✔️ \| ❌ \| ❌ \| \| `ort.webgl` \| `onnxruntime-web/webgl` \| ✔️ \| ❌ \| ❌ \| ✔️<sup>\[2]</sup> \| ❌ \| \| `ort.webgpu` \| `onnxruntime-web/webgpu` \| ❌ \| ✔️ \| ✔️ \| ✔️ \| ❌ \| * [1] didn't test. may not actually work. * [2] not working. this is a mistake in build config. 
</details> <details> <summary><h5>Proposed update:</h5></summary> \| Target Name \| Path for "import" or "require" \| WebGL \| JSEP \| wasm \| Proxy \| Training \| \|------\|-----\|-----\|-----\|-----\|-----\|-----\| \| `ort` (default) \| `onnxruntime-web` \| ✔️ \| ❌ \| ✔️ \| ✔️ \| ❌ \| \| `ort.all` \| ~~`onnxruntime-web/experimental`~~<br/>`onnxruntime-web/all` \| ✔️ \| ✔️ \| ✔️ \| ✔️ \| ❌ \| \| `ort.node` \| `onnxruntime-web` \| ❌ \| ❌ \| ✔️ \| ❌ \| ❌ \| \| `ort.training` \| `onnxruntime-web/training` \| ❌ \| ❌ \| ✔️ \| ✔️ \| ✔️ \| \| `ort.wasm` \| `onnxruntime-web/wasm` \| ❌ \| ❌ \| ✔️ \| ✔️ \| ❌ \| \| ~~`ort.wasm-core`~~ \| ~~`onnxruntime-web/wasm-core`~~ \| ~~❌~~ \| ~~❌~~ \| ~~✔️~~ \| ~~❌~~ \| ~~❌~~ \| \| `ort.webgl` \| `onnxruntime-web/webgl` \| ✔️ \| ❌ \| ❌ \| ~~✔️~~ ❌ \| ❌ \| \| `ort.webgpu` \| `onnxruntime-web/webgpu` \| ❌ \| ✔️ \| ✔️ \| ✔️ \| ❌ \| </details> #### Flags: The following flags are deprecated: - `env.wasm.simd` (boolean): will be ignored. SIMD is always enabled in build. The following flags changed their type: - `env.wasm.wasmPaths`: When using this flag as a string (for the URL prefix), nothing is changed. When using this flag as an object (for per-file path override), the type changed: ```diff - export interface Old_WasmFilePaths{ - 'ort-wasm.wasm'?: string; - 'ort-wasm-threaded.wasm'?: string; - 'ort-wasm-simd.wasm'?: string; - 'ort-training-wasm-simd.wasm'?: string; - 'ort-wasm-simd-threaded.wasm'?: string; - }; + export interface New_WasmFilePaths { + /** + * Specify the override path for the main .wasm file. + * + * This path should be an absolute path. + * + * If not modified, the filename of the .wasm file is: + * - `ort-wasm-simd-threaded.wasm` for default build + * - `ort-wasm-simd-threaded.jsep.wasm` for JSEP build (with WebGPU and WebNN) + * - `ort-training-wasm-simd-threaded.wasm` for training build + */ + wasm?: URL\|string; + /** + * Specify the override path for the main .mjs file. 
+ * + * This path should be an absolute path. + * + * If not modified, the filename of the .mjs file is: + * - `ort-wasm-simd-threaded.mjs` for default build + * - `ort-wasm-simd-threaded.jsep.mjs` for JSEP build (with WebGPU and WebNN) + * - `ort-training-wasm-simd-threaded.mjs` for training build + */ + mjs?: URL\|string; + } ``` #### Bundler compatibility: Config changes are needed for bundlers. See the usage examples in /js/web/test/e2e/ for Webpack, parcel and rollup. #### Deployment: - if consuming from a CDN, there is no breaking change. - if consuming from a local server, you need to copy all `ort-*.wasm` and `ort-*.mjs` files (6 files in total) from the dist folder. (Previously, only the `ort-*.wasm` files needed to be copied.) </details> <details> <summary><h4>Problems</h4></summary> There are a few problems with the current module export and deployment: - Script URL cannot be correctly inferred when imported as ESM. - Workers are forcefully encoded using Blob URL, which prevents onnxruntime-web from working in CSP environments and Node.js when using the proxy or multi-threading feature. - Generated JS code (by Emscripten) is encoded using `function.toString()`, which is unstable and error-prone. - When running with a different Emscripten build, the build step is always needed, making it difficult to swap artifacts in development/debug. </details> <details> <summary><h4>Goals</h4></summary> - Full ESM support - Support a variety of ways to import. 
Including: - import from HTML's `<script>` tag (IIFE format, exporting to global variable `ort`) ```html <script src="https://example.com/cdn-path-to-onnxruntime-web/dist/ort.min.js"></script> ``` - import from source code inside `<script type="module">` tag (ESM) ```html <script type="module"> import * as ort from "https://example.com/cdn-path-to-onnxruntime-web/dist/ort.min.mjs"; // using 'ort' </script> ``` - import in a CommonJS project (CJS format, resolve from package.json "exports" field) ```js // myProject/main.js const ort = require('onnxruntime-web'); ``` - import in an ESM project (ESM format, resolve from package.json "exports" field) ```js // myProject/main.js (or main.mjs) import * as ort from 'onnxruntime-web'; ``` - Support popular bundlers when importing onnxruntime-web into a CJS/ESM project. - webpack (esm requires extra post-process step) - rollup - parcel (esm requires extra post-process step) - More bundlers TBD - Multi-threading support for Node.js NOTE: keeping single JavaScript file (the all-in-one bundle) is no longer a goal. This is because technically there is a conflict with the other requirements. </details> <details> <summary><h4>Important Design Decisions</h4></summary> - Drop support of single JavaScript output. - The current onnxruntime-web distribution uses a single JavaScript file to include all code. While there are a few benefits, it also creates problems as mentioned above. Since ESM is being used more and more widely, and browsers are making more restricted security checks and requirement, the old Blob based solution is going to be replaced. - To achieve the requirement, specifically, the CSP environment support, we have to offer a non Blob based solution. Therefore, we have to distribute multiple files and drop the single file solution. - Do not run parser/postprocess on Emscripten generated JavaScript. 
- Emscripten is evolving quickly, so we should depend only on what's in its documentation rather than on particular implementation details. (for example, currently we patch its code to deal with a special variable `_scriptDir`) - Keeping the generated files as-is also helps to: - reduce the size of ort.min.js - make it easier to replace build artifacts when in development/debug - Drop support for non-SIMD and non-MultiThread. This helps to reduce the number of artifacts in distribution. - (fixed-sized) SIMD is supported in any mainstream JS environment. - Multi-threading as a WebAssembly feature is supported in any mainstream JS environment. In some environments the feature is guarded by cross-origin policy, but it can still work if no worker is created. - Use ESM output for Emscripten generated JavaScript. - There are 2 ways to dynamically import classic (umd) modules and neither is recommended: - dynamically creating a `<script>` tag. This changes the HTML structure and has quite a few compatibility issues - use `fetch()` and `eval()`. However, `eval` should be avoided because it carries a significant perf hit. - importing ESM is super easy - just use the `import()` call. Considering ESM is widely supported in modern browsers and Node.js, this is the better option. - Add Blob based solution as a fallback for cross-origin workers. - Importing onnxruntime-web from a CDN is still a widespread use case. In this usage, make it possible to create a worker by using `fetch()`+`Blob` to create a same-origin Blob URL. </details> <details> <summary><h4>Distribution File Manifest</h4></summary> The distribution folder contains the following files: - WebAssembly artifacts. These files are the result of compiling the ONNX Runtime C++ code to WebAssembly by Emscripten. 
\| File Name \| Build Flags \| \|------\|-----\| \| ort-wasm-simd-threaded.mjs <br/> ort-wasm-simd-threaded.wasm \| `--enable_wasm_simd` <br/> `--enable_wasm_threads` \| \| ort-training-wasm-simd-threaded.mjs <br/> ort-training-wasm-simd-threaded.wasm \| `--enable_training_apis` <br/> `--enable_wasm_simd` <br/> `--enable_wasm_threads` \| \| ort-wasm-simd-threaded.jsep.mjs <br/> ort-wasm-simd-threaded.jsep.wasm \| `--enable_wasm_simd` <br/> `--enable_wasm_threads` <br/> `--use_jsep` <br/> `--use_webnn` \| - onnxruntime-web JavaScript artifacts. These files are generated by ESBuild as the entry point for onnxruntime-web. There are multiple build targets for different use cases: \| Target Name \| Path for "import" or "require" \| Description \| \|------\|-----\|-----\| \| `ort` \| `onnxruntime-web` \| The default target. \| \| `ort.all` \| `onnxruntime-web/all` \| The target including webgl. \| \| `ort.node` \| `onnxruntime-web` \| The default target for Node.js. \| \| `ort.training` \| `onnxruntime-web/training` \| The target including training APIs \| \| `ort.wasm` \| `onnxruntime-web/wasm` \| The target including only WebAssembly (CPU) EP \| \| `ort.webgl` \| `onnxruntime-web/webgl` \| The target including only WebGL EP \| For each target, there are multiple files generated: \| File Name \| Description \| \|------\|-----\| \| [target].js \| The entry point for the target. IIFE and CommonJS format. \| \| [target].mjs \| The entry point for the target. ESM format. \| \| [target].min.js <br/> [target].min.js.map \| The entry point for the target. Minimized with sourcemap. IIFE and CommonJS format. \| \| [target].min.mjs <br/> [target].min.mjs.map \| The entry point for the target. Minimized with sourcemap. ESM format. \| \| [target].proxy.mjs \| (if appliable) The proxy ESM module for the target. \| \| [target].proxy.min.mjs <br/> [target].proxy.min.mjs.map \| (if appliable) The proxy ESM module for the target. Minimized with sourcemap. 
\| </details> <details> <summary><h4>Dynamic Import Explained</h4></summary> - Local Served \| No Proxy: ``` [Bundle or ort.min.js] \| + import()--> [ort-wasm-simd-threaded.mjs] \| + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm] \| + new Worker()--> [ort-wasm-simd-threaded.mjs (worker)] \| + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm] ``` - Local Served \| Proxy: ``` [Bundle or ort.min.js] \| + import()--> [ort.proxy.min.mjs] \| + new Worker()--> [ort.proxy.min.mjs (worker)] \| + import()--> [ort-wasm-simd-threaded.mjs] \| + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm] \| + new Worker()--> [ort-wasm-simd-threaded.mjs (worker)] \| + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm] ``` - Cross Origin \| No Proxy: ``` [Bundle or ort.min.js] \| + fetch('ort-wasm-simd-threaded.mjs') \| + URL.createObjectURL(res.blob()) \| + import()--> [blob:... (ort-wasm-simd-threaded)] \| + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm] \| + new Worker()--> [blob:... (ort-wasm-simd-threaded) (worker)] \| + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm] ``` - Cross Origin \| Proxy ``` [Bundle or ort.min.js] \| + fetch('ort.proxy.min.mjs') \| + URL.createObjectURL(res.blob()) \| + import()--> [blob:... (ort.proxy)] \| + new Worker()--> [blob:... (ort.proxy) (worker)] \| + fetch('ort-wasm-simd-threaded.mjs') \| + URL.createObjectURL(res.blob()) \| + import()--> [blob:... (ort-wasm-simd-threaded)] \| + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm] \| + new Worker()--> [blob:... (ort-wasm-simd-threaded) (worker)] \| + WebAssembly.instantiateStreaming()--> [ort-wasm-simd-threaded.wasm] ``` </details>	2024-05-20 09:51:16 -07:00
cloudhan	5d07291247	hipify int4 gemv (#20666 ) Hipify MatMulNBits to accommodate the need of Phi3 onnx release.	2024-05-18 16:59:03 +08:00
Edward Chen	e81c8676e3	MatMulNBits + Add fusion (#20587 ) - Add MatMulNBits Bias input - Add graph transformer to fuse MatMulNBits + Add	2024-05-16 11:00:59 -07:00
Hector Li	0e11d0c4f8	Enable Qnn nuget nightly (#20662 ) ### Description Enable Qnn nuget nightly	2024-05-13 21:28:43 -07:00
Jon Campbell	768c79317c	Enable QNN HTP support for Node (#20576 ) ### Description Add support for using Onnx Runtime with Node ### Motivation and Context Onnx Runtime supports the QNN HTP, but does not support it for Node.js. This adds baseline support for the Onnx Runtime to be used with Node. Note it does not update the node packages that are distributed officially. This simply patches the onnxruntime.dll to allow 'qnn' to be used as an execution provider. Testing was done using the existing onnxruntime-node package. The `onnxruntime.dll` and `onnxruntime_binding.node` were swapped into `node_modules\onnxruntime-node\bin\napi-v3\win32\arm64` with the newly built version, then the various QNN dlls and .so files were placed next to the onnxruntime.dll. Testing was performed on a variety of models and applications, but the easiest test is to modify the [node quickstart example](https://github.com/microsoft/onnxruntime-inference-examples/tree/main/js/quick-start_onnxruntime-node).	2024-05-09 13:11:07 -07:00
George Wu	58d7b12205	support --arm64ec for qnn ep build (#20607 ) link against binaries in arm64x-windows-msvc when building qnn ep with --arm64ec build option.	2024-05-08 11:09:15 -07:00
Scott McKay	8d09baf49f	Clarify when protobuf dependency builds protoc (#20542 ) ### Description Currently, figuring out whether the protobuf dependency is building protoc is a little obtuse and inconsistent: * in some places we directly set protobuf_BUILD_PROTOC_BINARIES to OFF to indicate the protobuf dependency is not building protoc * e.g. macOS/iOS/visionOS builds * for a user-provided protoc path we don't set protobuf_BUILD_PROTOC_BINARIES, and the logic inside protobuf_function.cmake determines whether `protobuf::protoc` is added as a dependency * `0dda8b0c44/cmake/external/protobuf_function.cmake (L40-L45)` To be more consistent/explicit, set protobuf_BUILD_PROTOC_BINARIES to OFF when ONNX_CUSTOM_PROTOC_EXECUTABLE is set and valid. Remove the outdated script that built an external protoc binary for use in later builds. The build setup will fetch a pre-built protoc, so there's no need for this additional build. ### Motivation and Context Make it easier to figure out if protoc is coming from the protobuf dependency.	2024-05-08 08:30:11 +10:00
moyo1997	aff04ba08a	Dev/mookerem/arm64x update (#20536 ) Made some changes to the arm64x.cmake script to: - handle edge case - Enable Projects that include onnxruntime as submodule and build it, to be able to build as x without causing onnxruntime build_as_x to fail.	2024-05-07 12:50:38 -07:00
Chi Lo	c86476a636	[TensorRT] adapt for TRT lib name change after TRT 10 GA (update) (#20550 ) https://github.com/microsoft/onnxruntime/pull/20445 The nvonnxparser still needs major version appending to it when building oss parser.	2024-05-06 15:00:13 -07:00
Scott McKay	f9febc4f35	Remove usage of 'required reason' iOS API from protobuf (#20529 ) ### Description Using certain APIs is about to require a [privacy manifest](https://developer.apple.com/documentation/bundleresources/privacy_manifest_files/describing_use_of_required_reason_api) to be added to a package. Our version of protobuf uses `mach_absolute_time`. Patch as per https://github.com/protocolbuffers/protobuf/pull/15662/ to remove usage. ### Motivation and Context Usage of API will require a privacy manifest for an iOS app to be accepted as of 5/1/2024 #20519	2024-05-02 08:21:08 +10:00
Yifan Li	29417762f7	[TensorRT EP] support TensorRT 10-GA (#20506 ) ### Description This branch is based on rel-1.18.0 and supports TensorRT 10-GA. ### Motivation and Context	2024-05-01 11:10:53 -07:00
Hector Li	755aaea9a6	Qnn nuget update (#20527 ) ### Description Update Qnn nuget package to include Qnn libs and license file	2024-04-30 22:12:53 -07:00
Tianlei Wu	9f0fae29e8	[CUDA] Add SparseAttention operator for Phi-3-small (#20216 ) ### Description Add CUDA implementation for block sparse attention for Phi-3-small. Block sparse attention was proposed in [Sparse Transformers](https://arxiv.org/pdf/1904.10509) by OpenAI, and also adopted in [BigBird](https://arxiv.org/pdf/2007.14062) with different sparse layout. In Phi-3-small, the sparse layout is static, and works with unidirectional (causal) attention. Compared to dense attention, the benefit of block sparse is to speed up both training and inference. It could save memory thus support longer context length. - [x] Add operator spec and shape inference - [x] Symbolic shape inference - [x] Refactor GroupQueryAttention to expose common kernels for kv cache concatenation, q/k/v transpose etc. - [x] Add cuda kernel to convert block mask to CSR format - [x] Add cuda kernel to generate position ids - [x] Add compile script and template files to convert triton kernel to cubin and dispatcher. - [x] Add triton kernel v1 for prompt - [x] Add triton kernel v2 for token generation and support padding - [x] Update IO Binding Helper to allow buffer sharing. - [x] Test relevance - [x] Test performance ### Performance Test in A100-SXM4-80GB with `batch_size=4, num_heads=32, max_seq_len=8192, head_size=128, sparse_block_size=64, local_blocks=16, vert_stride=8, num_layout=8` We compare sparse attention to corresponding GQA with local attention windows size 1024, or GQA with dense causal. 
Average latency in milliseconds (for fused attention kernel used in prompt prefilling):

| seq_len | GQA-Dense | GQA-Local | SparseAttention |
| -- | -- | -- | -- |
| 64 | 0.0465 | 0.0722 | 0.0641 |
| 128 | 0.0618 | 0.0787 | 0.0672 |
| 256 | 0.1086 | 0.1076 | 0.0943 |
| 512 | 0.2535 | 0.2487 | 0.1676 |
| 1024 | 0.7042 | 0.7050 | 0.3800 |
| 2048 | 2.4125 | 1.9316 | 0.8966 |
| 4096 | 8.9346 | 4.5699 | 2.1129 |
| 8192 | 40.5401 | 10.3508 | 5.1748 |

Average latency in milliseconds (for fused attention kernel used in token generation):

| past_seq_len | GQA-Dense | GQA-Local | SparseAttention |
| -- | -- | -- | -- |
| 64 | 0.0186 | 0.0186 | 0.0870 |
| 128 | 0.0408 | 0.0466 | 0.1165 |
| 256 | 0.0530 | 0.0592 | 0.0988 |
| 512 | 0.0445 | 0.0447 | 0.1150 |
| 1024 | 0.0634 | 0.0640 | 0.1454 |
| 2048 | 0.1027 | 0.0637 | 0.1589 |
| 4096 | 0.1789 | 0.0631 | 0.1806 |
| 8192 | 0.3288 | 0.0655 | 0.2146 |

We can see that the kernel for token generation still has room to improve. #### Limitations Only right-side padding and unidirectional attention are supported. The following are not supported in the first version: (1) Packed mode like PackedMultiHeadAttention, where padding has been removed from the input. (2) Paged attention. (3) Bidirectional attention. (4) GPU compute capability other than 8.0, 8.6, or 8.9. (5) Left-side padding. Some of these limitations will be removed in the future (maybe in a new operator).	2024-04-30 09:06:29 -07:00
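To put the latency figures above in relative terms, here is a small sketch that computes speedup ratios from a few rows of the two tables. The numbers are copied from the commit message. The `speedup` helper and the dictionary layout are illustrative assumptions, not part of ONNX Runtime.

```python
# Latencies in milliseconds, copied from the tables above:
# (GQA-Dense, GQA-Local, SparseAttention) keyed by sequence length.
prompt_ms = {
    1024: (0.7042, 0.7050, 0.3800),
    8192: (40.5401, 10.3508, 5.1748),
}
token_gen_ms = {
    64: (0.0186, 0.0186, 0.0870),
    8192: (0.3288, 0.0655, 0.2146),
}


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """Ratio > 1 means the candidate kernel is faster than the baseline."""
    return baseline_ms / candidate_ms


for name, table in (("prompt", prompt_ms), ("token generation", token_gen_ms)):
    for seq_len, (dense, local, sparse) in table.items():
        print(
            f"{name:>16} seq_len={seq_len:>5}: "
            f"vs dense {speedup(dense, sparse):.2f}x, "
            f"vs local {speedup(local, sparse):.2f}x"
        )
```

Ratios below 1 in the token generation rows correspond to the commit's remark that this kernel still has room to improve.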
Edward Chen	358f5bb022	Update regex to match correct pattern. (#20483 ) In CMakeLists.txt:set_msvc_c_cpp_compiler_warning_level(), the regex should match the value that gets added by the function. The latter got updated, so this change updates the former to match.	2024-04-29 10:43:31 -07:00
George Wu	49b2bebe85	[qnn ep] include qnn sdk in onnxruntime-qnn python whl (#20485 ) script changes to include qnn sdk libs with onnxruntime-qnn python package.	2024-04-29 09:44:54 -07:00
zz002	50e41984b0	[VitisAI] Solve the problem that gsl cannot be found when compiling under linux (#20466 ) ### Description [VitisAI] Solve the problem that gsl cannot be found when compiling under linux ### Motivation and Context Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>	2024-04-28 20:56:16 -07:00
Hector Li	d2d4639ddb	fix the build issue for Win Arm64 Release build (#20475 ) ### Description Fix the build error for Win ARM64 Release build. graph_transform_test.cc(1,1): error C1128: number of sections exceeded object file format limit: compile with /bigobj [D:\build\Windows\Release\onnxruntime_test_all.vcxproj] ### Motivation and Context Fix issue: https://github.com/microsoft/onnxruntime/issues/20406	2024-04-25 22:08:19 -07:00
liqun Fu	cc26b2dac2	Mlas Gemm 4bit avx2, avx512, and avx512vnni kernels (#20163 ) ### Description
```
Avx2:
Int8        NS(Prompt)  MLAS(Prompt)  MLAS(Prompt)Gain/Loss  NS(TokenGen)  MLAS(TokenGen)  MLAS(TokenGen)Gain/Loss
Blklen16:   90.96       25.15         -72%                   7.65          11.71           53%
Blklen32:   90.73       48.55         -46%                   7.86          14.28           81%
Blklen64:   89.49       68.84         -23%                   8.30          15.78           90%
Blklen128:  87.38       78.37         -10%                   7.90          16.05           103%
Blklen256:  89.45       82.36         -7%                    8.30          16.56           99%

Fp32        NS(Prompt)  MLAS(Prompt)  MLAS(Prompt)Gain/Loss  NS(TokenGen)  MLAS(TokenGen)  MLAS(TokenGen)Gain/Loss
Blklen16:   91.36       105.18        15%                    7.57          9.52            25%
Blklen32:   89.30       105.99        18%                    7.65          9.68            26%
Blklen64:   89.53       101.41        13%                    7.97          9.84            23%
Blklen128:  85.23       99.71         16%                    7.86          10.39           32%
Blklen256:  88.46       97.94         10%                    8.32          10.23           22%

Avx512vnni:
Int8        NS(Prompt)  MLAS(Prompt)  MLAS(Prompt)Gain/Loss  NS(TokenGen)  MLAS(TokenGen)  MLAS(TokenGen)Gain/Loss
Blklen16:   132.18      21.56         -83%                   10.34         11.48           11%
Blklen32:   168.28      43.69         -74%                   11.85         14.73           24%
Blklen64:   201.81      60.29         -70%                   12.36         15.47           25%
Blklen128:  194.92      57.04         -71%                   13.03         14.67           12%
Blklen256:  218.76      70.20         -68%                   13.33         16.31           22%

Fp32        NS(Prompt)  MLAS(Prompt)  MLAS(Prompt)Gain/Loss  NS(TokenGen)  MLAS(TokenGen)  MLAS(TokenGen)Gain/Loss
Blklen16:   102.81      92.74         -9%                    8.41          9.18            9%
Blklen32:   109.49      97.08         -11%                   8.83          11.51           30%
Blklen64:   104.13      101.57        -2%                    9.32          12.00           28%
Blklen128:  108.45      103.69        -4%                    9.58          12.45           29%
Blklen256:  109.43      106.43        -2%                    9.19          12.2            32%
```
--------- Signed-off-by: Liqun Fu <liqfu@microsoft.com> Signed-off-by: liqunfu <liqun.fu@microsoft.com> Co-authored-by: edgchen1 <18449977+edgchen1@users.noreply.github.com>	2024-04-25 21:30:50 -07:00
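The Gain/Loss columns above appear to be the relative change of the MLAS figure against the NS figure. That reading is an inference from the numbers, not something the commit states, and the unit of the raw figures is not given. A minimal sketch checking it against the Avx2 Int8 Blklen16 row (`gain_loss_percent` is a hypothetical helper):

```python
def gain_loss_percent(ns: float, mlas: float) -> float:
    """Relative change of MLAS versus NS, in percent (assumed definition)."""
    return (mlas - ns) / ns * 100.0


# Avx2, Int8, Blklen16 row from the table above.
prompt_change = gain_loss_percent(ns=90.96, mlas=25.15)
token_gen_change = gain_loss_percent(ns=7.65, mlas=11.71)

print(f"prompt: {prompt_change:.0f}%")
print(f"token generation: {token_gen_change:.0f}%")
```

For this row the rounded results agree with the table's -72% and 53%. Other rows may differ by a percentage point, presumably because the published figures were computed from unrounded measurements.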
Chi Lo	a077330c3e	[TensorRT] adapt for TRT lib name change after TRT 10 GA (#20445 ) For TensorRT 10 GA onwards, the TensorRT libraries will have major version appended to the end on Windows, for example, nvinfer_10.dll, nvinfer_plugin_10.dll, nvonnxparser_10.dll ... Change cmake file accordingly.	2024-04-24 21:46:54 -07:00
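The naming rule in the commit above can be illustrated with a short sketch. `windows_trt_dll_name` is a hypothetical helper and not part of the ONNX Runtime build scripts. The unsuffixed name for versions before 10 is an assumption drawn from the commit's wording.

```python
def windows_trt_dll_name(base: str, major_version: int) -> str:
    """Return the expected Windows DLL name for a TensorRT library.

    From TensorRT 10 GA onwards the major version is appended to the name
    (per the commit message). Earlier versions are assumed to be unsuffixed.
    """
    if major_version >= 10:
        return f"{base}_{major_version}.dll"
    return f"{base}.dll"


for lib in ("nvinfer", "nvinfer_plugin", "nvonnxparser"):
    print(windows_trt_dll_name(lib, 10))
```

For TensorRT 10 this yields the three names listed in the commit message: `nvinfer_10.dll`, `nvinfer_plugin_10.dll` and `nvonnxparser_10.dll`.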
Edward Chen	9cc5badc49	Fix Objective-C static analysis warnings. (#20417 ) Replace most usages of [NSString stringWithUTF8String:] with checked helper function. The issue is that the former can return nil.	2024-04-24 11:48:29 -07:00
Rachel Guo	14fcf0a52d	Support visionos build (#20365 ) ### Description This PR supports building onnxruntime.xcframework for xros/xrsimulator (visionOS) via the build command `python3 tools/ci_build/github/apple/build_apple_framework.py --config Release/Debug tools/ci_build/github/apple/default_vision_os_framework_build_settings.json`. Officially including visionOS in the iOS CocoaPods package and testing it in CI would require separate work to upgrade the Xcode version and the macOS CI agent to macos-13-arm64 or higher. ### Motivation and Context visionos support: https://github.com/microsoft/onnxruntime/discussions/19313 --------- Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net> Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>	2024-04-23 18:15:07 -07:00
Scott McKay	9372e9a0a3	Support >2GB of Tensor data in training checkpoint (#20077 ) ### Description Add ability to store initializer data in an external file. Update training checkpoint code to use external file if data > ~2GB. I don't see a way for the flatbuffers 64-bit offsets to be used, as they don't support storing 'table' types with 64-bit offsets (and our Tensor is a 'table' type not a simple struct). `0cfb7eb80b/tests/64bit/test_64bit.fbs (L38-L39)` Allowing a Tensor to have its raw_data in an external file should hopefully work with the least friction. As it's an extra field it's backwards compatible. Please feel free to suggest alternative approaches. Side note: the diffs in the generated *.fbs.h files are unexpectedly large. Maybe they weren't re-generated when the new flatbuffers version was checked in. I updated by running: `python .\compile_schema.py -f <build output dir>\_deps\flatbuffers-build\Debug\flatc.exe` from onnxruntime\core\flatbuffers\schema which I thought was the correct way but maybe that's out of date. I think you can ignore all the diffs in the generated files and just worry about the changes to the .fbs files in onnxruntime/core/flatbuffers/schema. Basically start at the bottom of the files changed and work up as all the 'real' diffs are there. ### Motivation and Context --------- Co-authored-by: carzh <wolfivyaura@gmail.com>	2024-04-22 15:17:43 -07:00
Yulong Wang	a457c1df80	upgrade emsdk to 3.1.57 (#20295 ) ### Description upgrade emsdk to 3.1.57	2024-04-19 23:05:18 -07:00


1776 commits