onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-05-23 22:13:38 +00:00

Author	SHA1	Message	Date
Jing Fang	7fa69461fd	[ARM] MatMulNBits FP16 support - kernels only (#22806) ### Description A breakdown PR of https://github.com/microsoft/onnxruntime/pull/22651 that adds the fp16 kernels.	2024-11-12 14:28:47 -08:00
Ranjit Ranjan	193671295e	[AIX] Fix for AIX build break (#22745) ### Description With recent changes, the following build error occurs on AIX. ``` ld: 0706-012 The -p flag is not recognized. ld: 0706-012 The -a flag is not recognized. ld: 0706-012 The -t flag is not recognized. ld: 0706-012 The -h flag is not recognized. ld: 0706-012 The -= flag is not recognized. ld: 0706-012 The -$ flag is not recognized. ld: 0706-012 The -$ flag is not recognized. ld: 0706-012 The -O flag is not recognized. ld: 0706-027 The -R IGIN flag is ignored. collect2: error: ld returned 255 exit status ``` ### Motivation and Context The AIX linker doesn't support the -rpath option, so this option is blocked on AIX.	2024-11-07 13:22:22 -08:00
Changming Sun	88676e62b9	Remove nsync (#20413) ### Description 1. Remove the onnxruntime::OrtMutex class and replace it with ~absl::Mutex~ std::mutex. 2. After this change, most source files will not include <Windows.h> indirectly. ### Motivation and Context To reduce the number of deps we have, and to address some GitHub issues related to building ONNX Runtime from source. In PR #3000, I added a custom implementation of std::mutex, mainly because at that time std::mutex's default constructor was not trivial on Windows: a mutex declared as a global variable could not be initialized at compile time. The VC++ team has since fixed this issue, so we no longer need the custom implementation. This PR also removes nsync. I ran several model tests on Linux and didn't see any perf difference. This PR also reverts PR #21005, which is no longer needed since conda has updated its msvc runtime DLL. This PR unblocks #22173 and resolves #22092. We have a lot of open issues with nsync. This PR can resolve all of them.	2024-10-21 15:32:14 -07:00
Jing Fang	1942e40e05	[ARM64] MatMulNBits: use neon intrinsics to convert between fp16 and fp32 (#22195) ### Description For fp16 Atype, the fallback operation is to convert the data to fp32 and then calculate. Added a neon intrinsics version to speed up the conversion. Store address alignment and loop unrolling have an insignificant impact on latency, so they are omitted. \|Benchmark \| Time \| CPU \| \|--------------\|---------------------------------------------\|--------------------\| \|M_ConvertF16ToF32/baseline/real_time \| 1076961 ns \| 1083398 ns \| \|M_ConvertF16ToF32/aligned:0/real_time \| 46785 ns \| 46516 ns \| \|M_ConvertF16ToF32/aligned:1/real_time \| 46631 ns \| 46391 ns \| \|M_ConvertF16ToF32_unroll2/aligned:0/real_time \| 44074 ns \| 44392 ns \| \|M_ConvertF16ToF32_unroll2/aligned:1/real_time \| 44726 ns \| 45226 ns \| \|M_ConvertF32ToF16/baseline/real_time \| 520109 ns \| 527329 ns \| \|M_ConvertF32ToF16/aligned:0/real_time \| 73610 ns \| 74015 ns \| \|M_ConvertF32ToF16/aligned:1/real_time \| 71557 ns \| 71525 ns \| \|M_ConvertF32ToF16_unroll2/aligned:0/real_time \| 64227 ns \| 63374 ns \| \|M_ConvertF32ToF16_unroll2/aligned:1/real_time \| 67428 ns \| 67989 ns \| ### Motivation and Context Speed up the fallback implementation of Fp16 MatMulNBits	2024-09-26 13:55:40 -07:00
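To make concrete what the fp16 to fp32 fallback conversion computes per element, here is a minimal scalar sketch in Python. It is not the NEON kernel from the PR: the function names `half_bits_to_float` and `convert_f16_to_f32` are hypothetical, and the code only decodes the IEEE 754 half-precision bit layout one value at a time, which is the work the intrinsics version performs several lanes at once.

```python
import struct


def half_bits_to_float(h):
    """Decode one 16-bit IEEE 754 half-precision pattern into a Python float."""
    sign = -1.0 if (h >> 15) & 1 else 1.0
    exponent = (h >> 10) & 0x1F
    mantissa = h & 0x3FF
    if exponent == 0:
        # Zero or subnormal: no implicit leading 1, fixed exponent of -14,
        # so the value is mantissa / 1024 * 2**-14 == mantissa * 2**-24.
        return sign * mantissa * 2.0 ** -24
    if exponent == 0x1F:
        # All-ones exponent: infinity when the mantissa is 0, otherwise NaN.
        return sign * float("inf") if mantissa == 0 else float("nan")
    # Normal number: implicit leading 1 and an exponent bias of 15.
    return sign * (1.0 + mantissa / 1024.0) * 2.0 ** (exponent - 15)


def convert_f16_to_f32(src):
    """Convert a little-endian buffer of fp16 values, one element at a time."""
    count = len(src) // 2
    return [half_bits_to_float(h) for h in struct.unpack("<%dH" % count, src)]


# Cross-check against the standard library's own half-precision codec.
values = [0.0, 1.0, -2.5, 65504.0, 2.0 ** -14, 2.0 ** -24]
packed = struct.pack("<%de" % len(values), *values)
assert convert_f16_to_f32(packed) == values
```

Every fp16 value is exactly representable in fp32, so the widening direction is lossless; only the fp32 to fp16 direction needs rounding.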
liqun Fu	a89bddd5c2	Matmul_nbits kernel for mlas sqnbits to support Fp16 inputs (#21807)	2024-09-13 14:55:08 -07:00
wangshuai09	d539c27de8	Fix version check for using -mavxvnni (#21616) ### Description Change the `CMAKE_CXX_COMPILER_VERSION` check to greater than `11` for using `-mavxvnni`. ### Motivation and Context Building with `gcc (GCC) 10.3.1` fails with `CMakeFiles/onnxruntime_mlas.dir/root/Git.d/onnxruntime/onnxruntime/core/mlas/lib/x86_64/QgemmU8S8KernelAvx2.S.o cc: error: unrecognized command-line option ‘-mavxvnni’; did you mean ‘-mavx512vnni’?`. `-mavxvnni` is supported since the [GCC 11 release](https://gcc.gnu.org/gcc-11/changes.html), so this PR changes the version check.	2024-09-12 11:42:17 -07:00
Erick Muñoz	7489bfee53	Enable AVX NE CONVERT for FP16 to FP32 cast (#21183) ### Description Implementation of a new cast assembly kernel that uses AVX_NE_CONVERT instructions to accelerate casting from FP16 to FP32. Added CPUID checks to determine support of the ISA. ### Motivation and Context Currently, FP16 models executed on systems that lack complete FP16 operator support run every node in single precision, so the original FP16 weights have to be cast to FP32 to run the model properly. This change accelerates that cast with upconvert instructions and thereby improves performance.	2024-09-09 21:19:31 -07:00
liqun Fu	b87e8edb98	Mlas int4 int8 with avx2/512 (#20687 ) ### Description model: phi-3-mini-4k-instruct avx2 symmetric blklen\|updated prompt tps \| baseline prompt tps \| prompt tps change%\|updated token gen tps \| baseline token gen tps \| token gen change% -\|-\|-\|-\|-\|-\|- 16 \|49.5\|70.0\|-29.2%\|9.6\|10.8\|-34.2% 32 \|76.8\|52.4\|9.7%\|15.2\|14.6\|4.1% 64 \|78.2\|71.4\|9.5%\|16.6\|16.3\|1.8% 128 \|72.9\|70.6\|3.2%\|17.1\|16.8\|1.7% 256 \|83.7\|63.6\|31.6%\|18.1\|17.4\|4% avx2 asymmetric blklen\|updated prompt tps \| baseline prompt tps \| prompt tps change%\|updated token gen tps \| baseline token gen tps \| token gen change% -\|-\|-\|-\|-\|-\|- 16 \|50.7\|61.5\|-17.5%\|9.6\|9.2\|4.3% 32 \|77.4\|52.4\|47.7%\|14.6\|13.9\|5.0% 64 \|78.7\|63.0\|24.9%\|16.2\|15.9\|1.8% 128 \|80.0\|61.9\|29.2%\|17.2\|16.9\|1.7% 256 \|81.5\|63.3\|28.7%\|17.9\|17.3\|3.4% avx2vnni symmetric blklen\|updated prompt tps \| baseline prompt tps \| prompt tps change%\|updated token gen tps \| baseline token gen tps \| token gen change% -\|-\|-\|-\|-\|-\|- 16 \|82.9\|117.0\|-29.0%\|15.9\|19.3\|-17.6% 32 \|133.0\|100.4\|32.4%\|26.1\|24.5\|6.5% 64 \|166.9\|118.8\|40.4%\|28.3\|27.1\|4.4% 128 \|165.9\|119.6\|38.7%\|29.3\|28.5\|2.8% 256 \|165.2\|119.6\|38.1%\|30.2\|29.0\|4.1% avx2vnni asymmetric blklen\|updated prompt tps \| baseline prompt tps \| prompt tps change%\|updated token gen tps \| baseline token gen tps \| token gen change% -\|-\|-\|-\|-\|-\|- 16 \|80.2\|118.9\|-32.5%\|15.1\|16.7\|-9.5% 32 \|130.7\|99.7\|31.0%\|25.0\|23.8\|5.0% 64 \|168.7\|124.9\|35.0%\|27.3\|26.8\|1.8% 128 \|169.6\|123.8\|36.9%\|29.2\|27.9\|4.6% 256 \|175.0\|125.7\|39.0%\|30.0\|29.7\|1.0% avx512 symmetric blklen\|updated prompt tps \| baseline prompt tps \| prompt tps change%\|updated token gen tps \| baseline token gen tps \| token gen change% -\|-\|-\|-\|-\|-\|- 16 \|135.2\|156.5\|-13.6\|25.5\|23.8\|7.1 32 \|150.0\|159.5\|-5.9\|34.9\|29.6\|17.9 64 \|167.5\|157.5\|6.3\|39.7\|34.4\|15.4 128 
\|177.8\|158.0\|12.5\|40.3\|35.4\|13.8 256 \|182.6\|157.3\|16.0\|41.7\|37.7\|10.6 avx512 asymmetric blklen\|updated prompt tps \| baseline prompt tps \| prompt tps change%\|updated token gen tps \| baseline token gen tps \| token gen change% -\|-\|-\|-\|-\|-\|- 16 \|136.1\|151.4\|-10.1%\|26.1\|19.9\|31.1% 32 \|150.0\|157.8\|-4.9%\|34.3\|29.3\|17.0% 64 \|165.7\|156.6\|5.8%\|38.7\|30.7\|26.0% 128 \|180.4\|156.6\|15.1%\|40.2\|34.7\|15.8% 256 \|181.3\|158.0\|14.7%\|41.6\|36.6\|13.6% avx512vnni symmetric blklen\|updated prompt tps \| baseline prompt tps \| prompt tps change%\|updated token gen tps \| baseline token gen tps \| token gen change% -\|-\|-\|-\|-\|-\|- 16 \|143.4\|155.4\|-7.7%\|25.6\|23.3\|9.8% 32 \|159.2\|157.0\|1.4%\|34.1\|29.8\|14.4% 64 \|182.0\|159.5\|14.1%\|38.4\|34.8\|10.3% 128 \|221.2\|160.8\|37.5%\|41.0\|36.4\|12.6% 256 \|250.5\|162.4\|54.2%\|41.6\|37.7\|10.3% avx512vnni asymmetric blklen\|updated prompt tps \| baseline prompt tps \| prompt tps change%\|updated token gen tps \| baseline token gen tps \| token gen change% -\|-\|-\|-\|-\|-\|- 16 \|142.5\|152.3\|-6.4%\|26.3\|19.7\|33.5% 32 \|158.2\|155.0\|2.0%\|34.3\|29.2\|17.4% 64 \|184.1\|156.6\|17.5%\|38.3\|30.9\|23.9% 128 \|215.8\|156.1\|17.5%\|41.3\|35.0\|17.9% 256 \|249.2\|155.9\|59.8%\|41.1\|36.3\|13.2% 4-bit GEMM implementation with AVX using tiles. 1. The tile size is 2 blks by 4. When the remaining size is less than a tile, it reduces to 1 blk by 4, 2 blks by 1, and lastly 1 blk by 1. Within the internal kernel, weights and activations are loaded based on the SIMD register width and the blk length: avx2 256bit register, 64 weights and activations are loaded. blklen16: 4 blks are computed by the internal kernel blklen32: 2 blks are computed by the internal kernel blklen64: 1 blk is computed by the internal kernel blklen128: 1 blk is computed in 2 passes of the internal kernel blklen256: 1 blk is computed in 4 passes of the internal kernel avx512 512bit register, 128 weights and activations are loaded. 
blklen16: 8 blks are computed by the internal kernel blklen32: 4 blks are computed by the internal kernel blklen64: 2 blks are computed by the internal kernel blklen128: 1 blk is computed by the internal kernel blklen256: 1 blk is computed in 2 passes of the internal kernel 2. blksum is precomputed during prepacking. The computation is reformulated as: Sum1(scale_a * scale_b * Sum_blk(a_i * b_i)) + Sum2(blksum_a * blksum_b) Sum_blk is over one blk Sum1 is over all blks for one output Sum2 is over all blks for one output Sum is computed with sgemm with the current implementation. Further improvement is possible. --------- Signed-off-by: Liqun Fu <liqfu@microsoft.com> Signed-off-by: liqunfu <liqun.fu@microsoft.com> Signed-off-by: Liqun Fu <liqun_fu@hotmail.com>	2024-08-02 10:20:22 -07:00
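The blksum reformulation in the entry above can be checked numerically with a small Python sketch. The definitions used here are assumptions for illustration, not taken from the PR: activations are quantized symmetrically (`a = scale_a * qa`), weights carry a per-block zero point (`b = scale_b * (qb - zp_b)`), `blksum_a` is `scale_a * sum(qa)` over a block, and `blksum_b` is `-scale_b * zp_b`. Under those assumptions the zero-point correction separates into the `Sum2` term, leaving a pure integer dot product inside `Sum1`.

```python
import math
import random


def dot_direct(qa, scale_a, qb, scale_b, zp_b, blklen):
    """Dequantize every element, then accumulate the dot product."""
    total = 0.0
    for i in range(len(qa)):
        blk = i // blklen
        a = scale_a[blk] * qa[i]
        b = scale_b[blk] * (qb[i] - zp_b[blk])
        total += a * b
    return total


def dot_with_blksum(qa, scale_a, qb, scale_b, zp_b, blklen):
    """Sum1(scale_a * scale_b * Sum_blk(a_i * b_i)) + Sum2(blksum_a * blksum_b)."""
    sum1 = 0.0
    sum2 = 0.0
    for blk, start in enumerate(range(0, len(qa), blklen)):
        end = start + blklen
        # Integer-only inner product over one block (Sum_blk).
        int_dot = sum(x * y for x, y in zip(qa[start:end], qb[start:end]))
        sum1 += scale_a[blk] * scale_b[blk] * int_dot
        # Assumed block sums; blksum_b depends only on the weights, so it
        # can be computed once ahead of time.
        blksum_a = scale_a[blk] * sum(qa[start:end])
        blksum_b = -scale_b[blk] * zp_b[blk]
        sum2 += blksum_a * blksum_b
    return sum1 + sum2


random.seed(0)
blklen, nblk = 16, 4
qa = [random.randint(-127, 127) for _ in range(blklen * nblk)]  # int8 activations
qb = [random.randint(0, 15) for _ in range(blklen * nblk)]      # 4-bit weights
scale_a = [random.uniform(0.01, 0.1) for _ in range(nblk)]
scale_b = [random.uniform(0.01, 0.1) for _ in range(nblk)]
zp_b = [random.randint(0, 15) for _ in range(nblk)]

direct = dot_direct(qa, scale_a, qb, scale_b, zp_b, blklen)
split = dot_with_blksum(qa, scale_a, qb, scale_b, zp_b, blklen)
assert math.isclose(direct, split, rel_tol=1e-9, abs_tol=1e-9)
```

Because `Sum2` is a plain floating-point dot product over per-block sums, it has the shape of a small GEMM, which is consistent with the entry's note that the sum is currently computed with sgemm.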
Ranjit Ranjan	6c7562b097	Enablement of onnxruntime for AIX and fixing issues related to big-endian platform. (#21133 ) ### Description Enablement of onnxruntime for AIX and fixing issues related to big-endian platform. ### Motivation and Context changes in this PR contains: 1. Enablement code for building onnxruntime on AIX operating system. 2. while testing the build on AIX, we found issues related to big endian platform . More details about few of those issues can be found in [Big endian issue: Graph Transformation Attention Fusion tests are failing #12921](https://github.com/microsoft/onnxruntime/issues/12921) Below are list of files and the description about the change. 1. cmake/CMakeLists.txt [BUILDING on AIX issue] check for "IBMClang" is added for handling -Wno-unused-parameter 2. cmake/external/onnxruntime_external_deps.cmake [BUILDING on AIX issue]Enabling gtest_disable_pthreads for AIX 3. cmake/onnxruntime.cmake [BUILDING on AIX issue] o Blocking codes for AIX which generates generated_source.c and further requires some symbol files. o Putting NO AIX check for non-supported linker flags like --Xlinker o iconv linking 4. cmake/onnxruntime_framework.cmake [BUILDING on AIX issue]Putting NO AIX check for -Wl,-rpath='$ORIGIN' 5. cmake/onnxruntime_mlas.cmake [BUILDING on AIX issue]POWER10 releated macro/function definition . 6. cmake/onnxruntime_providers_cpu.cmake [BUILDING on AIX issue]Putting NO AIX check for non-supported linker flags like --Xlinker 7. cmake/onnxruntime_unittests.cmake [BUILDING on AIX issue] o Putting NO AIX check for non-supported linker flags like --Xlinker o Adding required libraries for AIX linker under applicatiion like onnxruntime_shared_lib_test ,onnxruntime_logging_apis etc 8. cmake/patches/flatbuffers/flatbuffers.patch [BUILDING on AIX issue] Handling of TypeCode in include/flatbuffers/flatbuffers.h under AIX + clang 9. 
onnxruntime/contrib_ops/cpu/murmur_hash3.cc [Big endian issue] Byte-Conversion handlling in compute() and getblock() routines 10. onnxruntime/contrib_ops/cpu/quantization/matmul_nbits_impl.cc [Big endian issue] Handling of test failures . Byte swapping for quant_value. 11. onnxruntime/core/framework/tensorprotoutils.cc [Big endian issue] Implementation of SetRawDataInTensorProto , ConvertRawDataInTensorProto . o SetRawDataInTensorProto : Wrapper for set_raw_data(). Calling ConvertRawDataInTensorProto() in big-endian system o ConvertRawDataInTensorProto : function used mainly on big-endian system for byte-swapping of tensor raw_data 12. onnxruntime/core/framework/tensorprotoutils.h [Big endian issue] Declaration of SetRawDataInTensorProto, ConvertRawDataInTensorProto 13. onnxruntime/core/graph/graph.cc [Big endian issue] o Call ConvertRawDataInTensorProto for SPARSE_TENSOR type o Call ConvertRawDataInTensorProto for SaveToOrtFormat 14. onnxruntime/core/mlas/lib/platform.cpp [BUILDING on AIX issue] POWER10 released enablement for AIX 15. onnxruntime/core/mlas/lib/power/qgemm_kernel_power10.cpp [BUILDING on AIX issue]Handling of __vector under AIX+clang 16. onnxruntime/core/mlas/lib/qgemm.h [BUILDING on AIX issue] Adding _AIX flag 17. onnxruntime/core/mlas/lib/qlmul.cpp [BUILDING on AIX issue] Handling of __vector under AIX+clang 18. onnxruntime/core/optimizer/attention_fusion.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 19. onnxruntime/core/optimizer/compute_optimizer/shared_utils.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 20. onnxruntime/core/optimizer/constant_folding.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 21. onnxruntime/core/optimizer/embed_layer_norm_fusion.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 22. 
onnxruntime/core/optimizer/nchwc_transformer.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 23. onnxruntime/core/optimizer/qdq_transformer/avx2_weight_s8_to_u8.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 24. onnxruntime/core/optimizer/qdq_transformer/qdq_s8_to_u8.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 25. onnxruntime/core/optimizer/qdq_transformer/s8_to_u8.h [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 26. onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_actions.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 27. onnxruntime/core/optimizer/reshape_fusion.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 28. onnxruntime/core/optimizer/stft_decomposition.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 29. onnxruntime/core/optimizer/transpose_optimization/ort_optimizer_api_impl.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 30. onnxruntime/core/platform/path_lib.h [BUILDING on AIX issue] Moving to normal function call, instead of template 31. onnxruntime/core/platform/posix/env.cc [BUILDING on AIX issue]Blocking syscall.h in AIX 32. onnxruntime/core/session/inference_session.cc [Big endian issue] Removing ORT_RETURN_IF_NOT, FLATBUFFERS_LITTLEENDIAN 33. onnxruntime/test/flatbuffers/flatbuffer_utils_test.cc [Big endian issue] Call ConvertRawDataInTensorProto in CreateInitializer and ExternalWriteReadWithLoadInitializers 34. onnxruntime/test/framework/sparse_kernels_test.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 35. onnxruntime/test/framework/tensorutils_test.cc [Big endian issue] Helper method ConvertEndianessForVector and call this from required place. 36. 
onnxruntime/test/framework/test_tensor_loader.cc o. [BUILDING on AIX issue] Handling of getcwd for AIX o. [Big endian issue] Bytes Swapping in run_external_data_test 37. onnxruntime/test/onnx/main.cc [Big endian issue] including <thread> for AIX 38. onnxruntime/test/onnx/tensorprotoutils.cc [Big endian issue] Bytes swapping in UnpackTensorWithRawData 39. onnxruntime/test/optimizer/graph_transform_test.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 40. onnxruntime/test/optimizer/graph_transform_test_builder.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 41. onnxruntime/test/optimizer/graph_transform_test_builder.h [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 42. onnxruntime/test/optimizer/initializer_test.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 43. onnxruntime/test/optimizer/nchwc_optimizer_test.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 44. onnxruntime/test/providers/base_tester.cc [Big endian issue] Use util function SetRawDataInTensorProto, instead of set_raw_data 45. onnxruntime/test/providers/cpu/generator/random_test.cc [BUILDING on AIX issue] Adding AIX check in MultinomialGoodCase --------- Co-authored-by: Vamshikrishna Thatikonda <vamshikrishna@in.ibm.com>	2024-07-17 12:37:06 -07:00
Qingnan Duan	80b56feb41	Implement FlashAttention for CPU (#20805) ### Description Implement [FlashAttention](https://arxiv.org/pdf/2205.14135) and [FlashAttention-2](https://arxiv.org/pdf/2307.08691) for MultiHeadAttention on CPU. ### Motivation and Context Accelerate the execution of MultiHeadAttention. Current performance: 10ms vs 16ms (com.microsoft.MultiHeadAttention) on my Linux machine and 10ms vs 38ms (com.microsoft.MultiHeadAttention) on my Windows machine. May need further optimizations. --------- Co-authored-by: Tianlei Wu <tlwu@microsoft.com> Co-authored-by: Qingnan Duan <qiduan@microsoft.com>	2024-07-11 14:19:59 -07:00
Edward Chen	20cd3394fc	[MLAS] AArch64 SQNBitGemm CompInt8 initial multi-row implementation (#21193) Update AArch64 SQNBitGemm CompInt8 kernels to process the matrix in tiles. E.g., computing the output in 2x2 tiles allows us to compute four elements of the output with one read of two rows of A and two columns of B. Also moved some code around, as it was getting too big for a single file.	2024-07-10 15:39:26 -07:00
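The 2x2 tiling idea in the entry above can be illustrated with a plain Python matrix multiply. This is a sketch of the access pattern only, not the AArch64 kernel: the function name `matmul_2x2_tiles` is hypothetical, the data is floating point rather than block-quantized, and M and N are assumed to be even so that no remainder tiles are needed.

```python
def matmul_2x2_tiles(a, b):
    """Multiply a (M x K) by b (K x N), producing the output in 2x2 tiles.

    Assumes M and N are even; a real kernel also handles leftover rows
    and columns with narrower tiles.
    """
    m, k, n = len(a), len(a[0]), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(0, m, 2):
        for j in range(0, n, 2):
            c00 = c01 = c10 = c11 = 0.0
            for p in range(k):
                # One read of two rows of A and two columns of B ...
                a0, a1 = a[i][p], a[i + 1][p]
                b0, b1 = b[p][j], b[p][j + 1]
                # ... feeds four output accumulators.
                c00 += a0 * b0
                c01 += a0 * b1
                c10 += a1 * b0
                c11 += a1 * b1
            c[i][j], c[i][j + 1] = c00, c01
            c[i + 1][j], c[i + 1][j + 1] = c10, c11
    return c


result = matmul_2x2_tiles([[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]])
assert result == [[19.0, 22.0], [43.0, 50.0]]
```

Per step of the inner loop, the tile loads four values and performs four multiply-accumulates, whereas computing the same four outputs one at a time would load eight values for the same arithmetic.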
Edward Chen	a39f8862fd	SQNBitGemm - move workspace size calculation functions to hardware-specific implementations (#20757) The workspace usage may be hardware-specific. Moving away from a common workspace size calculation allows more flexibility in the hardware-specific implementations.	2024-05-22 15:12:17 -07:00
liqun Fu	cc26b2dac2	Mlas Gemm 4bit avx2, avx512, and avx512vnni kernels (#20163 ) ### Description ``` Avx2: Int8 NS(Prompt) MLAS(Prompt) MLAS(Prompt)Gain/Loss NS(TokenGen) MLAS(TokenGen) MLAS(TokenGen)Gain/Loss Blklen16: 90.96 25.15 -72% 7.65 11.71 53% Blklen32: 90.73 48.55 -46% 7.86 14.28 81% Blklen64: 89.49 68.84 -23% 8.30 15.78 90% Blklen128: 87.38 78.37 -10% 7.90 16.05 103% Blklen256: 89.45 82.36 -7% 8.30 16.56 99% Fp32 NS(Prompt) MLAS(Prompt) MLAS(Prompt)Gain/Loss NS(TokenGen) MLAS(TokenGen) MLAS(TokenGen)Gain/Loss Blklen16: 91.36 105.18 15% 7.57 9.52 25% Blklen32: 89.30 105.99 18% 7.65 9.68 26% Blklen64: 89.53 101.41 13% 7.97 9.84 23% Blklen128: 85.23 99.71 16% 7.86 10.39 32% Blklen256: 88.46 97.94 10% 8.32 10.23 22% Avx512vnni: Int8 NS(Prompt) MLAS(Prompt) MLAS(Prompt)Gain/Loss NS(TokenGen) MLAS(TokenGen) MLAS(TokenGen)Gain/Loss Blklen16: 132.18 21.56 -83% 10.34 11.48 11% Blklen32: 168.28 43.69 -74% 11.85 14.73 24% Blklen64: 201.81 60.29 -70% 12.36 15.47 25% Blklen128: 194.92 57.04 -71% 13.03 14.67 12% Blklen256: 218.76 70.20 -68% 13.33 16.31 22% Fp32 NS(Prompt) MLAS(Prompt) MLAS(Prompt)Gain/Loss NS(TokenGen) MLAS(TokenGen) MLAS(TokenGen)Gain/Loss Blklen16: 102.81 92.74 -9% 8.41 9.18 9% Blklen32: 109.49 97.08 -11% 8.83 11.51 30% Blklen64: 104.13 101.57 -2% 9.32 12.00 28% Blklen128: 108.45 103.69 -4% 9.58 12.45 29% Blklen256: 109.43 106.43 -2% 9.19 12.2 32% ``` --------- Signed-off-by: Liqun Fu <liqfu@microsoft.com> Signed-off-by: liqunfu <liqun.fu@microsoft.com> Co-authored-by: edgchen1 <18449977+edgchen1@users.noreply.github.com>	2024-04-25 21:30:50 -07:00
Yi-Hong Lyu	6b6a62fb40	Add vectorized AVX512F kernel for ReduceMaximumF32Kernel (#20268 ) ### Description <!-- Describe your changes. --> This commit introduces a new vectorized AVX512F kernel, MlasReduceMaximumF32KernelAvx512F, which efficiently computes the maximum value of the supplied buffer. Additionally, microbenchmarks have been added for MlasComputeSoftmax (inplace), MlasReduceMaximumF32KernelAvx, MlasComputeSumExpF32KernelAvx512F, and MlasComputeSoftmaxOutputF32KernelAvx. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> The goal of this commit is to enhance the performance of ReduceMaximumF32Kernel on CPUs with AVX512F instruction support. \| AVX \| \| \| AVX512 \| \| \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- name \| iterations \| real_time \| cpu_time \| iterations \| real_time \| cpu_time \| time_unit REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:3/real_time \| 271277304 \| 2.58095 \| 2.58091 \| 263338132 \| 2.65661 \| 2.65661 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:3/real_time \| 271220477 \| 2.58095 \| 2.58095 \| 263509929 \| 2.65652 \| 2.65649 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:3/real_time \| 271240587 \| 2.58064 \| 2.58064 \| 263479542 \| 2.65671 \| 2.65665 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:3/real_time \| 271227745 \| 2.58083 \| 2.58079 \| 263402506 \| 2.65657 \| 2.65657 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:3/real_time \| 271255069 \| 2.58073 \| 2.58071 \| 263463858 \| 2.65682 \| 2.65682 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:3/real_time \| 271257174 \| 2.58058 \| 2.58052 \| 263460120 \| 2.65682 \| 2.65682 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:4/real_time \| 174395051 \| 4.01401 \| 4.01401 \| 197330481 \| 3.5465 \| 3.54636 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:4/real_time \| 174645502 \| 3.99691 \| 3.99691 \| 197474831 \| 3.54298 \| 3.54278 \| ns 
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:4/real_time \| 174523308 \| 4.01391 \| 4.01386 \| 197389981 \| 3.54518 \| 3.54506 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:4/real_time \| 174779200 \| 3.99874 \| 3.99874 \| 197519075 \| 3.54227 \| 3.54209 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:4/real_time \| 174642874 \| 4.00645 \| 4.00641 \| 197642101 \| 3.54195 \| 3.54188 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:4/real_time \| 174546754 \| 4.0061 \| 4.00608 \| 197621033 \| 3.54296 \| 3.54281 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:5/real_time \| 162752651 \| 4.30119 \| 4.30114 \| 215552503 \| 3.24767 \| 3.24752 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:5/real_time \| 162717463 \| 4.30123 \| 4.30116 \| 215541082 \| 3.24711 \| 3.24695 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:5/real_time \| 162718819 \| 4.3016 \| 4.30153 \| 215589239 \| 3.24725 \| 3.24708 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:5/real_time \| 162719596 \| 4.30151 \| 4.30145 \| 215563846 \| 3.24956 \| 3.24949 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:5/real_time \| 162753333 \| 4.30125 \| 4.30125 \| 215537315 \| 3.24924 \| 3.24908 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:5/real_time \| 162752258 \| 4.3014 \| 4.30141 \| 215526482 \| 3.24744 \| 3.24735 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:7/real_time \| 143579660 \| 4.87526 \| 4.87516 \| 100000000 \| 5.25767 \| 5.25752 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:7/real_time \| 143585097 \| 4.87476 \| 4.87467 \| 100000000 \| 5.41583 \| 5.41567 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:7/real_time \| 143571011 \| 4.87506 \| 4.87503 \| 182359467 \| 3.83773 \| 3.83764 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:7/real_time \| 143587142 \| 4.87487 \| 4.8748 \| 182397261 \| 3.83807 \| 3.8379 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:7/real_time \| 143578465 \| 4.87525 \| 4.87521 \| 182428602 \| 3.83777 \| 3.83768 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:7/real_time 
\| 143588555 \| 4.87491 \| 4.87488 \| 125280452 \| 5.59791 \| 5.59766 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:9/real_time \| 284851058 \| 2.43476 \| 2.43476 \| 156879863 \| 4.42895 \| 4.42884 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:9/real_time \| 270700898 \| 2.59031 \| 2.59024 \| 157953114 \| 4.42995 \| 4.42968 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:9/real_time \| 282871172 \| 2.45385 \| 2.45385 \| 157801156 \| 4.42817 \| 4.42804 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:9/real_time \| 285307738 \| 2.47009 \| 2.47005 \| 158058507 \| 4.4279 \| 4.42786 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:9/real_time \| 285709536 \| 2.45481 \| 2.45476 \| 158070961 \| 4.42809 \| 4.42799 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:9/real_time \| 285449733 \| 2.47495 \| 2.47491 \| 158069718 \| 4.45026 \| 4.45017 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:11/real_time \| 189213618 \| 3.79684 \| 3.79676 \| 139459497 \| 5.01882 \| 5.01871 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:11/real_time \| 185600468 \| 3.76394 \| 3.76376 \| 139444892 \| 5.01922 \| 5.01905 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:11/real_time \| 184968668 \| 3.80636 \| 3.80636 \| 139470834 \| 5.01948 \| 5.01936 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:11/real_time \| 183867226 \| 3.80432 \| 3.80427 \| 139481986 \| 5.01975 \| 5.01944 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:11/real_time \| 184301650 \| 3.81634 \| 3.81634 \| 139452846 \| 5.01983 \| 5.01972 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:11/real_time \| 186215795 \| 3.82659 \| 3.82654 \| 139497736 \| 5.02119 \| 5.02113 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:13/real_time \| 135622415 \| 5.16256 \| 5.16252 \| 124661337 \| 5.61227 \| 5.61194 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:13/real_time \| 135618907 \| 5.15967 \| 5.1596 \| 124805224 \| 5.6088 \| 5.60854 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:13/real_time \| 135612192 \| 5.15506 \| 5.15501 \| 124803221 
\| 5.60901 \| 5.60869 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:13/real_time \| 135906082 \| 5.15818 \| 5.15818 \| 124776601 \| 5.60898 \| 5.60886 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:13/real_time \| 135369523 \| 5.15709 \| 5.15682 \| 124790370 \| 5.60927 \| 5.60902 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:13/real_time \| 135596827 \| 5.1603 \| 5.1603 \| 124792145 \| 5.61637 \| 5.61614 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:15/real_time \| 110947137 \| 5.96511 \| 5.96495 \| 112861522 \| 6.20035 \| 6.20014 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:15/real_time \| 118004792 \| 6.22645 \| 6.22628 \| 112909900 \| 6.20073 \| 6.20073 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:15/real_time \| 112630319 \| 6.25564 \| 6.25552 \| 112874563 \| 6.19932 \| 6.19924 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:15/real_time \| 117403034 \| 6.17263 \| 6.17258 \| 112927318 \| 6.19866 \| 6.19842 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:15/real_time \| 108921863 \| 6.48624 \| 6.48612 \| 112927746 \| 6.20057 \| 6.20026 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:15/real_time \| 110358148 \| 6.66805 \| 6.66789 \| 112907312 \| 6.19938 \| 6.19908 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:16/real_time \| 203419574 \| 3.4415 \| 3.44137 \| 237134525 \| 2.95649 \| 2.95638 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:16/real_time \| 203414035 \| 3.4411 \| 3.44099 \| 237129564 \| 2.95178 \| 2.95171 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:16/real_time \| 203404068 \| 3.44157 \| 3.44151 \| 236981704 \| 2.9518 \| 2.95167 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:16/real_time \| 203391471 \| 3.44146 \| 3.44137 \| 237108807 \| 2.95203 \| 2.95196 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:16/real_time \| 203393801 \| 3.44131 \| 3.44127 \| 237126460 \| 2.95278 \| 2.95272 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:16/real_time \| 203407476 \| 3.44181 \| 3.44162 \| 237154444 \| 2.95293 \| 2.9528 \| ns 
REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:500/real_time \| 37551439 \| 18.6407 \| 18.6407 \| 39222534 \| 17.858 \| 17.8571 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:500/real_time \| 37544097 \| 18.6404 \| 18.6401 \| 39174151 \| 17.8539 \| 17.8536 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:500/real_time \| 37549837 \| 18.6391 \| 18.6391 \| 39233956 \| 17.8507 \| 17.8505 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:500/real_time \| 45996345 \| 15.2157 \| 15.2153 \| 39285929 \| 17.848 \| 17.8474 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:500/real_time \| 46012429 \| 15.2184 \| 15.2179 \| 65664865 \| 10.7366 \| 10.7364 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:500/real_time \| 45912375 \| 15.2349 \| 15.2346 \| 65205908 \| 10.8498 \| 10.8492 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:4/D:2000/real_time \| 9493955 \| 73.7232 \| 73.7203 \| 10188090 \| 68.7931 \| 68.7908 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:8/D:2000/real_time \| 9495562 \| 73.7173 \| 73.7173 \| 10180895 \| 68.7533 \| 68.7511 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:16/D:2000/real_time \| 9487371 \| 73.7852 \| 73.7831 \| 10164473 \| 68.7279 \| 68.725 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:32/D:2000/real_time \| 10816047 \| 64.7322 \| 64.7287 \| 10168481 \| 68.8109 \| 68.8096 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:64/D:2000/real_time \| 10808802 \| 64.7232 \| 64.721 \| 19478320 \| 36.1471 \| 36.1461 \| ns REDUCEMAXIMUMF32KERNEL[]/ByteAligned:128/D:2000/real_time \| 10818192 \| 64.7304 \| 64.728 \| 19419672 \| 35.9635 \| 35.9635 \| ns	2024-04-16 13:52:43 -07:00
Rachel Guo	6b305f95e0	Support xcframework for mac catalyst builds. (#19534) ### Motivation and Context MAUI on macOS uses mac-catalyst, which requires a different native binary. --------- Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net> Co-authored-by: Scott McKay <skottmckay@gmail.com>	2024-03-20 10:55:19 -07:00
snadampal	77da2ef278	[aarch64] Add Sbgemm kernel to accelerate fp32 tensor matmul with bfloat16 (#17031) ### Description This PR adds SbgemmKernel for aarch64. This includes the Sbgemm kernel, which implements matrix multiplication with bfloat16 SIMD instructions (bfmmla), and MatMul operator changes to invoke the Sbgemm kernel. To enable the Sbgemm kernel, set the following session option: "kOrtSessionOptionsGemmFastMathMode" The PR also adds new test cases for mlas and ort. ### Motivation and Context This is to improve MatMul performance on the aarch64 platform. I ran the benchmarking script below (bert, roberta and gpt2 model inference) on an AWS Graviton3 based c7g.4xl instance and observed a 1.2x to 1.76x performance improvement compared to the sgemm (fp32) kernel. ``` cd onnxruntime/python/tools/transformers python3 benchmark.py ``` The unit test precision results match the sgemm kernel results. `./build.sh --config RelWithDebInfo --build_shared_lib --parallel --compile_no_warning_as_error --skip_submodule_sync `	2024-01-22 14:43:06 -08:00
luoyu-intel	459c750b03	Update x64 template kernel library for 'sqnbitgemm' (#19016) ### Description 1. Make the JBLAS code an external module of ORT. 2. Move q4 gemm code to contrib_ops. 3. Update template kernel library to v0.1 release. ### Motivation and Context We found that the current LLM model performance is far below our expectations. Here is some performance data collected on the Mistral-7B model with Xeon-8480: 8 threads \| prompt length=32 past_len=32 \| prompt length=1 past_len=32 -- \| -- \| -- ORT-main \| 1220ms \| 263ms Neural-speed \| 564ms \| 87ms ORT-this PR\|597ms\|120ms Although `Neural-speed` and `ORT-this PR` use the same int4 kernel code, there is a 33ms (87ms vs. 120ms) latency gap between the two frameworks. Profiling shows that the total latency of `MatMulNBits` is 86.7ms, while the total latency of all int4 GEMMs in `Neural-speed` is 84.8ms. So other OPs introduce an extra 30ms latency. The performance of MatMulNBits in this PR meets our expectations. ### Remaining Issues 1. For hybrid CPUs, like the Core 12900K, the ONNXRuntime thread pool uses TaskGranularityFactor to scale its number of threads. This is not expected in our code design. It may slow down hybrid CPU performance by 30~40%. 2. Prepack uses a single thread, which makes session initialization very slow. 3. MatMulNBits with zero points will fall through to COMP_FP32 even with accuracy_level=4. Our COMP_INT8 IGemmCore path with zero points is not optimized for now. It will be updated in the future. So, for an int4 model with zero points, it makes no difference whether accuracy_level is 0 or 4.	2024-01-18 13:16:34 -08:00
Edward Chen	150c4cb8fe	[MLAS AArch64] SQNBitGemm CompInt8 kernel (#18953 ) Implement ARM NEON SQNBitGemm kernel that first block quantizes A to int8 and then does int8 multiplication.	2024-01-12 17:58:08 -08:00
luoyu-intel	5f00bc9931	Integrate high-performance x64 gemm library to MLAS (#17669 ) ### Description Improve MLAS to support high-performance x64 INT4 kernels ### Motivation and Context 1. improve LLM inference performance on Intel CPUs. 2. support more 4bit quantization types: nf4, fp4 3. support dynamic block size: block size aligned with kernel's tiling size(e.g. 4 for VNNI kernel), per channel on N dimension 4. support most Intel ISAs: avx2, avx_vnni, avx512f, avx512_vnni, amx_bf16, amx_int8, avx512_fp16 5. support MatMulNBits' data format ### Tasks - [x] support block_size: 32, 128, -1(per channel) - [x] get weight pack size without memory allocation - [x] use ort's thread pool for parallelism - [x] support ISAs: avx2, avx512f, avx_vnni, avx512_vnni, amx_int8 ### Benchmark Ubuntu 20.22 + Intel(R) Xeon(R) Platinum 8480+ 56 cores Benchmark \| Time \| CPU \| Iterations -- \| -- \| -- \| -- Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:4096/Threads:56/real_time \| 47613 \| 47401 \| 12970 Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:4096/Threads:56/real_time \| 6347792 \| 6317562 \| 109 Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:4096/Threads:56/real_time \| 11814014 \| 11757847 \| 59 Q4GEMM_Jblas/Q4G128SymInt8/M:1/N:4096/K:4096/Threads:56/real_time \| 50222 \| 50031 \| 13759 Q4GEMM_Jblas/Q4G128SymInt8/M:1024/N:4096/K:4096/Threads:56/real_time \| 2038222 \| 2028743 \| 341 Q4GEMM_Jblas/Q4G128SymInt8/M:2048/N:4096/K:4096/Threads:56/real_time \| 3792832 \| 3774485 \| 191 Q4GEMM_Jblas/Q4GPerNSymInt8/M:1/N:4096/K:4096/Threads:56/real_time \| 58717 \| 58501 \| 11467 Q4GEMM_Jblas/Q4GPerNSymInt8/M:1024/N:4096/K:4096/Threads:56/real_time \| 1360846 \| 1354598 \| 543 Q4GEMM_Jblas/Q4GPerNSymInt8/M:2048/N:4096/K:4096/Threads:56/real_time \| 2564232 \| 2551365 \| 266 Q4GEMM_Jblas/Q4G32SymFp32/M:1/N:4096/K:4096/Threads:56/real_time \| 57929 \| 57694 \| 12047 Q4GEMM_Jblas/Q4G32SymFp32/M:1024/N:4096/K:4096/Threads:56/real_time \| 5495330 \| 5465810 \| 126 
Q4GEMM_Jblas/Q4G32SymFp32/M:2048/N:4096/K:4096/Threads:56/real_time \| 10676240 \| 10617817 \| 66 Q4GEMM_Jblas/Q4G128SymFp32/M:1/N:4096/K:4096/Threads:56/real_time \| 68305 \| 68047 \| 10026 Q4GEMM_Jblas/Q4G128SymFp32/M:1024/N:4096/K:4096/Threads:56/real_time \| 5504862 \| 5476215 \| 126 Q4GEMM_Jblas/Q4G128SymFp32/M:2048/N:4096/K:4096/Threads:56/real_time \| 11758623 \| 11697337 \| 66 Q4GEMM_Jblas/Q4GPerNSymFp32/M:1/N:4096/K:4096/Threads:56/real_time \| 67713 \| 67451 \| 10298 Q4GEMM_Jblas/Q4GPerNSymFp32/M:1024/N:4096/K:4096/Threads:56/real_time \| 5508325 \| 5480237 \| 126 Q4GEMM_Jblas/Q4GPerNSymFp32/M:2048/N:4096/K:4096/Threads:56/real_time \| 10738528 \| 10681656 \| 64 Q4GEMM_Jblas/Q4G32AsymFp32/M:1/N:4096/K:4096/Threads:56/real_time \| 60708 \| 60486 \| 11321 Q4GEMM_Jblas/Q4G32AsymFp32/M:1024/N:4096/K:4096/Threads:56/real_time \| 5523784 \| 5495736 \| 126 Q4GEMM_Jblas/Q4G32AsymFp32/M:2048/N:4096/K:4096/Threads:56/real_time \| 10829633 \| 10772161 \| 67 Reference: Benchmark \| Time \| CPU \| Iterations -- \| -- \| -- \| -- Q4GEMM/Q4Sym/M:1/N:4096/K:4096/Threads:56/real_time \| 53088 \| 52911 \| 13364 Q4GEMM/Q4Sym/M:1024/N:4096/K:4096/Threads:56/real_time \| 6268981 \| 6230335 \| 110 Q4GEMM/Q4Sym/M:2048/N:4096/K:4096/Threads:56/real_time \| 11701237 \| 11632339 \| 59 Win11+12900K 8 cores: Benchmark \| Time \| CPU \| Iterations -- \| -- \| -- \| -- Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:4096/Threads:8/real_time \| 215976 \| 211295 \| 2884 Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:4096/Threads:8/real_time \| 60960590 \| 60937500 \| 10 Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:4096/Threads:8/real_time \| 1.18E+08 \| 1.19E+08 \| 5 Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:11008/K:4096/Threads:8/real_time \| 470377 \| 453059 \| 1414 Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:11008/K:4096/Threads:8/real_time \| 1.54E+08 \| 1.53E+08 \| 5 Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:11008/K:4096/Threads:8/real_time \| 3.18E+08 \| 3.13E+08 \| 2 
Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:11008/Threads:8/real_time \| 569072 \| 559398 \| 1229 Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:11008/Threads:8/real_time \| 1.54E+08 \| 1.52E+08 \| 4 Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:11008/Threads:8/real_time \| 3.22E+08 \| 3.28E+08 \| 2 Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:11008/K:11008/Threads:8/real_time \| 1486055 \| 1473325 \| 403 Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:11008/K:11008/Threads:8/real_time \| 4.14E+08 \| 4.14E+08 \| 2 Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:11008/K:11008/Threads:8/real_time \| 8.88E+08 \| 8.59E+08 \| 1 --------- Signed-off-by: Mengni Wang <mengni.wang@intel.com> Co-authored-by: Mengni Wang <mengni.wang@intel.com>	2023-12-19 09:36:31 -08:00
junchao-loongson	4abec9749e	[mlas] add loongarch lsx and lasx optimize code (#17937 ) ### Description Hello, we (@lixing-star) are developers on the Loongson team. We add 128-bit (lsx) and 256-bit (lasx) vector optimization code for the loongarch architecture. [100% tests passed, 0 tests failed out of 7](https://cloud.a-boat.cn:2021/api/public/dl/6831z1Bi?inline=true) ### Development Environment ``` CPU: Loongson-3C5000L uname -a: Linux localhost.localdomain 4.19.190-6.4.lns8.loongarch64 #1 SMP Thu Jul 14 12:08:04 CST 2022 loongarch64 loongarch64 loongarch64 GNU/Linux ``` ### LoongArch Documents - [LoongArch Reference Manual - Volume 1: Basic Architecture: This manual describes the basic part of the LoongArch architecture.](https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-EN.html) - [LoongArch ELF psABI: This manual describes the LoongArch ELF psABI.](https://loongson.github.io/LoongArch-Documentation/LoongArch-ELF-ABI-EN.html) - [more](https://loongson.github.io/LoongArch-Documentation/README-EN.html)	2023-12-07 11:15:59 -08:00
Edward Chen	0a4d76d98b	MLAS AArch64 quantized int4 Gemm kernel (#18031 ) - Implement MLAS function for quantized 4-bit int Gemm (Gemm with float A and quantized 4-bit int B) for ARM NEON. This is an initial implementation. Only the M=1 path (with M being number of rows of A and C) has any optimization attempted so far. More optimization to come in future PRs. - Connect MatMulNBits contrib op to MLAS function.	2023-11-15 09:31:54 -08:00
snadampal	d88d52eead	[aarch64] Remove mmla kernel support from apple (#18082 ) ### Description The mmla kernels require additional ISA flags and are currently supported only on Linux. ### Motivation and Context More context is in https://github.com/microsoft/onnxruntime/pull/15270 cc: @skottmckay , @chenfucn , @snnn	2023-10-25 11:34:57 -07:00
snadampal	780ee186d7	[aarch64] Implement QGEMM kernels with UMMLA/SMMLA instructions (#17160 ) ### Description This PR adds UMMLA and SMMLA based QGEMM kernels for aarch64. This covers (i) symmetric quantization (zero point is zero), (ii) asymmetric quantization (zero point is non-zero), (iii) per-channel as well as per-tensor quantization, (iv) signed weights (U8S8 Gemm), (v) unsigned weights (U8U8 Gemm), and (vi) signed activations and weights (S8S8 Gemm) scenarios. I've enabled the ummla/smmla kernels based on a cpuinfo check for `I8MM` support. MMLA QGEMM kernels are enabled for all devices that support I8MM instructions. ### Motivation and Context This is to improve INT8 quantized MatMul performance on the aarch64 platform. I have run the below benchmarking script (bert, roberta and gpt2 model inference) on an AWS Graviton3 based c7g.4xl instance and observed up to 1.33x performance improvement compared to the optimized UDOT qgemm kernel performance. ``` cd onnxruntime/python/tools/transformers python3 benchmark.py ``` I have also run the unit tests and made sure they all pass. ``` ./build.sh --config RelWithDebInfo --build_shared_lib --parallel --compile_no_warning_as_error --skip_submodule_sync ```	2023-10-24 07:49:04 +10:00
MistEO	870b0bc305	Fix typo of cmake (#17715 ) This caused a cmake configuration error.	2023-09-27 11:48:46 -07:00
Chen Fu	3c10f027de	4b quantization for weights of LLMs (#16833 ) ### Description Blockwise 4b quantization for LLMs. 1. Introduces 4b block-wise quantization for linear layer weights. 2. Implements a matrix multiplication kernel for fp32 x int4. 3. Implements the special operator MatMulFpQ4. 4. Implements a quantization tool that converts MatMul operators to MatMulFpQ4 when the right-hand side is a 2D const tensor. ### Motivation and Context Compress and accelerate LLMs \|Benchmark \| Time(ns)\| \|-------------\|----------\| \|Q4GEMM/Q4Sym/M:1/N:4096/K:4096/Threads:8\| 218054\| \|Q4GEMM/Q4Sym/M:1024/N:4096/K:4096/Threads:8\| 35830155\| \|Q4GEMM/Q4Sym/M:2048/N:4096/K:4096/Threads:8\| 73479790\| \|Q4GEMM/Q4Zp8/M:1/N:4096/K:4096/Threads:8\| 270152\| \|Q4GEMM/Q4Zp8/M:1024/N:4096/K:4096/Threads:8\| 35826721\| \|Q4GEMM/Q4Zp8/M:2048/N:4096/K:4096/Threads:8\| 73021200\| \|Q4GEMM/Q4Sym128/M:1/N:4096/K:4096/Threads:8\| 213832\| \|Q4GEMM/Q4Sym128/M:1024/N:4096/K:4096/Threads:8\| 36749874\| \|Q4GEMM/Q4Sym128/M:2048/N:4096/K:4096/Threads:8\| 72618120\| \|Benchmark \| Time(ns)\| \|-------------\|----------\| \|SGEMM/LLM/M:1/N:4096/K:4096/Threads:8\| 522610\| \|SGEMM/LLM/M:1024/N:4096/K:4096/Threads:8\| 39237689\| \|SGEMM/LLM/M:2048/N:4096/K:4096/Threads:8\| 75983467\| --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2023-08-07 12:23:55 -07:00
Yi Zhang	2e214d6e27	Workaround to upgrade VS2022 for Windows ARM build (#16826 ) ### Description ### Motivation and Context It should be reverted when VS2022 is upgraded to 17.7 or above. ### Verification https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=331401&view=logs&j=7517abfd-115a-5c61-78a0-7ba3c9e3a88d	2023-07-25 08:35:52 +08:00
Dipanjan Sengupta	a461608409	Amx flag removal (#16527 ) ### Description 1. Replacing AMX intrinsics with machine code macros in QGEMM kernel. 2. Removing AMX build flags for GCC in cmake file. 3. Fixing the link time optimization (LTO) issue introduced with asm .include of an assembly file. I have moved the AMX instruction macro definitions from QgemmU8S8KernelAmxCommon.S to amx_common.h to fix the LTO issue. Note that I am also pushing the macros defined in QgemmU8S8KernelAmxCommon.S for future reference. A special thanks to @laxmansole, who helped in the development of the instruction macro definitions for AMX intrinsics and in fixing the LTO issue. ### Motivation and Context The additional AMX flag in cmake adds an extra layer of dependency on the GCC version to use the feature. These changes should allow the usage of the AMX feature with just the CPU ID check.	2023-07-13 11:19:49 -07:00
Scott McKay	697dd12f6e	Re-organize the transpose optimization and layout transformation files. (#16246 ) ### Description Split out the more basic changes from #15552 for easier review. Re-organize to clarify the structure - Separate out generic base functionality from ORT specific components - pass in handlers for internal ORT ops to Optimize - Split out layout transformation from transpose optimization - Separate out level 1 transpose optimizer - Cleanup some naming to try and clarify things like an optimizer vs. general optimization code Most of the changes are from this movement of code. Two implementation changes: - the extended handlers are queried first in GetHandler - allows the extended handlers to override the default behaviour for an ONNX operator - simplify the Optimize function to remove OptimizerMode. - `can_modify_node` is used instead of `mode` and `ignore_assigned_nodes` and a long description of the current usage is added. I don't _think_ that changes the current behavior and hopefully clarifies what happens and when, and makes the base transpose optimizer implementation more generic. ### Motivation and Context Create a cleaner separation to support adding EP specific logic next to cleanly handle where an EP has additional layout sensitive behaviour required (e.g. its Resize implementation only handles one layout).	2023-07-07 08:24:47 +10:00
Chen Fu	5c125b4366	Cfu revertamx (#16455 ) ### Description This is to revert two PRs that aim at reducing AMX toolchain requirements. Unfortunately we still have some pipeline issues. https://github.com/microsoft/onnxruntime/pull/16390 https://github.com/microsoft/onnxruntime/pull/16086 ### Motivation and Context Looks like gcc link time optimization does not work very well with inline assembly in the above PRs.	2023-06-23 09:20:23 -07:00
Dipanjan Sengupta	35fa6af428	Fix for the build break in AMX feature on Mac OS. (#16390 ) ### Description Fixing the build break issue in Apple pipeline due to AMX flag removal.	2023-06-16 21:00:41 -07:00
Dipanjan Sengupta	681a0d084d	Removing AMX build flag (#16086 ) ### Description 1. Replacing AMX intrinsics with machine code macro instructions in QGEMM kernel. 2. Removing AMX build flags for GCC in cmake file. ### Motivation and Context The additional AMX flag in cmake adds an extra layer of dependency on the GCC version to use the feature. These changes should allow the usage of the AMX feature with just the CPU ID check.	2023-06-15 11:22:59 -07:00
Changming Sun	0204594f90	Cleanup WASM cmake code (#15996 ) ### Description Remove the "onnxruntime_BUILD_WEBASSEMBLY" cmake option. Use `if (CMAKE_SYSTEM_NAME STREQUAL "Emscripten")` instead. It makes some code look more natural. For example, ```cmake if (CMAKE_SYSTEM_NAME STREQUAL "iOS" OR CMAKE_SYSTEM_NAME STREQUAL "Android" OR onnxruntime_BUILD_WEBASSEMBLY) ``` becomes ```cmake if (CMAKE_SYSTEM_NAME STREQUAL "iOS" OR CMAKE_SYSTEM_NAME STREQUAL "Android" OR CMAKE_SYSTEM_NAME STREQUAL "Emscripten") ```	2023-05-20 18:07:39 -07:00
George Nash	f2889b41c1	[AMX] Update assembler check (#15501 ) A recent commit added an assembler check if the ASM dialect was ATT. This unfortunately broke the AMX build for systems that don't have the ASM-ATT dialect. This change assumes that if CMAKE_ASM-ATT_COMPILER_ID is not found and CMAKE_ASM_COMPILER_ID is "GNU", then, based on all the other checks that have already passed, AMX is supported by the compiler and assembler. ### Description ### Motivation and Context On my build system the recent change to add the ASM-ATT version check disabled AMX code from the build. --------- Signed-off-by: George Nash <george.nash@intel.com>	2023-04-19 14:16:26 -07:00
Yateng Hong	9bb4e4bef4	Fix masm flags (#15417 ) ### Description Fix onnxruntime_mlas build failure with cmake 3.26. Updated the CMake generator expression to make sure certain compiler flags only apply to the C/CXX compiler. ### Motivation and Context CMake changed the behavior of ASM_MASM in version 3.26. See https://gitlab.kitware.com/cmake/cmake/-/issues/24639. This also fixes the issue in #15101	2023-04-07 10:20:03 -07:00
Chen Fu	605c2f4b89	Remove fp16 support from apple (#15270 ) ### Description Removing fp16 support from the Apple build. ### Motivation and Context FP16 support on ARM64 is only available from armv8.2a onward, so the clang compiler needs the compilation flag `-march=armv8.2-a+fp16`. Unfortunately, our current universal build does not support hardware specific compilation flags on cpp source files, as it would cause trouble when compiling against more than one hardware target. Until we figure out how to remove this limitation, we have to disable fp16 support for Apple systems.	2023-03-30 16:44:26 -07:00
Chen Fu	41ddcd30a1	Fp16 NHWC Max and Average Pooling (#15181 ) ### Description Max and average pooling operators for fp16, NHWC ### Motivation and Context Continue on the steps for fp16 inference support	2023-03-28 08:22:55 -07:00
Jian Chen	527e006124	Update mlas (#15228 )	2023-03-27 14:18:48 -07:00
JiCheng	126e7bf15f	[AMX] add assembler check (#15055 ) ### Description AMX isn't supported until assembler 2.40 even though the GCC frontend supports it.	2023-03-22 07:57:22 +08:00
Chen Fu	34175f0b7c	FP16 conv (#15062 ) ### Description Convolution for the fp16 datatype. Uses NHWC for computation. For NCHW input, it rearranges the input tensor to NHWC format before computing the result. Supports two optional fusions: 1. Activation 2. Add (not yet implemented) ### Motivation and Context Accelerating fp16 inference	2023-03-21 10:32:43 -07:00
Jian Chen	6891ab5bac	fix_macos (#15018 ) ### Description This fixes the macOS packaging build on the universal2 arch.	2023-03-14 21:54:44 -07:00
Chen Fu	acc2ac627f	Fp16 Activations (#14722 ) ### Description NEON fp16 SIMD implementation of Activation functions ### Motivation and Context Step 2 of fp16 SIMD support. --------- Co-authored-by: Chen Fu <fuchen@microsoft.com>	2023-02-28 17:20:40 -08:00
Chen Fu	733ca85b73	Cfu fp16 (#14538 ) ### Description FP16 GEMM, including hardware agnostic driver code, a slow C++ kernel, and ARM64 NEON kernel. ### Motivation and Context First step in creating native support of fp16 model inferencing on ARM64 and AMD64 platforms. --------- Co-authored-by: Chen Fu <fuchen@microsoft.com>	2023-02-15 12:51:53 -08:00
Chen Fu	90142899bd	Supporting Intel AMX instructions in quantized GEMM (#14042 ) ### Description Using Intel AMX int8 instructions to accelerate quantized GEMM ### Motivation and Context AMX instructions accelerate quantized GEMM significantly: Prepacked B perf numbers (latency in ns) GEMM Config \| AVX512Vnni \| AMX -- \| --: \| --: M:384/N:1024/K:1024/Batch:1/Threads:4 \| 1057511 \| 285393 M:384/N:1024/K:3072/Batch:1/Threads:4 \| 2643929 \| 700397 M:384/N:1024/K:4096/Batch:1/Threads:4 \| 3784750 \| 890701 M:384/N:4096/K:1024/Batch:1/Threads:4 \| 2378139 \| 887251 M:384/N:1024/K:1024/Batch:1/Threads:16 \| 307137 \| 138481 M:384/N:1024/K:3072/Batch:1/Threads:16 \| 855730 \| 295027 M:384/N:1024/K:4096/Batch:1/Threads:16 \| 1126878 \| 317395 M:384/N:4096/K:1024/Batch:1/Threads:16 \| 781963 \| 237014 M:1536/N:1024/K:1024/Batch:1/Threads:16 \| 538864 \| 181459 M:1536/N:1024/K:3072/Batch:1/Threads:16 \| 1681002 \| 561600 M:1536/N:1024/K:4096/Batch:1/Threads:16 \| 2158127 \| 717470 M:1536/N:4096/K:1024/Batch:1/Threads:16 \| 2428622 \| 896140 M:3072/N:1024/K:1024/Batch:1/Threads:16 \| 1058029 \| 357031 M:3072/N:1024/K:3072/Batch:1/Threads:16 \| 3138504 \| 1095857 M:3072/N:1024/K:4096/Batch:1/Threads:16 \| 4155640 \| 1386183 M:3072/N:4096/K:1024/Batch:1/Threads:16 \| 4679030 \| 1778624 Co-authored-by: Yi-Hong Lyu <yilyu@microsoft.com> Co-authored-by: Chen Fu <fuchen@microsoft.com>	2023-01-10 12:16:27 -08:00
Edward Chen	2ecd1d6622	Switch GSL to MS GSL 4.0.0 (#13416 )	2022-10-29 04:15:20 -07:00
Jack·Boos·Yu	ea004e953f	[cmake] Export multi targets in static build (#11063 ) * [cmake] Export multi targets in static build * Install more components in static build, format some code * Fix code pos	2022-04-03 22:37:18 -07:00
Chen Fu	dc72159105	Symmetric Quant indirect Conv kernel for ARMv8 A55 chip (#10862 ) The ARM a55 micro-architecture (with dot product instructions), similar to a53, is widely used as little cores in big.LITTLE configurations. A55 has narrower memory load/store hardware, where a 128b load instruction would block the pipeline for 2 whole cycles, during which no other instructions can be executed. On the other hand, a 64b load instruction can be dual-issued with many other instructions. This change adds a Symmetric Quant indirect Conv kernel for the a55 micro-architecture, where we replace ldr q4,[x1], with ldr d4,[x1], ldr x11,[x1], ins v4.d[1],x11 so that we can try to hide the memory load cycles behind computing cycles in the kernel. With this new kernel, the cartoongan model shows a significant perf improvement on Pixel5a little cores (2 threads running on two little cores): new kernel: 2188.59 ms old kernel: 2360.61 ms	2022-03-25 17:10:47 -07:00
Chen Fu	50a6f095cd	Symmetric QGEMM kernel for ARMv8 A55 chip (#10754 ) The ARM a55 micro-architecture (with dot product instructions), similar to a53, is widely used as little cores in big.LITTLE configurations. A55 has narrower memory load/store hardware, where a 128b load instruction would block the pipeline for 2 whole cycles, during which no other instructions can be executed. On the other hand, a 64b load instruction can be dual-issued with many other instructions. This change adds a Symmetric QGEMM kernel for the a55 micro-architecture, where we replace ldr q4,[x1],#16 with ldr d4,[x1],#8 ldr x11,[x1],#8 ins v4.d[1],x11 so that we can try to hide the memory load cycles behind computing cycles in the kernel. Co-authored-by: Chen Fu <fuchen@microsoft.com>	2022-03-07 08:41:13 -08:00
Maxiwell	43ff27c7c8	ppc64le: optimizing the MlasQuantizeLinear() with VSX (#10644 ) This code is valid only when -mcpu is set to utilize POWER9 technology or above. A compatible code for POWER8 was created as well, but it was not tuned for performance.	2022-03-04 14:54:56 -08:00
RajalakshmiSR	5d8c5409ab	POWER10: QGEMM optimization (#10642 ) * POWER10: QGEMM optimization This patch makes use of the POWER10 MMA feature for the QGEMM function. This optimization includes signed and unsigned cases. Tested, and there are no new failures with gcc11 and clang-14. * Changes as per review comments Co-authored-by: Rajalakshmi Srinivasaraghavan <rajis@linux.ibm.com>	2022-03-02 08:36:26 -08:00
Chen Fu	c4f1dfcfaa	Cfu s8s8 (#10413 ) Adding S8S8 kernels for symmetric quantized indirect conv and depthwise conv. Perf numbers with a single thread (baseline / new, in ms): mobilenet_edgetpu: Nokia G10 220 / 213, Pixel 4 18.5 / 17.6; cartoongan: Nokia G10 8537 / 8521, Pixel 4 967 / 928. Co-authored-by: Chen Fu <fuchen@microsoft.com>	2022-01-28 09:26:52 -08:00


131 commits