onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-06-03 23:49:44 +00:00

Author	SHA1	Message	Date
luoyu-intel	5f00bc9931	Integrate high-performance x64 gemm library to MLAS (#17669 ) ### Description Improve MLAS to support high-performance x64 INT4 kernels ### Motivation and Context 1. improve LLM inference performance on Intel CPUs. 2. support more 4bit quantization types: nf4, fp4 3. support dynamic block size: block size aligned with kernel's tiling size(e.g. 4 for VNNI kernel), per channel on N dimension 4. support most Intel ISAs: avx2, avx_vnni, avx512f, avx512_vnni, amx_bf16, amx_int8, avx512_fp16 5. support MatMulNBits' data format ### Tasks - [x] support block_size: 32, 128, -1(per channel) - [x] get weight pack size without memory allocation - [x] use ort's thread pool for parallelism - [x] support ISAs: avx2, avx512f, avx_vnni, avx512_vnni, amx_int8 ### Benchmark Ubuntu 20.22 + Intel(R) Xeon(R) Platinum 8480+ 56 cores Benchmark \| Time \| CPU \| Iterations -- \| -- \| -- \| -- Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:4096/Threads:56/real_time \| 47613 \| 47401 \| 12970 Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:4096/Threads:56/real_time \| 6347792 \| 6317562 \| 109 Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:4096/Threads:56/real_time \| 11814014 \| 11757847 \| 59 Q4GEMM_Jblas/Q4G128SymInt8/M:1/N:4096/K:4096/Threads:56/real_time \| 50222 \| 50031 \| 13759 Q4GEMM_Jblas/Q4G128SymInt8/M:1024/N:4096/K:4096/Threads:56/real_time \| 2038222 \| 2028743 \| 341 Q4GEMM_Jblas/Q4G128SymInt8/M:2048/N:4096/K:4096/Threads:56/real_time \| 3792832 \| 3774485 \| 191 Q4GEMM_Jblas/Q4GPerNSymInt8/M:1/N:4096/K:4096/Threads:56/real_time \| 58717 \| 58501 \| 11467 Q4GEMM_Jblas/Q4GPerNSymInt8/M:1024/N:4096/K:4096/Threads:56/real_time \| 1360846 \| 1354598 \| 543 Q4GEMM_Jblas/Q4GPerNSymInt8/M:2048/N:4096/K:4096/Threads:56/real_time \| 2564232 \| 2551365 \| 266 Q4GEMM_Jblas/Q4G32SymFp32/M:1/N:4096/K:4096/Threads:56/real_time \| 57929 \| 57694 \| 12047 Q4GEMM_Jblas/Q4G32SymFp32/M:1024/N:4096/K:4096/Threads:56/real_time \| 5495330 \| 5465810 \| 126 
Q4GEMM_Jblas/Q4G32SymFp32/M:2048/N:4096/K:4096/Threads:56/real_time \| 10676240 \| 10617817 \| 66 Q4GEMM_Jblas/Q4G128SymFp32/M:1/N:4096/K:4096/Threads:56/real_time \| 68305 \| 68047 \| 10026 Q4GEMM_Jblas/Q4G128SymFp32/M:1024/N:4096/K:4096/Threads:56/real_time \| 5504862 \| 5476215 \| 126 Q4GEMM_Jblas/Q4G128SymFp32/M:2048/N:4096/K:4096/Threads:56/real_time \| 11758623 \| 11697337 \| 66 Q4GEMM_Jblas/Q4GPerNSymFp32/M:1/N:4096/K:4096/Threads:56/real_time \| 67713 \| 67451 \| 10298 Q4GEMM_Jblas/Q4GPerNSymFp32/M:1024/N:4096/K:4096/Threads:56/real_time \| 5508325 \| 5480237 \| 126 Q4GEMM_Jblas/Q4GPerNSymFp32/M:2048/N:4096/K:4096/Threads:56/real_time \| 10738528 \| 10681656 \| 64 Q4GEMM_Jblas/Q4G32AsymFp32/M:1/N:4096/K:4096/Threads:56/real_time \| 60708 \| 60486 \| 11321 Q4GEMM_Jblas/Q4G32AsymFp32/M:1024/N:4096/K:4096/Threads:56/real_time \| 5523784 \| 5495736 \| 126 Q4GEMM_Jblas/Q4G32AsymFp32/M:2048/N:4096/K:4096/Threads:56/real_time \| 10829633 \| 10772161 \| 67 Reference: Benchmark \| Time \| CPU \| Iterations -- \| -- \| -- \| -- Q4GEMM/Q4Sym/M:1/N:4096/K:4096/Threads:56/real_time \| 53088 \| 52911 \| 13364 Q4GEMM/Q4Sym/M:1024/N:4096/K:4096/Threads:56/real_time \| 6268981 \| 6230335 \| 110 Q4GEMM/Q4Sym/M:2048/N:4096/K:4096/Threads:56/real_time \| 11701237 \| 11632339 \| 59 Win11+12900K 8 cores: Benchmark \| Time \| CPU \| Iterations -- \| -- \| -- \| -- Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:4096/Threads:8/real_time \| 215976 \| 211295 \| 2884 Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:4096/Threads:8/real_time \| 60960590 \| 60937500 \| 10 Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:4096/Threads:8/real_time \| 1.18E+08 \| 1.19E+08 \| 5 Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:11008/K:4096/Threads:8/real_time \| 470377 \| 453059 \| 1414 Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:11008/K:4096/Threads:8/real_time \| 1.54E+08 \| 1.53E+08 \| 5 Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:11008/K:4096/Threads:8/real_time \| 3.18E+08 \| 3.13E+08 \| 2 
Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:11008/Threads:8/real_time \| 569072 \| 559398 \| 1229 Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:11008/Threads:8/real_time \| 1.54E+08 \| 1.52E+08 \| 4 Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:11008/Threads:8/real_time \| 3.22E+08 \| 3.28E+08 \| 2 Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:11008/K:11008/Threads:8/real_time \| 1486055 \| 1473325 \| 403 Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:11008/K:11008/Threads:8/real_time \| 4.14E+08 \| 4.14E+08 \| 2 Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:11008/K:11008/Threads:8/real_time \| 8.88E+08 \| 8.59E+08 \| 1 --------- Signed-off-by: Mengni Wang <mengni.wang@intel.com> Co-authored-by: Mengni Wang <mengni.wang@intel.com>	2023-12-19 09:36:31 -08:00
junchao-loongson	4abec9749e	[mlas] add loongarch lsx and lasx optimize code (#17937) ### Description Hello, we (@lixing-star) are developers on the Loongson team. We add 128-bit (LSX) and 256-bit (LASX) vector optimization code for the LoongArch architecture. [100% tests passed, 0 tests failed out of 7](https://cloud.a-boat.cn:2021/api/public/dl/6831z1Bi?inline=true) ### Development Environment ``` CPU: Loongson-3C5000L uname -a: Linux localhost.localdomain 4.19.190-6.4.lns8.loongarch64 #1 SMP Thu Jul 14 12:08:04 CST 2022 loongarch64 loongarch64 loongarch64 GNU/Linux ``` ### LoongArch Documents - [LoongArch Reference Manual - Volume 1: Basic Architecture: This manual describes the basic part of the LoongArch architecture.](https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-EN.html) - [LoongArch ELF psABI: This manual describes the LoongArch ELF psABI.](https://loongson.github.io/LoongArch-Documentation/LoongArch-ELF-ABI-EN.html) - [more](https://loongson.github.io/LoongArch-Documentation/README-EN.html)	2023-12-07 11:15:59 -08:00
Edward Chen	0a4d76d98b	MLAS AArch64 quantized int4 Gemm kernel (#18031 ) - Implement MLAS function for quantized 4-bit int Gemm (Gemm with float A and quantized 4-bit int B) for ARM NEON. This is an initial implementation. Only the M=1 path (with M being number of rows of A and C) has any optimization attempted so far. More optimization to come in future PRs. - Connect MatMulNBits contrib op to MLAS function.	2023-11-15 09:31:54 -08:00
snadampal	d88d52eead	[aarch64] Remove mmla kernel support from apple (#18082) ### Description The mmla kernels require additional ISA flags and are currently supported only on Linux. ### Motivation and Context More context is in https://github.com/microsoft/onnxruntime/pull/15270 cc: @skottmckay, @chenfucn, @snnn	2023-10-25 11:34:57 -07:00
snadampal	780ee186d7	[aarch64] Implement QGEMM kernels with UMMLA/SMMLA instructions (#17160) ### Description This PR adds UMMLA- and SMMLA-based QGEMM kernels for aarch64. This covers the following scenarios: (i) symmetric quantization (zero point is zero), (ii) asymmetric quantization (zero point is nonzero), (iii) per-channel as well as per-tensor quantization, (iv) signed weights (U8S8 Gemm), (v) unsigned weights (U8U8 Gemm), and (vi) signed activations and weights (S8S8 Gemm). I've enabled the ummla/smmla kernels based on a cpuinfo check for `I8MM` support. MMLA QGEMM kernels are enabled for all devices that support I8MM instructions. ### Motivation and Context This is to improve INT8 quantized MatMul performance on the aarch64 platform. I ran the benchmarking script below (bert, roberta and gpt2 model inference) on an AWS Graviton3-based c7g.4xl instance and observed up to a 1.33x performance improvement compared to the optimized UDOT qgemm kernel. ``` cd onnxruntime/python/tools/transformers python3 benchmark.py ``` I have also run the unit tests and made sure all are passing. ``` ./build.sh --config RelWithDebInfo --build_shared_lib --parallel --compile_no_warning_as_error --skip_submodule_sync ```	2023-10-24 07:49:04 +10:00
MistEO	870b0bc305	Fix typo of cmake (#17715 ) This caused a cmake configuration error.	2023-09-27 11:48:46 -07:00
Chen Fu	3c10f027de	4b quantization for weights of LLMs (#16833 ) ### Description Blockwise 4b quantization for LLMs. 1. Introduce 4b block-wise quantization for linear layer weights. 2. Implements matrix multiplication kernel for fp32 x int4 3. Implements special operator MatMulFpQ4 4. Implements quantization tool, that convert MatMul operator to MatMulFpQ4, when the right hand side is 2D const tensor. ### Motivation and Context Compress and accelerate LLMs \|Benchmark \| Time(ns)\| \|-------------\|----------\| \|Q4GEMM/Q4Sym/M:1/N:4096/K:4096/Threads:8\| 218054\| \|Q4GEMM/Q4Sym/M:1024/N:4096/K:4096/Threads:8\| 35830155\| \|Q4GEMM/Q4Sym/M:2048/N:4096/K:4096/Threads:8\| 73479790\| \|Q4GEMM/Q4Zp8/M:1/N:4096/K:4096/Threads:8\| 270152\| \|Q4GEMM/Q4Zp8/M:1024/N:4096/K:4096/Threads:8\| 35826721\| \|Q4GEMM/Q4Zp8/M:2048/N:4096/K:4096/Threads:8\| 73021200\| \|Q4GEMM/Q4Sym128/M:1/N:4096/K:4096/Threads:8\| 213832\| \|Q4GEMM/Q4Sym128/M:1024/N:4096/K:4096/Threads:8\| 36749874\| \|Q4GEMM/Q4Sym128/M:2048/N:4096/K:4096/Threads:8\| 72618120\| \|Benchmark \| Time(ns)\| \|-------------\|----------\| \|SGEMM/LLM/M:1/N:4096/K:4096/Threads:8\| 522610\| \|SGEMM/LLM/M:1024/N:4096/K:4096/Threads:8\| 39237689\| \|SGEMM/LLM/M:2048/N:4096/K:4096/Threads:8\| 75983467\| --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2023-08-07 12:23:55 -07:00
Yi Zhang	2e214d6e27	Workaround to upgrade VS2022 for Windows ARM build (#16826) ### Description ### Motivation and Context It should be reverted when VS2022 is upgraded to 17.7 or above. ### Verification https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=331401&view=logs&j=7517abfd-115a-5c61-78a0-7ba3c9e3a88d	2023-07-25 08:35:52 +08:00
Dipanjan Sengupta	a461608409	Amx flag removal (#16527) ### Description 1. Replacing AMX intrinsics with machine code macros in QGEMM kernel. 2. Removing AMX build flags for GCC in cmake file. 3. Fixing the link time optimization (LTO) issue introduced with asm .include of an assembly file. I have moved the AMX instruction macro definitions from QgemmU8S8KernelAmxCommon.S to amx_common.h to fix the LTO issue. Note that I am also pushing the macros defined in QgemmU8S8KernelAmxCommon.S for future reference. A special thanks to @laxmansole, who helped in the development of the instruction macro definitions for AMX intrinsics and in fixing the LTO issue. ### Motivation and Context The additional AMX flag in cmake adds an extra layer of dependency on the GCC version to use the feature. These changes should allow the usage of the AMX feature with just the CPU ID check.	2023-07-13 11:19:49 -07:00
Scott McKay	697dd12f6e	Re-organize the transpose optimization and layout transformation files. (#16246) ### Description Split out the more basic changes from #15552 for easier review. Re-organize to clarify the structure: - Separate out generic base functionality from ORT specific components - pass in handlers for internal ORT ops to Optimize - Split out layout transformation from transpose optimization - Separate out level 1 transpose optimizer - Clean up some naming to try and clarify things like an optimizer vs. general optimization code Most of the changes are from this movement of code. Two implementation changes: - the extended handlers are queried first in GetHandler - allows the extended handlers to override the default behaviour for an ONNX operator - simplify the Optimize function to remove OptimizerMode. - `can_modify_node` is used instead of `mode` and `ignore_assigned_nodes` and a long description of the current usage is added. I don't _think_ that changes the current behavior and hopefully clarifies what happens and when, and makes the base transpose optimizer implementation more generic. ### Motivation and Context Create a cleaner separation to support adding EP specific logic next to cleanly handle where an EP has additional layout sensitive behaviour required (e.g. its Resize implementation only handles one layout).	2023-07-07 08:24:47 +10:00
Chen Fu	5c125b4366	Cfu revertamx (#16455 ) ### Description This is to revert two PRs that aim at reducing AMX toolchain requirements. Unfortunately we still have some pipeline issues. https://github.com/microsoft/onnxruntime/pull/16390 https://github.com/microsoft/onnxruntime/pull/16086 ### Motivation and Context Looks like gcc link time optimization does not work very well with inline assembly in the above PRs.	2023-06-23 09:20:23 -07:00
Dipanjan Sengupta	35fa6af428	Fix for the build break in AMX feature on Mac OS. (#16390 ) ### Description Fixing the build break issue in Apple pipeline due to AMX flag removal.	2023-06-16 21:00:41 -07:00
Dipanjan Sengupta	681a0d084d	Removing AMX build flag (#16086) ### Description 1. Replacing AMX intrinsics with machine code macro instructions in QGEMM kernel. 2. Removing AMX build flags for GCC in cmake file. ### Motivation and Context The additional AMX flag in cmake adds an extra layer of dependency on the GCC version to use the feature. These changes should allow the usage of the AMX feature with just the CPU ID check.	2023-06-15 11:22:59 -07:00
Changming Sun	0204594f90	Cleanup WASM cmake code (#15996) ### Description Remove the "onnxruntime_BUILD_WEBASSEMBLY" cmake option. Use `if (CMAKE_SYSTEM_NAME STREQUAL "Emscripten")` instead. It makes some code look more natural. For example, ```cmake if (CMAKE_SYSTEM_NAME STREQUAL "iOS" OR CMAKE_SYSTEM_NAME STREQUAL "Android" OR onnxruntime_BUILD_WEBASSEMBLY) ``` becomes ```cmake if (CMAKE_SYSTEM_NAME STREQUAL "iOS" OR CMAKE_SYSTEM_NAME STREQUAL "Android" OR CMAKE_SYSTEM_NAME STREQUAL "Emscripten") ```	2023-05-20 18:07:39 -07:00
George Nash	f2889b41c1	[AMX] Update assembler check (#15501) A recent commit added an assembler check if the ASM dialect was ATT. This unfortunately broke the AMX build for systems that don't have the ASM-ATT dialect. This change assumes that if the CMAKE_ASM-ATT_COMPILER_ID is not found and the CMAKE_ASM_COMPILER_ID is "GNU", then, based on all the other already-passed checks, AMX is supported by the compiler and assembler. ### Description ### Motivation and Context On my build system the recent change to add the ASM-ATT version check disabled AMX code from the build. --------- Signed-off-by: George Nash <george.nash@intel.com>	2023-04-19 14:16:26 -07:00
Yateng Hong	9bb4e4bef4	Fix masm flags (#15417) ### Description Fix onnxruntime_mlas build failure with cmake 3.26. Updated the CMake generator expression to make sure certain compiler flags only apply to the C/CXX compiler. ### Motivation and Context CMake changed the behavior of ASM_MASM in version 3.26. See https://gitlab.kitware.com/cmake/cmake/-/issues/24639. This also fixes the issue in #15101.	2023-04-07 10:20:03 -07:00
Chen Fu	605c2f4b89	Remove fp16 support from apple (#15270) ### Description Removing fp16 support from apple build ### Motivation and Context FP16 support on ARM64 is only available from armv8.2-a onward, so the clang compiler needs the compilation flag `-march=armv8.2-a+fp16`. Unfortunately, our current universal build does not support hardware-specific compilation flags on cpp source files, as they would cause trouble when compiling against more than one hardware target. Until we figure out how to remove this limitation, we have to disable fp16 support for Apple systems.	2023-03-30 16:44:26 -07:00
Chen Fu	41ddcd30a1	Fp16 NHWC Max and Average Pooling (#15181 ) ### Description Max and average pooling operators for fp16, NHWC ### Motivation and Context Continue on the steps for fp16 inference support	2023-03-28 08:22:55 -07:00
Jian Chen	527e006124	Update mlas (#15228) ### Description ### Motivation and Context	2023-03-27 14:18:48 -07:00
JiCheng	126e7bf15f	[AMX] add assembler check (#15055) ### Description AMX isn't supported until assembler 2.40 even though the GCC frontend supports it. ### Motivation and Context	2023-03-22 07:57:22 +08:00
Chen Fu	34175f0b7c	FP16 conv (#15062) ### Description Convolution for fp16 datatype. Uses NHWC for computation. For NCHW input, it rearranges the input tensor to NHWC format before computing the result. Supports two optional fusions: 1. Activation 2. Add (not yet implemented) ### Motivation and Context Accelerating fp16 inference	2023-03-21 10:32:43 -07:00
Jian Chen	6891ab5bac	fix_macos (#15018) ### Description This fixes the macOS packaging build on the universal2 arch. ### Motivation and Context	2023-03-14 21:54:44 -07:00
Chen Fu	acc2ac627f	Fp16 Activations (#14722 ) ### Description NEON fp16 SIMD implementation of Activation functions ### Motivation and Context Step 2 of fp16 SIMD support. --------- Co-authored-by: Chen Fu <fuchen@microsoft.com>	2023-02-28 17:20:40 -08:00
Chen Fu	733ca85b73	Cfu fp16 (#14538 ) ### Description FP16 GEMM, including hardware agnostic driver code, a slow C++ kernel, and ARM64 NEON kernel. ### Motivation and Context First step in creating native support of fp16 model inferencing on ARM64 and AMD64 platforms. --------- Co-authored-by: Chen Fu <fuchen@microsoft.com>	2023-02-15 12:51:53 -08:00
Chen Fu	90142899bd	Supporting Intel AMX instructions in quantized GEMM (#14042 ) ### Description Using Intel AMX int8 instructions to accelerate quantized GEMM ### Motivation and Context AMX instructions accelerate quantized GEMM significantly: Prepacked B perf numbers (latency in ns) GEMM Config \| AVX512Vnni \| AMX -- \| --: \| --: M:384/N:1024/K:1024/Batch:1/Threads:4 \| 1057511 \| 285393 M:384/N:1024/K:3072/Batch:1/Threads:4 \| 2643929 \| 700397 M:384/N:1024/K:4096/Batch:1/Threads:4 \| 3784750 \| 890701 M:384/N:4096/K:1024/Batch:1/Threads:4 \| 2378139 \| 887251 M:384/N:1024/K:1024/Batch:1/Threads:16 \| 307137 \| 138481 M:384/N:1024/K:3072/Batch:1/Threads:16 \| 855730 \| 295027 M:384/N:1024/K:4096/Batch:1/Threads:16 \| 1126878 \| 317395 M:384/N:4096/K:1024/Batch:1/Threads:16 \| 781963 \| 237014 M:1536/N:1024/K:1024/Batch:1/Threads:16 \| 538864 \| 181459 M:1536/N:1024/K:3072/Batch:1/Threads:16 \| 1681002 \| 561600 M:1536/N:1024/K:4096/Batch:1/Threads:16 \| 2158127 \| 717470 M:1536/N:4096/K:1024/Batch:1/Threads:16 \| 2428622 \| 896140 M:3072/N:1024/K:1024/Batch:1/Threads:16 \| 1058029 \| 357031 M:3072/N:1024/K:3072/Batch:1/Threads:16 \| 3138504 \| 1095857 M:3072/N:1024/K:4096/Batch:1/Threads:16 \| 4155640 \| 1386183 M:3072/N:4096/K:1024/Batch:1/Threads:16 \| 4679030 \| 1778624 Co-authored-by: Yi-Hong Lyu <yilyu@microsoft.com> Co-authored-by: Chen Fu <fuchen@microsoft.com>	2023-01-10 12:16:27 -08:00
Edward Chen	2ecd1d6622	Switch GSL to MS GSL 4.0.0 (#13416 )	2022-10-29 04:15:20 -07:00
Jack·Boos·Yu	ea004e953f	[cmake] Export multi targets in static build (#11063 ) * [cmake] Export multi targets in static build * Install more components in static build, format some code * Fix code pos	2022-04-03 22:37:18 -07:00
Chen Fu	dc72159105	Symmetric Quant indirect Conv kernel for ARMv8 A55 chip (#10862) The ARM a55 micro-architecture (with dot product instructions), similar to a53, is widely used for the little cores in big.LITTLE configurations. A55 has narrower memory load/store hardware, where a 128b load instruction blocks the pipeline for 2 whole cycles, during which no other instructions can be executed. On the other hand, a 64b load instruction can be dual-issued with many other instructions. This change adds a Symmetric Quant indirect Conv kernel for the a55 micro-architecture, where we replace ldr q4,[x1], with ldr d4,[x1], ldr x11,[x1], ins v4.d[1],x11 so that we can try to hide the memory load cycles behind computing cycles in the kernel. With this new kernel, the cartoongan model shows a significant perf improvement on Pixel5a little cores (2 threads running on two little cores): new kernel: 2188.59 ms; old kernel: 2360.61 ms	2022-03-25 17:10:47 -07:00
Chen Fu	50a6f095cd	Symmetric QGEMM kernel for ARMv8 A55 chip (#10754) The ARM a55 micro-architecture (with dot product instructions), similar to a53, is widely used for the little cores in big.LITTLE configurations. A55 has narrower memory load/store hardware, where a 128b load instruction blocks the pipeline for 2 whole cycles, during which no other instructions can be executed. On the other hand, a 64b load instruction can be dual-issued with many other instructions. This change adds a Symmetric QGEMM kernel for the a55 micro-architecture, where we replace ldr q4,[x1],#16 with ldr d4,[x1],#8 ldr x11,[x1],#8 ins v4.d[1],x11 so that we can try to hide the memory load cycles behind computing cycles in the kernel. Co-authored-by: Chen Fu <fuchen@microsoft.com>	2022-03-07 08:41:13 -08:00
Maxiwell	43ff27c7c8	ppc64le: optimizing the MlasQuantizeLinear() with VSX (#10644 ) This code is valid only when -mcpu is set to utilize POWER9 technology or above. A compatible code for POWER8 was created as well, but it was not tuned for performance.	2022-03-04 14:54:56 -08:00
RajalakshmiSR	5d8c5409ab	POWER10: QGEMM optimization (#10642) * POWER10: QGEMM optimization This patch makes use of the POWER10 MMA feature for the QGEMM function. This optimization includes signed and unsigned cases. Tested, and there are no new failures with gcc11 and clang-14. * Changes as per review comments Co-authored-by: Rajalakshmi Srinivasaraghavan <rajis@linux.ibm.com>	2022-03-02 08:36:26 -08:00
Chen Fu	c4f1dfcfaa	Cfu s8s8 (#10413 ) Adding S8S8 kernels for symmetric quantized indirect conv and depthwise conv. Perf number with single thread: Nokia G10 (baseline / new) in ms Pixel 4 (baseline/new) in ms mobilenet_edgetpu 220 / 213 18.5 / 17.6 cartoongan 8537 / 8521 967 / 928 Co-authored-by: Chen Fu <fuchen@microsoft.com>	2022-01-28 09:26:52 -08:00
Chen Fu	2afce4830c	Symmetric QGEMM (#10289 ) Adding code for symmetric quantized matrix multiplication. Used in quantized convolution, achieving significant perf gain. TODO, use Symmetric Quantized GEMM in other operators! TODO address activation buffer overread in custom allocators and tensors supplied by users. DOT kernel perf test: Pixel 5a: Cartoongan 513.539 ms 471.786 ms Efficient 57.5169 ms 56.4174 ms Edgetpu 14.6673 ms 13.5959 ms NEON kernel perf test Pixel 3a Cartoongan 1423.53 ms 1069.92 ms Efficient 114.086 ms 107.968 ms Edgetpu 39.2632 ms 36.9839 ms Co-authored-by: Chen Fu <fuchen@microsoft.com>	2022-01-24 10:49:04 -08:00
Yufeng Li	7208fcbe1c	use wasmscalar as default kernel (#9988 ) * use wasmscalar as default kernel	2022-01-03 10:55:08 -08:00
Changming Sun	4e9e01cb3c	Fix SDL warnings in CPU EP (#9975 )	2021-12-19 20:54:29 -08:00
Chen Fu	cd0af7ad44	Symmetric quantized convolution kernel ARM64 (#9772 ) Adding a symmetric quantized convolution kernel for ARM64 Note: Indirect conv performs worse for shallow convs (input channels are small). This is much more so for low end pre-dot CPUs, where only 128 or deeper conv is faster with indirect conv. With DOT-CPUs, 32 deep conv is already faster Co-authored-by: Chen Fu <fuchen@microsoft.com>	2021-12-13 21:14:45 -08:00
Yi-Hong Lyu	f60a287a64	Add __x86.get_pc_thunk.bx to avoid dependency (#9955 )	2021-12-08 04:50:41 -08:00
Yufeng Li	e613019174	add s8s8 support for quantized conv and gemm (#9902 ) * add s8s8 support for quantized conv and gemm	2021-12-03 14:55:18 -08:00
RajalakshmiSR	8564fc1933	POWER10: Add optimized dgemm kernel (#9652 ) * POWER10: Add optimized dgemm kernel This patch makes use of POWER10 matrix multiply assist feature and adds new DGEMM kernel. * Indentation update Co-authored-by: Rajalakshmi Srinivasaraghavan <rajis@linux.ibm.com>	2021-11-22 20:28:21 -08:00
Zhang Lei	8ef6aff734	Zhalei/dwqconv3x3 5x5 arm64 (#9714) * Arm64 Depthwise Convolution 3x3. * Add 5x5 intrinsic dwqconv for arm64 * rebase to master, remove no-need logic after arm64 convsym enabled. * Some more adjustment on the instruction pipelining. * Add specific test cases. * Fix test dimension too small. * Fix build warning as error on some CI. * better format, etc.	2021-11-18 13:57:16 -08:00
Chen Fu	1c84621020	Adding ARM64 depthwise convolution kernel for symmetric quantization (#9655 ) Adding ARM64 depthwise convolution kernel for symmetric quantization Motivation and Context Two improvements against current kernel code : 1. Signed int8 based instructions, no need to extend from 8b to 16b before multiplication. 2. Unrolled loop with manual software pipelining Co-authored-by: Chen Fu <fuchen@microsoft.com>	2021-11-15 12:18:43 -08:00
RajalakshmiSR	c54ad0dd0b	POWER: Add Dgemm kernel for POWER processor (#9459 ) * POWER: Add Dgemm kernel for POWER processor This patch adds new dgemm kernel specific to POWER processor. * POWER: Restrict new functions to VSX in header * Remove warning check in header * POWER: Dgemm Adjust indentation Fixing indentation based on review comments. Co-authored-by: Rajalakshmi Srinivasaraghavan <rajis@linux.ibm.com>	2021-10-26 20:27:24 -07:00
Yufeng Li	da3dd398c5	Kernels for QLinearConv with symmetrically quantized filter (#9323) Add kernels for QLinearConv with a symmetrically quantized filter, e.g., the filter type is int8 and the zero point of the filter is 0. This PR includes kernels for avx2, avxvnni, avx512 and avx512vnni. Kernels for ARM64 will be added in a following PR. The kernels use the input buffer directly for pointwise conv, and an indirect buffer for depthwise and non-group conv. The advantages of these new kernels are: there is no need to compute the sum for each output image pixel, and the sum/offset of the filter can be combined with the bias. With the indirect buffer, im2col returns an array of buffer pointers instead of memcpy'ing the original data. This saves memcpy time and reduces the size of the intermediate buffer needed to hold the im2col transform. In the future, im2col will be computed ahead of time for inputs with a fixed input size.	2021-10-18 19:40:18 -07:00
Yulong Wang	8c57d51928	support WebAssembly SIMD for qgemm (#9191 ) * support WebAssembly SIMD for qgemm * remove '--experimental-wasm-bulk-memory' for test	2021-09-30 12:40:56 -07:00
Tracy Sharpe	4828d2ebb1	MLAS: port aarch64 sgemv kernel to Windows ARM64 (#9071 )	2021-09-15 18:40:40 -07:00
Rajalakshmi Srinivasaraghavan	e83cc534d4	Fix cmake POWER10 detection Recent commit `60c98a8` changed variable mlas_common_srcs which affects POWER10 detection.	2021-09-12 11:56:55 -07:00
Changming Sun	60c98a86b7	CMake file changes for macOS universal2 support (#8953 )	2021-09-04 13:30:33 -07:00
Chen Fu	00b345eb7b	ARM Neon S8S8 kernel for QGemm (#8695) Using signed int, the qgemm kernel avoids extending uint8 to int16 while computing matrix multiplication, achieving higher performance. We also find that by using only the lower 64b of the vector registers to load the A and B matrices, we can get further performance improvements. We also experimented with using ldp to load two 64b in one shot vs. using two ldr to load one 64b at a time; on both big and little cores, there is no noticeable difference. Submitting the LDP version. At this point we don't need to choose the kernel based on micro-architecture. Inference time of resnet50, thread count 2. Big Core on Pixel 3a: Current master: 292.947 ms; First iteration S8S8: 188.239 ms; LDP load two 64b reg: 178.715 ms; LDR load one 64b reg: 179.536 ms. Little Core: Master: 546.317 ms; S8S8: 513.332 ms; LDP: 489.19 ms; LDR: 497.865 ms. Raspberry Pi 3B+: Master: 660.08 ms; S8S8: 608.577 ms; LDP: 603.675 ms; LDR: 602.075 ms	2021-08-18 09:58:47 -07:00
Tracy Sharpe	539d1d44c1	Optimize ARM64EC build (#8515 ) Add sgemm and qgemm optimized kernels for ARM64EC configuration.	2021-07-27 23:46:39 -07:00
Tracy Sharpe	b2b9de939f	cleanup onnxruntime_mlas.cmake of old gcc workarounds (#8469 )	2021-07-22 22:01:05 -07:00


113 commits