onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-10 17:37:14 +00:00

luoyu-intel 5f00bc9931 Integrate high-performance x64 gemm library to MLAS (#17669)		2023-12-19 09:36:31 -08:00

### Description

Improve MLAS to support high-performance x64 INT4 kernels

### Motivation and Context

1. improve LLM inference performance on Intel CPUs.
2. support more 4bit quantization types: nf4, fp4
3. support dynamic block size: block size aligned with kernel's tiling size (e.g. 4 for VNNI kernel), per channel on N dimension
4. support most Intel ISAs: avx2, avx_vnni, avx512f, avx512_vnni, amx_bf16, amx_int8, avx512_fp16
5. support MatMulNBits' data format

### Tasks

- [x] support block_size: 32, 128, -1 (per channel)
- [x] get weight pack size without memory allocation
- [x] use ort's thread pool for parallelism
- [x] support ISAs: avx2, avx512f, avx_vnni, avx512_vnni, amx_int8

### Benchmark

Ubuntu 20.22 + Intel(R) Xeon(R) Platinum 8480+ 56 cores

| Benchmark | Time | CPU | Iterations |
| -- | -- | -- | -- |
| Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:4096/Threads:56/real_time | 47613 | 47401 | 12970 |
| Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:4096/Threads:56/real_time | 6347792 | 6317562 | 109 |
| Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:4096/Threads:56/real_time | 11814014 | 11757847 | 59 |
| Q4GEMM_Jblas/Q4G128SymInt8/M:1/N:4096/K:4096/Threads:56/real_time | 50222 | 50031 | 13759 |
| Q4GEMM_Jblas/Q4G128SymInt8/M:1024/N:4096/K:4096/Threads:56/real_time | 2038222 | 2028743 | 341 |
| Q4GEMM_Jblas/Q4G128SymInt8/M:2048/N:4096/K:4096/Threads:56/real_time | 3792832 | 3774485 | 191 |
| Q4GEMM_Jblas/Q4GPerNSymInt8/M:1/N:4096/K:4096/Threads:56/real_time | 58717 | 58501 | 11467 |
| Q4GEMM_Jblas/Q4GPerNSymInt8/M:1024/N:4096/K:4096/Threads:56/real_time | 1360846 | 1354598 | 543 |
| Q4GEMM_Jblas/Q4GPerNSymInt8/M:2048/N:4096/K:4096/Threads:56/real_time | 2564232 | 2551365 | 266 |
| Q4GEMM_Jblas/Q4G32SymFp32/M:1/N:4096/K:4096/Threads:56/real_time | 57929 | 57694 | 12047 |
| Q4GEMM_Jblas/Q4G32SymFp32/M:1024/N:4096/K:4096/Threads:56/real_time | 5495330 | 5465810 | 126 |
| Q4GEMM_Jblas/Q4G32SymFp32/M:2048/N:4096/K:4096/Threads:56/real_time | 10676240 | 10617817 | 66 |
| Q4GEMM_Jblas/Q4G128SymFp32/M:1/N:4096/K:4096/Threads:56/real_time | 68305 | 68047 | 10026 |
| Q4GEMM_Jblas/Q4G128SymFp32/M:1024/N:4096/K:4096/Threads:56/real_time | 5504862 | 5476215 | 126 |
| Q4GEMM_Jblas/Q4G128SymFp32/M:2048/N:4096/K:4096/Threads:56/real_time | 11758623 | 11697337 | 66 |
| Q4GEMM_Jblas/Q4GPerNSymFp32/M:1/N:4096/K:4096/Threads:56/real_time | 67713 | 67451 | 10298 |
| Q4GEMM_Jblas/Q4GPerNSymFp32/M:1024/N:4096/K:4096/Threads:56/real_time | 5508325 | 5480237 | 126 |
| Q4GEMM_Jblas/Q4GPerNSymFp32/M:2048/N:4096/K:4096/Threads:56/real_time | 10738528 | 10681656 | 64 |
| Q4GEMM_Jblas/Q4G32AsymFp32/M:1/N:4096/K:4096/Threads:56/real_time | 60708 | 60486 | 11321 |
| Q4GEMM_Jblas/Q4G32AsymFp32/M:1024/N:4096/K:4096/Threads:56/real_time | 5523784 | 5495736 | 126 |
| Q4GEMM_Jblas/Q4G32AsymFp32/M:2048/N:4096/K:4096/Threads:56/real_time | 10829633 | 10772161 | 67 |

Reference:

| Benchmark | Time | CPU | Iterations |
| -- | -- | -- | -- |
| Q4GEMM/Q4Sym/M:1/N:4096/K:4096/Threads:56/real_time | 53088 | 52911 | 13364 |
| Q4GEMM/Q4Sym/M:1024/N:4096/K:4096/Threads:56/real_time | 6268981 | 6230335 | 110 |
| Q4GEMM/Q4Sym/M:2048/N:4096/K:4096/Threads:56/real_time | 11701237 | 11632339 | 59 |

Win11+12900K 8 cores:

| Benchmark | Time | CPU | Iterations |
| -- | -- | -- | -- |
| Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:4096/Threads:8/real_time | 215976 | 211295 | 2884 |
| Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:4096/Threads:8/real_time | 60960590 | 60937500 | 10 |
| Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:4096/Threads:8/real_time | 1.18E+08 | 1.19E+08 | 5 |
| Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:11008/K:4096/Threads:8/real_time | 470377 | 453059 | 1414 |
| Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:11008/K:4096/Threads:8/real_time | 1.54E+08 | 1.53E+08 | 5 |
| Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:11008/K:4096/Threads:8/real_time | 3.18E+08 | 3.13E+08 | 2 |
| Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:4096/K:11008/Threads:8/real_time | 569072 | 559398 | 1229 |
| Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:4096/K:11008/Threads:8/real_time | 1.54E+08 | 1.52E+08 | 4 |
| Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:4096/K:11008/Threads:8/real_time | 3.22E+08 | 3.28E+08 | 2 |
| Q4GEMM_Jblas/Q4G32SymInt8/M:1/N:11008/K:11008/Threads:8/real_time | 1486055 | 1473325 | 403 |
| Q4GEMM_Jblas/Q4G32SymInt8/M:1024/N:11008/K:11008/Threads:8/real_time | 4.14E+08 | 4.14E+08 | 2 |
| Q4GEMM_Jblas/Q4G32SymInt8/M:2048/N:11008/K:11008/Threads:8/real_time | 8.88E+08 | 8.59E+08 | 1 |

---------

Signed-off-by: Mengni Wang <mengni.wang@intel.com>
Co-authored-by: Mengni Wang <mengni.wang@intel.com>

..
external	FIX: Our cmake script didn't check googletest's hash (#18826)	2023-12-15 08:48:15 -08:00
patches	Update absl and gtest to fix an ARM64EC build error (#18735)	2023-12-07 15:55:17 -08:00
tensorboard	Improve dependency management (#13523)	2022-12-01 09:51:59 -08:00
adjust_global_compile_flags.cmake	Update cmake to 3.27 and upgrade Linux CUDA docker files from CentOS7 to UBI8 (#16856)	2023-09-05 18:12:10 -07:00
arm64x.cmake	Build onnxruntime.dll as arm64x (#18633)	2023-12-06 16:49:00 -08:00
CMakeLists.txt	Integrate high-performance x64 gemm library to MLAS (#17669)	2023-12-19 09:36:31 -08:00
CMakeSettings.json
codeconv.runsettings
deps.txt	Update absl and googletest (#18827)	2023-12-14 16:15:07 -08:00
deps_update_and_upload.py	[Linter] Bump ruff and remove pylint (#17797)	2023-10-05 21:07:33 -07:00
EnableVisualStudioCodeAnalysis.props
gdk_toolchain.cmake
Info.plist.in
libonnxruntime.pc.cmake.in
linux_arm32_crosscompile_toolchain.cmake	Add a build validation for Linux ARM64 cross-compile (#18200)	2023-11-08 13:03:18 -08:00
linux_arm64_crosscompile_toolchain.cmake	Add a build validation for Linux ARM64 cross-compile (#18200)	2023-11-08 13:03:18 -08:00
nuget_helpers.cmake
onnxruntime.cmake	Add MacOS build to ORT C Pod (#18550)	2023-11-28 10:11:53 -08:00
onnxruntime_codegen_tvm.cmake	Use target name for flatbuffers (#13991)	2022-12-20 11:44:02 -08:00
onnxruntime_common.cmake	Update C/C++ dependencies: abseil, date, nsync, googletest, wil, mp11, cpuinfo and safeint (#15470)	2023-09-08 13:35:04 -07:00
onnxruntime_compile_triton_kernel.cmake	[ROCm] Add ROCm Triton TunableOp for GroupNorm (#16196)	2023-07-11 13:55:30 +08:00
onnxruntime_config.h.in	Enabling c++ 20 in MacOS build (#16187)	2023-09-26 11:27:02 -07:00
onnxruntime_csharp.cmake	Refactor training build options (#13964)	2023-01-03 13:28:16 -08:00
onnxruntime_flatbuffers.cmake	Rework some external targets to ease building with `-DFETCHCONTENT_FULLY_DISCONNECTED=ON` (#15323)	2023-04-03 17:45:12 -07:00
onnxruntime_framework.cmake	[C#, CPP] Introduce Float16/BFloat16 support and tests for C#, C++ (#16506)	2023-07-14 10:46:52 -07:00
onnxruntime_framework.natvis	[C#, CPP] Introduce Float16/BFloat16 support and tests for C#, C++ (#16506)	2023-07-14 10:46:52 -07:00
onnxruntime_fuzz_test.cmake	Fix fuzz test (#14385)	2023-01-22 22:17:43 -08:00
onnxruntime_graph.cmake	Pre-link when creating static library for apple framework (#18241)	2023-11-03 23:38:29 +10:00
onnxruntime_ios.toolchain.cmake
onnxruntime_java.cmake	Update build option for training in java to enable_training_api (#15638)	2023-04-24 11:53:08 -07:00
onnxruntime_java_unittests.cmake	Update build option for training in java to enable_training_api (#15638)	2023-04-24 11:53:08 -07:00
onnxruntime_kernel_explorer.cmake	[ROCm] TunableOp: Update rocBLAS get_solutions API (since ROCm5.6) (#16657)	2023-07-13 11:20:26 +08:00
onnxruntime_language_interop_ops.cmake	Use target name for flatbuffers (#13991)	2022-12-20 11:44:02 -08:00
onnxruntime_mlas.cmake	Integrate high-performance x64 gemm library to MLAS (#17669)	2023-12-19 09:36:31 -08:00
onnxruntime_nodejs.cmake	Added DML and CUDA provider support in onnxruntime-node (#16050)	2023-08-25 16:57:06 -07:00
onnxruntime_objectivec.cmake	Objective C Training API: TrainingSession (#16374)	2023-06-28 09:13:56 -07:00
onnxruntime_opschema_lib.cmake	Use target name for flatbuffers (#13991)	2022-12-20 11:44:02 -08:00
onnxruntime_optimizer.cmake	Memory optimization refactor and refinement (#17481)	2023-11-23 11:39:00 +08:00
onnxruntime_providers.cmake	Add API for NPU Device Selection in the DML EP (#17612)	2023-10-11 14:53:00 -07:00
onnxruntime_providers_acl.cmake	Split onnxruntime_providers.cmake to multiple (#17853)	2023-10-09 20:33:44 -07:00
onnxruntime_providers_armnn.cmake	Split onnxruntime_providers.cmake to multiple (#17853)	2023-10-09 20:33:44 -07:00
onnxruntime_providers_azure.cmake	Split onnxruntime_providers.cmake to multiple (#17853)	2023-10-09 20:33:44 -07:00
onnxruntime_providers_cann.cmake	Split onnxruntime_providers.cmake to multiple (#17853)	2023-10-09 20:33:44 -07:00
onnxruntime_providers_coreml.cmake	Split onnxruntime_providers.cmake to multiple (#17853)	2023-10-09 20:33:44 -07:00
onnxruntime_providers_cpu.cmake	Split onnxruntime_providers.cmake to multiple (#17853)	2023-10-09 20:33:44 -07:00
onnxruntime_providers_cuda.cmake	MoE with Expert Slicing (#18565)	2023-12-05 16:56:38 -08:00
onnxruntime_providers_dml.cmake	Add API for NPU Device Selection in the DML EP (#17612)	2023-10-11 14:53:00 -07:00
onnxruntime_providers_dnnl.cmake	Split onnxruntime_providers.cmake to multiple (#17853)	2023-10-09 20:33:44 -07:00
onnxruntime_providers_js.cmake	Split onnxruntime_providers.cmake to multiple (#17853)	2023-10-09 20:33:44 -07:00
onnxruntime_providers_migraphx.cmake	CUDA EP vs ROCM EP hipify audit (#17776)	2023-10-13 10:13:53 +08:00
onnxruntime_providers_nnapi.cmake	Split onnxruntime_providers.cmake to multiple (#17853)	2023-10-09 20:33:44 -07:00
onnxruntime_providers_openvino.cmake	Split onnxruntime_providers.cmake to multiple (#17853)	2023-10-09 20:33:44 -07:00
onnxruntime_providers_qnn.cmake	Split onnxruntime_providers.cmake to multiple (#17853)	2023-10-09 20:33:44 -07:00
onnxruntime_providers_rknpu.cmake	Split onnxruntime_providers.cmake to multiple (#17853)	2023-10-09 20:33:44 -07:00
onnxruntime_providers_rocm.cmake	CUDA EP vs ROCM EP hipify audit (#17776)	2023-10-13 10:13:53 +08:00
onnxruntime_providers_tensorrt.cmake	[TensorRT EP] Properly set CUDA_INCLUDE_DIR for onnx-tensorrt (#18274)	2023-11-03 20:04:10 -07:00
onnxruntime_providers_tvm.cmake	Split onnxruntime_providers.cmake to multiple (#17853)	2023-10-09 20:33:44 -07:00
onnxruntime_providers_vitisai.cmake	[VitisAI] 1. api compatbile 2. dynamic load onnx (#18470)	2023-12-14 14:43:41 -08:00
onnxruntime_providers_webnn.cmake	Split onnxruntime_providers.cmake to multiple (#17853)	2023-10-09 20:33:44 -07:00
onnxruntime_providers_xnnpack.cmake	Update XNNPACK to latest version (#18038)	2023-11-03 09:04:28 -07:00
onnxruntime_pyop.cmake	Use target name for flatbuffers (#13991)	2022-12-20 11:44:02 -08:00
onnxruntime_python.cmake	[QNN EP Quantization] Add fusion preprocessing to QNN quantization (#18719)	2023-12-12 08:43:04 -08:00
onnxruntime_rocm_hipify.cmake	MoE with Expert Slicing (#18565)	2023-12-05 16:56:38 -08:00
onnxruntime_session.cmake	added support for cmake "find_package" (#8919)	2023-06-19 22:20:31 -07:00
onnxruntime_snpe_provider.cmake	Use target name for flatbuffers (#13991)	2022-12-20 11:44:02 -08:00
onnxruntime_training.cmake	Triton Codegen for ORTModule (#15831)	2023-07-13 18:17:58 +08:00
onnxruntime_unittests.cmake	Disable mlas unit test in ARM64EC build (#18747)	2023-12-15 09:17:47 -08:00
onnxruntime_util.cmake	Improve dependency management (#13523)	2022-12-01 09:51:59 -08:00
onnxruntime_webassembly.cmake	Update XNNPACK to latest version (#18038)	2023-11-03 09:04:28 -07:00
precompiled_header.cmake
Sdl.ruleset	Add a Github workflow for Prefast (#15763)	2023-05-03 11:42:51 -07:00
set_winapi_family_desktop.h
target_delayload.cmake
uwp_stubs.h	Run clang-format in CI (#15524)	2023-04-18 09:26:58 -07:00
wcos_rules_override.cmake
winml.cmake	Update winml to use #cores - #soc cores by Default as the number of intraopthreads (#18384)	2023-11-28 09:26:48 -08:00
winml_cppwinrt.cmake
winml_sdk_helpers.cmake
winml_unittests.cmake	Update C/C++ dependencies: abseil, date, nsync, googletest, wil, mp11, cpuinfo and safeint (#15470)	2023-09-08 13:35:04 -07:00