mirror of
https://github.com/saymrwulf/onnxruntime.git
synced 2026-05-30 23:18:20 +00:00
### Description

Blockwise 4-bit quantization for LLMs.

1. Introduces 4-bit block-wise quantization for linear layer weights.
2. Implements a matrix multiplication kernel for fp32 x int4.
3. Implements a special operator, MatMulFpQ4.
4. Implements a quantization tool that converts a MatMul operator to MatMulFpQ4 when the right-hand side is a 2D constant tensor.

### Motivation and Context

Compress and accelerate LLMs.

| Benchmark | Time (ns) |
|-------------|----------|
| Q4GEMM/Q4Sym/M:1/N:4096/K:4096/Threads:8 | 218054 |
| Q4GEMM/Q4Sym/M:1024/N:4096/K:4096/Threads:8 | 35830155 |
| Q4GEMM/Q4Sym/M:2048/N:4096/K:4096/Threads:8 | 73479790 |
| Q4GEMM/Q4Zp8/M:1/N:4096/K:4096/Threads:8 | 270152 |
| Q4GEMM/Q4Zp8/M:1024/N:4096/K:4096/Threads:8 | 35826721 |
| Q4GEMM/Q4Zp8/M:2048/N:4096/K:4096/Threads:8 | 73021200 |
| Q4GEMM/Q4Sym128/M:1/N:4096/K:4096/Threads:8 | 213832 |
| Q4GEMM/Q4Sym128/M:1024/N:4096/K:4096/Threads:8 | 36749874 |
| Q4GEMM/Q4Sym128/M:2048/N:4096/K:4096/Threads:8 | 72618120 |

| Benchmark | Time (ns) |
|-------------|----------|
| SGEMM/LLM/M:1/N:4096/K:4096/Threads:8 | 522610 |
| SGEMM/LLM/M:1024/N:4096/K:4096/Threads:8 | 39237689 |
| SGEMM/LLM/M:2048/N:4096/K:4096/Threads:8 | 75983467 |

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>

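
To make the idea of block-wise 4-bit weight quantization concrete, here is a minimal pure-Python sketch. It is not the kernel or the tool from this commit: the function names, the block size of 32, the symmetric mapping with an offset of 8, and the one-scale-per-block layout are all assumptions chosen for illustration. It shows one way to quantize a weight column block by block and to compute an fp32 x int4 dot product without materializing the dequantized weights.

```python
from typing import List, Tuple

# Assumed block length, for illustration only.
BLOCK_SIZE = 32

QBlock = Tuple[float, List[int]]  # (scale, 4-bit codes in [0, 15])


def quantize_block_sym(block: List[float]) -> QBlock:
    """Symmetric 4-bit quantization of one block (hypothetical scheme).

    Values are mapped to integers in [-7, 7] and stored with an offset
    of 8, so every code fits in an unsigned 4-bit nibble.
    """
    amax = max((abs(v) for v in block), default=0.0)
    scale = amax / 7.0 if amax > 0.0 else 1.0
    codes = [min(15, max(0, round(v / scale) + 8)) for v in block]
    return scale, codes


def dequantize_block_sym(scale: float, codes: List[int]) -> List[float]:
    """Recover approximate fp values from a quantized block."""
    return [(c - 8) * scale for c in codes]


def quantize_column(col: List[float], block_size: int = BLOCK_SIZE) -> List[QBlock]:
    """Split one weight column along K into blocks and quantize each block."""
    return [
        quantize_block_sym(col[i:i + block_size])
        for i in range(0, len(col), block_size)
    ]


def dot_fp32_q4(x: List[float], qcol: List[QBlock], block_size: int = BLOCK_SIZE) -> float:
    """Dot product of an fp activation vector with a quantized weight column.

    The per-block scale is applied once per block rather than once per
    element, which is what makes the block-wise layout cheap to consume.
    """
    acc = 0.0
    for b, (scale, codes) in enumerate(qcol):
        start = b * block_size
        xs = x[start:start + len(codes)]
        acc += scale * sum(xi * (c - 8) for xi, c in zip(xs, codes))
    return acc
```

With this scheme, each reconstructed weight differs from the original by at most half of its block's scale, so smaller blocks track local weight ranges more closely at the cost of storing more scales.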
| | | |
|---|---|---|
| external | ||
| patches | ||
| tensorboard | ||
| adjust_global_compile_flags.cmake | ||
| CMakeLists.txt | ||
| CMakeSettings.json | ||
| codeconv.runsettings | ||
| deps.txt | ||
| EnableVisualStudioCodeAnalysis.props | ||
| gdk_toolchain.cmake | ||
| Info.plist.in | ||
| libonnxruntime.pc.cmake.in | ||
| nuget_helpers.cmake | ||
| onnxruntime.cmake | ||
| onnxruntime_codegen_tvm.cmake | ||
| onnxruntime_common.cmake | ||
| onnxruntime_compile_triton_kernel.cmake | ||
| onnxruntime_config.h.in | ||
| onnxruntime_csharp.cmake | ||
| onnxruntime_flatbuffers.cmake | ||
| onnxruntime_framework.cmake | ||
| onnxruntime_framework.natvis | ||
| onnxruntime_fuzz_test.cmake | ||
| onnxruntime_graph.cmake | ||
| onnxruntime_ios.toolchain.cmake | ||
| onnxruntime_java.cmake | ||
| onnxruntime_java_unittests.cmake | ||
| onnxruntime_kernel_explorer.cmake | ||
| onnxruntime_language_interop_ops.cmake | ||
| onnxruntime_mlas.cmake | ||
| onnxruntime_nodejs.cmake | ||
| onnxruntime_objectivec.cmake | ||
| onnxruntime_opschema_lib.cmake | ||
| onnxruntime_optimizer.cmake | ||
| onnxruntime_providers.cmake | ||
| onnxruntime_pyop.cmake | ||
| onnxruntime_python.cmake | ||
| onnxruntime_rocm_hipify.cmake | ||
| onnxruntime_session.cmake | ||
| onnxruntime_snpe_provider.cmake | ||
| onnxruntime_training.cmake | ||
| onnxruntime_unittests.cmake | ||
| onnxruntime_util.cmake | ||
| onnxruntime_webassembly.cmake | ||
| precompiled_header.cmake | ||
| Sdl.ruleset | ||
| set_winapi_family_desktop.h | ||
| target_delayload.cmake | ||
| uwp_stubs.h | ||
| wcos_rules_override.cmake | ||
| winml.cmake | ||
| winml_cppwinrt.cmake | ||
| winml_sdk_helpers.cmake | ||
| winml_unittests.cmake | ||