Mirror of https://github.com/saymrwulf/onnxruntime.git, synced 2026-05-29 23:06:41 +00:00
### Description

Blockwise 4b quantization for LLMs.

1. Introduces 4b block-wise quantization for linear layer weights.
2. Implements a matrix multiplication kernel for fp32 x int4.
3. Implements a special operator, MatMulFpQ4.
4. Implements a quantization tool that converts a MatMul operator to MatMulFpQ4 when the right-hand side is a 2D const tensor.

### Motivation and Context

Compress and accelerate LLMs.

| Benchmark | Time (ns) |
|-------------|----------|
| Q4GEMM/Q4Sym/M:1/N:4096/K:4096/Threads:8 | 218054 |
| Q4GEMM/Q4Sym/M:1024/N:4096/K:4096/Threads:8 | 35830155 |
| Q4GEMM/Q4Sym/M:2048/N:4096/K:4096/Threads:8 | 73479790 |
| Q4GEMM/Q4Zp8/M:1/N:4096/K:4096/Threads:8 | 270152 |
| Q4GEMM/Q4Zp8/M:1024/N:4096/K:4096/Threads:8 | 35826721 |
| Q4GEMM/Q4Zp8/M:2048/N:4096/K:4096/Threads:8 | 73021200 |
| Q4GEMM/Q4Sym128/M:1/N:4096/K:4096/Threads:8 | 213832 |
| Q4GEMM/Q4Sym128/M:1024/N:4096/K:4096/Threads:8 | 36749874 |
| Q4GEMM/Q4Sym128/M:2048/N:4096/K:4096/Threads:8 | 72618120 |

| Benchmark | Time (ns) |
|-------------|----------|
| SGEMM/LLM/M:1/N:4096/K:4096/Threads:8 | 522610 |
| SGEMM/LLM/M:1024/N:4096/K:4096/Threads:8 | 39237689 |
| SGEMM/LLM/M:2048/N:4096/K:4096/Threads:8 | 75983467 |

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>

| Name | | |
|---|---|---|
| .. | ||
| c_cxx | ||
| execution_providers/images | ||
| images | ||
| python | ||
| ABI_Dev_Notes.md | ||
| Android_testing.md | ||
| C_API_Guidelines.md | ||
| cmake_guideline.md | ||
| Coding_Conventions_and_Standards.md | ||
| ContribOperators.md | ||
| FAQ.md | ||
| How_To_Update_ONNX_Dev_Notes.md | ||
| Memory_Optimizer.md | ||
| Model_Test.md | ||
| NotesOnThreading.md | ||
| ONNX_Runtime_Server_Usage.md | ||
| onnxruntime_dependencies.dot | ||
| onnxruntime_dependencies.png | ||
| onnxruntime_extensions.md | ||
| OperatorKernels.md | ||
| ORT_Format_Update_in_1.13.md | ||
| ORT_Use_Trtion_Kernel.md | ||
| ORTMobilePackageOperatorTypeSupport.md | ||
| ORTModule_Convergence_Notes.md | ||
| ORTModule_ModuleWithLoss_Wrapper.md | ||
| ORTModule_Training_Guidelines.md | ||
| PR_Guidelines.md | ||
| Privacy.md | ||
| Python_Dev_Notes.md | ||
| Reduced_Operator_Kernel_build.md | ||
| ReleaseManagement.md | ||
| Roadmap.md | ||
| Server.md | ||
| TVM_EP.md | ||
| Versioning.md | ||
| WinML_principles.md | ||
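
The commit description above mentions block-wise 4-bit quantization of linear layer weights. As a rough illustration of the general idea, the following pure-Python sketch quantizes a flat list of weights in fixed-size blocks, storing one floating-point scale per block and two 4-bit codes per byte. The block size of 32, the symmetric scale rule, the nibble order, and all function names here are assumptions made for this example; they are not taken from the MatMulFpQ4 kernel or the quantization tool.

```python
# Illustrative sketch only: symmetric block-wise 4-bit quantization.
# BLOCK_SIZE, the scale rule, and the nibble packing order are assumptions,
# not the storage layout used by the actual MatMulFpQ4 operator.

BLOCK_SIZE = 32  # hypothetical number of weights that share one scale


def quantize_block(values):
    """Map one block of floats to signed 4-bit codes plus a single scale."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return scale, codes


def pack_nibbles(codes):
    """Pack signed 4-bit codes two per byte, low nibble first."""
    if len(codes) % 2:
        codes = codes + [0]  # pad odd-length blocks with a zero code
    packed = bytearray()
    for low, high in zip(codes[0::2], codes[1::2]):
        packed.append((low & 0x0F) | ((high & 0x0F) << 4))
    return bytes(packed)


def unpack_nibbles(packed, count):
    """Recover the first `count` signed 4-bit codes from packed bytes."""
    codes = []
    for byte in packed:
        for nibble in (byte & 0x0F, byte >> 4):
            codes.append(nibble - 16 if nibble >= 8 else nibble)
    return codes[:count]


def quantize(weights, block_size=BLOCK_SIZE):
    """Return a list of (scale, packed_codes, count) tuples, one per block."""
    blocks = []
    for start in range(0, len(weights), block_size):
        chunk = weights[start:start + block_size]
        scale, codes = quantize_block(chunk)
        blocks.append((scale, pack_nibbles(codes), len(chunk)))
    return blocks


def dequantize(blocks):
    """Rebuild approximate float weights from the quantized blocks."""
    restored = []
    for scale, packed, count in blocks:
        restored.extend(scale * code for code in unpack_nibbles(packed, count))
    return restored
```

In this sketch a block of 32 weights occupies 16 bytes of codes plus one scale, and each restored weight differs from the original by at most half the block's scale. A variant with a per-block zero point (as the `Q4Zp8` benchmark name suggests) would store one extra value per block, but that layout is not shown here.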