onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-06-07 00:13:17 +00:00

Jing Fang 9be30348b9 [CPU EP] Add blocked quantization to QuantizeLinear op kernel (#20977)

### Description

Add blocked quantization to the QuantizeLinear op kernel.

If the quantize axis is not the last axis, the tensor is split into 1x128 blocks. Blocks are dispatched to multiple threads for concurrent processing. Currently only scalar instructions are supported.

If the quantize axis is the last axis, the tensor is split into 1 x quant_block_size blocks. Blocks are dispatched to multiple threads for concurrent processing. If the output type is an integer type, the MLAS kernel is called to use SIMD instructions within each block.

#### Benchmark data

20 core 2GHz CPU, RelWithDebInfo config, 196 x 4096 tensor, quantize float to int4x2

Quantize before last axis:
* single thread, scalar instruction: 31380900 ns
* 8 thread, scalar instruction: 5098620 ns

Quantize last axis:
* single thread, scalar instruction: 27927900 ns
* 8 thread, SIMD instruction: 102261 ns

More threads, SIMD instructions, and a larger block size all help.

### Motivation and Context

ONNX added blocked quantization to QuantizeLinear in opset 21.

2024-06-11 20:25:28 -07:00
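
The last-axis case in the commit description above can be illustrated with a minimal pure-Python sketch. It is not the kernel from the commit: the function name `quantize_blocked_last_axis`, its parameters, and the signed int4 range of [-8, 7] are assumptions made for illustration, and it ignores threading, MLAS, and int4x2 packing. It only shows the idea that each run of `block_size` elements along the last axis shares one scale and one zero point.

```python
def quantize_blocked_last_axis(x, scales, zero_points, block_size, qmin=-8, qmax=7):
    """Hypothetical sketch of blocked quantization along the last axis.

    x:           list of rows, each a list of floats
    scales:      per row, one scale for each block of `block_size` elements
    zero_points: per row, one integer zero point for each block
    qmin, qmax:  assumed output range (signed int4 by default)
    """
    out = []
    for row, row_scales, row_zps in zip(x, scales, zero_points):
        q_row = []
        for i, value in enumerate(row):
            block = i // block_size  # index of the block this element falls in
            # Python's round() rounds ties to even.
            q = round(value / row_scales[block]) + row_zps[block]
            q_row.append(max(qmin, min(qmax, q)))  # saturate to the output range
        out.append(q_row)
    return out


x = [[0.5, 1.0, -2.0, 4.0]]
scales = [[0.5, 2.0]]      # two blocks of two elements each
zero_points = [[0, 1]]
result = quantize_blocked_last_axis(x, scales, zero_points, block_size=2)
```

Because every block is independent of the others, blocks can be handed to separate threads, which is the property the commit description relies on.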
contrib_ops	[CUDA] upgrade cutlass to 3.5.0 (#20940)	2024-06-11 13:32:15 -07:00
core	[CPU EP] Add blocked quantization to QuantizeLinear op kernel (#20977)	2024-06-11 20:25:28 -07:00
python	[Quant tool] Improve performance of int4 weight quantization (#20935)	2024-06-05 16:48:40 -07:00
test	[CPU EP] Add blocked quantization to QuantizeLinear op kernel (#20977)	2024-06-11 20:25:28 -07:00
tool/etw
wasm	[js/web] optimize module export and deployment (#20165)	2024-05-20 09:51:16 -07:00
__init__.py	Bump up version in main from 1.18.0 to 1.19.0 (#20489)	2024-04-29 20:21:41 -07:00
ReformatSource.ps1
ReformatSourcePython.bat
VSCodeCoverage.runsettings