onnxruntime/onnxruntime/core
Jing Fang 9be30348b9
[CPU EP] Add blocked quantization to QuantizeLinear op kernel (#20977)
### Description
Add blocked quantization to QuantizeLinear op kernel.

If the quantize axis is not the last axis, block the tensor using 1x128
blocks. Blocks are dispatched to multiple threads for concurrent
processing. Currently only scalar instructions are supported.
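
To make the indexing concrete, here is a minimal pure-Python sketch of blocked quantization along axis 0 of a 2-D tensor. It is not the ORT kernel: the function name, the list-of-lists layout, and the `tile` parameter (standing in for the 1x128 work units) are assumptions for illustration, and the tiles are processed sequentially here rather than on a thread pool. It follows the ONNX formula `saturate(round(x / scale) + zero_point)` with int4 bounds.

```python
def quantize_axis0_blocked(x, scales, zero_points, quant_block_size,
                           qmin=-8, qmax=7, tile=128):
    """Quantize a 2-D list `x` with blocks running along axis 0.

    scales / zero_points have shape [ceil(rows / quant_block_size)][cols]:
    every column has its own scale per block of rows.
    """
    rows, cols = len(x), len(x[0])
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        b = i // quant_block_size  # block index along the quantize axis
        # Each (row, tile) pair is an independent 1 x tile unit of work;
        # a real kernel could hand these units to different threads.
        for start in range(0, cols, tile):
            for j in range(start, min(start + tile, cols)):
                # Python's round() rounds half to even, as ONNX requires.
                q = round(x[i][j] / scales[b][j]) + zero_points[b][j]
                out[i][j] = max(qmin, min(qmax, q))  # saturate to int4
    return out


x = [[1.0, 2.0], [3.0, 4.0], [10.0, 40.0], [20.0, 100.0]]
scales = [[1.0, 2.0], [10.0, 20.0]]
zero_points = [[0, 0], [0, 0]]
result = quantize_axis0_blocked(x, scales, zero_points, quant_block_size=2)
```

Because neighbouring elements along the last axis use different scales in this layout, a tile cannot share one scale across its elements, which is consistent with this path being scalar-only for now.
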

If the quantize axis is the last axis, block the tensor using 1 x
quant_block_size blocks. Blocks are dispatched to multiple threads for
concurrent processing. If the output type is an integer type, call the
MLAS kernel to use SIMD instructions within each block.
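
The last-axis case can be sketched the same way. Again this is an illustrative pure-Python version under stated assumptions (hypothetical function name, list-of-lists tensors, int4 bounds), not the MLAS-backed implementation. Here every run of `quant_block_size` consecutive elements in a row shares one scale and zero point, which is what lets a real kernel process a whole block with vector instructions.

```python
def quantize_last_axis_blocked(x, scales, zero_points, quant_block_size,
                               qmin=-8, qmax=7):
    """Quantize a 2-D list `x` with blocks running along the last axis.

    scales / zero_points have shape [rows][ceil(cols / quant_block_size)].
    """
    out = []
    for row, row_scales, row_zps in zip(x, scales, zero_points):
        q_row = []
        for j, value in enumerate(row):
            b = j // quant_block_size  # block index within the row
            # round() is round-half-to-even, matching the ONNX definition.
            q = round(value / row_scales[b]) + row_zps[b]
            q_row.append(max(qmin, min(qmax, q)))  # saturate to int4
        out.append(q_row)
    return out


x = [[0.0, 1.0, 2.0, 3.0, 10.0, 20.0, 30.0, 100.0]]
scales = [[1.0, 10.0]]      # one scale per block of 4 elements
zero_points = [[0, 0]]
result = quantize_last_axis_blocked(x, scales, zero_points, quant_block_size=4)
```

In the example, the final value `100.0` maps to 10 before saturation and is clamped to the int4 maximum of 7.
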

#### Benchmark data
Setup: 20-core 2 GHz CPU, RelWithDebInfo build, 196 x 4096 tensor,
quantizing float to int4x2.

Quantize before last axis:
 * single thread, scalar instructions: 31380900 ns
 * 8 threads, scalar instructions: 5098620 ns

Quantize last axis:
 * single thread, scalar instructions: 27927900 ns
 * 8 threads, SIMD instructions: 102261 ns
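
As a quick reading aid, the speedups implied by these timings can be computed directly. The snippet below is only arithmetic on the numbers listed above; the dictionary keys are made-up labels. Note that the last-axis comparison changes both the thread count and the instruction set, so it does not isolate either effect.

```python
# Timings in nanoseconds, copied from the benchmark lists above.
timings_ns = {
    "before_last_axis_1_thread_scalar": 31380900,
    "before_last_axis_8_threads_scalar": 5098620,
    "last_axis_1_thread_scalar": 27927900,
    "last_axis_8_threads_simd": 102261,
}


def speedup(baseline_ns, optimized_ns):
    """Return how many times faster the optimized run is."""
    return baseline_ns / optimized_ns


before_last = speedup(timings_ns["before_last_axis_1_thread_scalar"],
                      timings_ns["before_last_axis_8_threads_scalar"])
last = speedup(timings_ns["last_axis_1_thread_scalar"],
               timings_ns["last_axis_8_threads_simd"])
print(f"before last axis: {before_last:.1f}x, last axis: {last:.1f}x")
```

This works out to roughly 6x for the non-last-axis path and roughly 273x for the last-axis path.
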

More threads, SIMD instructions, and a larger block size all help.

### Motivation and Context
ONNX added blocked quantization to QuantizeLinear in opset 21.
2024-06-11 20:25:28 -07:00
codegen
common Update comment in cpuid_info.cc (#20974) 2024-06-10 08:52:38 -05:00
dll
dlpack
eager
flatbuffers Use flatbuffers::String::str instead of c_str. (#20487) 2024-04-27 13:41:38 +10:00
framework Fix compiler error when onnxruntime_DEBUG_NODE_INPUTS_OUTPUTS is enabled (#20889) 2024-05-31 18:07:53 -07:00
graph relax seq len checking in rotary_emb (#20778) 2024-06-08 18:39:06 +08:00
language_interop_ops
mickey [CUDA] upgrade cutlass to 3.5.0 (#20940) 2024-06-11 13:32:15 -07:00
mlas [MLAS] Use C-style casting for power vector instructions (#20957) 2024-06-06 15:11:59 -07:00
optimizer Bug fix for gather fusion with on-device training (#20891) 2024-06-03 14:41:39 -07:00
platform Fully dynamic ETW controlled logging for ORT and QNN logs (#20537) 2024-06-06 21:11:14 -07:00
providers [CPU EP] Add blocked quantization to QuantizeLinear op kernel (#20977) 2024-06-11 20:25:28 -07:00
quantization
session Fully dynamic ETW controlled logging for ORT and QNN logs (#20537) 2024-06-06 21:11:14 -07:00
util [CPU EP] Add blocked quantization to QuantizeLinear op kernel (#20977) 2024-06-11 20:25:28 -07:00