onnxruntime

saymrwulf/onnxruntime

Fork 0

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-05-17 21:10:43 +00:00

Commit graph

Author	SHA1	Message	Date
Chen Fu	26b396418d	Block-wise 4b quantization matmul operator change (#18172 ) ### Description Replace block-wise 4b quantization implementation ### Motivation and Context In https://github.com/microsoft/onnxruntime/pull/18101 we have an augmented block-wise 4b quantization interface and implementation. Here we use this new implementation in onnxruntime contrib ops --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2023-11-03 15:29:43 -07:00
Jambay Kinley	d30d4d372a	Add MatMul FP4 and NF4 Support (#18066 ) ### Description Add a contrib op MatMulBnb4 (FP4 and NF4) and related toolchain to support quantization on weight. This PR adds: - schema for contrib op MatMulBnb4 which can support FP4 (4-bit floating point) and NF4 (4-bit NormalFloat) quantization on weight. - a naive implementation for MatMulBnb4 on CPU and GPU, i.e., implemented like MatMul(A, Dequantize(B)). - a special implementation for GemV for MatMulBnb4 and related benchmark tool. - tool to quantize model to FP4 or NF4.	2023-10-25 15:34:58 -07:00
Yufeng Li	11af34440a	Add MatMul 4bits support on GPU (#17890 ) ### Description <!-- Describe your changes. --> Add a contrib op MatMulNBits and related toolchain to support quantization on weight. This PR only adds support for 4bits. It: - add schema for contrib op MatMulNBits which can support 1-7 bits quantization on weight. - a naive implementation for 4bits MatMulNBits on CPU and GPU, i.e., implemented like MatMul(A, Dequantize(B)). - a special implementation for GemV for 4bits MatMulNBits and related benchmark tool - tool to quantization model with 4bits. Next: - add general and more efficient kernels for 4bits MatMulNBits on CPU and GPU	2023-10-13 16:55:30 -07:00

Author

SHA1

Message

Date

Chen Fu

26b396418d

Block-wise 4b quantization matmul operator change (#18172 )

### Description
Replace block-wise 4b quantization implementation


### Motivation and Context
In https://github.com/microsoft/onnxruntime/pull/18101 we have an
augmented block-wise 4b quantization interface and implementation. Here
we use this new implementation in onnxruntime contrib ops

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>

2023-11-03 15:29:43 -07:00

Jambay Kinley

d30d4d372a

Add MatMul FP4 and NF4 Support (#18066 )

### Description
Add a contrib op MatMulBnb4 (FP4 and NF4) and related toolchain to
support quantization on weight.

This PR adds:
- schema for contrib op MatMulBnb4 which can support FP4 (4-bit floating
point) and NF4 (4-bit NormalFloat) quantization on weight.
- a naive implementation for MatMulBnb4 on CPU and GPU, i.e.,
implemented like MatMul(A, Dequantize(B)).
- a special implementation for GemV for MatMulBnb4 and related benchmark
tool.
- tool to quantize model to FP4 or NF4.

2023-10-25 15:34:58 -07:00

Yufeng Li

11af34440a

Add MatMul 4bits support on GPU (#17890 )

### Description
<!-- Describe your changes. -->
Add a contrib op MatMulNBits and related toolchain to support
quantization on weight. This PR only adds support for 4bits. It:

- add schema for contrib op MatMulNBits which can support 1-7 bits
quantization on weight.
- a naive implementation for 4bits MatMulNBits on CPU and GPU, i.e.,
implemented like MatMul(A, Dequantize(B)).
- a special implementation for GemV for 4bits MatMulNBits and related
benchmark tool
- tool to quantization model with 4bits. 

Next:
- add general and more efficient kernels for 4bits MatMulNBits on CPU
and GPU

2023-10-13 16:55:30 -07:00

3 commits