mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-05-17 21:10:43 +00:00


Quantization Tools

Quantization in ONNX Runtime (ORT) refers to 8-bit linear quantization of an ONNX model.

Quantization specifics

During quantization, floating point real values are mapped to an 8-bit quantization space using the form:

VAL_fp32 = Scale * (VAL_quantized - Zero_point)

Scale is a positive real number used to map the floating point numbers to the quantization space. It is calculated as follows.

For unsigned 8 bit:

scale = (data_range_max - data_range_min) / (quantization_range_max - quantization_range_min)

For signed 8 bit:

scale = max(abs(data_range_max), abs(data_range_min)) * 2 / (quantization_range_max - quantization_range_min)

Zero point represents zero in the quantization space. It is important that the floating point value zero be exactly representable in the quantization space. Many CNNs use zero padding, and if zero cannot be represented uniquely after quantization, the padding introduces accuracy errors.
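As a worked illustration of the formulas above, the following minimal sketch computes scale and zero point for a hypothetical data range and round-trips a few values. The helper names, the rounding, and the clamping choices are assumptions made for illustration; the actual ORT implementation may differ in detail.

```python
def uint8_params(data_min, data_max, qmin=0, qmax=255):
    """Scale and zero point for unsigned 8-bit quantization (illustrative)."""
    # Widen the range to include 0.0 so that zero maps to an exact integer.
    data_min = min(data_min, 0.0)
    data_max = max(data_max, 0.0)
    scale = (data_max - data_min) / (qmax - qmin)
    if scale == 0.0:  # all-zero tensor: any positive scale works
        scale = 1.0
    zero_point = int(round(qmin - data_min / scale))
    return scale, max(qmin, min(qmax, zero_point))


def int8_params(data_min, data_max, qmin=-128, qmax=127):
    """Scale and zero point for signed 8-bit quantization (symmetric range)."""
    scale = max(abs(data_max), abs(data_min)) * 2 / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0
    return scale, 0


def quantize(value, scale, zero_point, qmin, qmax):
    """Map a float to the integer grid, clamping to the representable range."""
    q = int(round(value / scale)) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """VAL_fp32 = Scale * (VAL_quantized - Zero_point)"""
    return scale * (q - zero_point)


# Hypothetical activation range of [-1.0, 3.0]
scale, zp = uint8_params(-1.0, 3.0)
for x in (-1.0, 0.0, 0.5, 3.0):
    q = quantize(x, scale, zp, 0, 255)
    print(x, q, dequantize(q, scale, zp))
```

Note that 0.0 quantizes to the zero point itself and dequantizes back to exactly 0.0, which is the property the paragraph above asks for.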

Quantizing an onnx model

There are 3 ways of quantizing a model: dynamic, static and quantization-aware training (QAT) quantization.

Dynamic quantization: This method calculates the quantization parameters (scale and zero point) for activations dynamically, at inference time.
Static quantization: This method uses calibration data to calculate the quantization parameters of activations ahead of time.
Quantization-aware training quantization: The quantization parameters of activations are calculated during training, and the training process can constrain activations to a certain range.

List of Supported Quantized Ops:

The following ops were chosen as phase 1 ops because in most CNN models they consume the largest share of compute and power, so quantizing them yields the greatest performance benefit.

Conv
MatMul
MaxPool
Relu
Clip
Add (Experimental)
Mul (Experimental)

Quantization and model opset versions

Quantization is fairly new in ONNX and ONNX Runtime. Quantization ops were introduced in ONNX opset version 10, so the model being quantized must be opset 10 or higher. If the model opset version is lower than 10, it is recommended to reconvert the model to ONNX from its original framework using the latest opset.
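A quick way to check this requirement before quantizing is to read the opset recorded in the model. The sketch below assumes the model object exposes an `opset_import` list whose entries have `domain` and `version` fields, as a model loaded with `onnx.load` does; the helper names are hypothetical.

```python
from types import SimpleNamespace

MIN_QUANT_OPSET = 10  # quantization ops were introduced in opset 10


def default_opset_version(model):
    """Return the opset version of the default ONNX domain, or None."""
    for opset in model.opset_import:
        if opset.domain in ("", "ai.onnx"):
            return opset.version
    return None


def supports_quantization(model):
    """True when the default-domain opset is at least MIN_QUANT_OPSET."""
    version = default_opset_version(model)
    return version is not None and version >= MIN_QUANT_OPSET


def check_model_file(path):
    """Load a model from disk and check its opset."""
    import onnx  # imported lazily so the helpers above work without onnx
    return supports_quantization(onnx.load(path))


# Stand-in object with the same attribute shape as a loaded model
old_model = SimpleNamespace(opset_import=[SimpleNamespace(domain="", version=9)])
print(supports_quantization(old_model))  # False
```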

Quantization and Graph Optimization

Please note quantization and graph optimizations may not always work together.

Quantizing an optimized model

If a model is optimized using level 99 (i.e. all possible optimizations are run on that model), the optimizations may transform the model in a way that quantization can no longer be applied. In that case, running the quantization script leaves the model unchanged.

Optimizing a quantized model

The same applies the other way round. After quantizing a model, some graph optimizations that might otherwise have been applicable may no longer apply.

Model owners are advised to be aware of this and to run perf evaluations to understand which technique gives the best performance for their model.
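One simple way to run such a perf evaluation is to time each model variant under the same input. The helper below is a minimal sketch with hypothetical names; it takes zero-argument callables so that it stays independent of any particular runtime.

```python
import statistics
import time


def median_latency_ms(run_once, warmup=3, repeats=20):
    """Median wall-clock latency of run_once() in milliseconds."""
    for _ in range(warmup):  # warm-up runs are not measured
        run_once()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def compare(candidates, warmup=3, repeats=20):
    """candidates maps a label to a zero-argument callable; returns label -> ms."""
    return {
        name: median_latency_ms(fn, warmup=warmup, repeats=repeats)
        for name, fn in candidates.items()
    }


# Dummy workloads standing in for the model variants being compared
print(compare({
    "variant_a": lambda: sum(range(10_000)),
    "variant_b": lambda: sum(range(1_000)),
}))
```

In practice each callable would wrap something like `session.run(None, inputs)` on an `onnxruntime.InferenceSession` created for the optimized, quantized, or combined model.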

Quantization API

Quantization has 3 main APIs: quantize_dynamic, quantize_static, and quantize_qat, which correspond to dynamic quantization, static quantization and quantization-aware training quantization respectively.

Options

See below for a description of the common options to quantize_dynamic, quantize_static and quantize_qat:

model_input:

file path of the model to quantize
model_output:

file path where the quantized model is saved
op_types_to_quantize: default: []

Specifies the types of operators to quantize, e.g. ['Conv'] to quantize Conv only. All supported operators are quantized by default.
per_channel: default: False

If True, weights of Conv nodes are quantized per output channel.

If False, they are quantized per tensor. Refer to QLinearConv for more information.
activation_type: default: QuantType.QUInt8

Quantization data type of activations. It can be QuantType.QInt8 or QuantType.QUInt8.
weight_type: default: QuantType.QUInt8

Quantization data type of weights. It can be QuantType.QInt8 or QuantType.QUInt8.
nodes_to_quantize: default: []

List of node names to quantize. When this list is not empty, only the nodes in this list are quantized. Example: [ 'Conv__224', 'Conv__252' ]
nodes_to_exclude: default: []

List of node names to exclude. When this list is not empty, the nodes in it are excluded from quantization.

In addition, the user needs to provide an implementation of CalibrationDataReader for quantize_static. CalibrationDataReader takes in the calibration data and generates the inputs of the model.
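A data reader can be sketched as follows. This assumes the reader contract is a `get_next()` method that returns a dict mapping model input names to data, and `None` once the calibration data is exhausted. The class name and sample data are hypothetical, and plain lists stand in for the NumPy arrays a real model would need.

```python
class ListDataReader:
    """Feeds pre-built calibration samples one at a time.

    Assumption: in real use this class would subclass
    onnxruntime.quantization.CalibrationDataReader and return NumPy arrays.
    """

    def __init__(self, input_name, samples):
        self._input_name = input_name
        self._samples = iter(samples)

    def get_next(self):
        # Return one input dict per call, or None when no samples remain.
        sample = next(self._samples, None)
        if sample is None:
            return None
        return {self._input_name: sample}


# "input" is a hypothetical model input name
reader = ListDataReader("input", [[0.1, 0.2], [0.3, 0.4]])
batch = reader.get_next()
while batch is not None:
    print(batch)
    batch = reader.get_next()
```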

Example

Dynamic quantization

import onnx
from onnxruntime.quantization import quantize_dynamic, QuantType

model_fp32 = 'path/to/the/model.onnx'
model_quant = 'path/to/the/model.quant.onnx'
quantized_model = quantize_dynamic(model_fp32, model_quant, weight_type=QuantType.QUInt8)
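The options described earlier can be combined in the same call. The following sketch restricts dynamic quantization to MatMul nodes and skips a caller-supplied list of nodes; the wrapper function and its argument names are hypothetical, while the keyword arguments are the ones listed under Options.

```python
def quantize_matmul_only(model_input, model_output, skip_nodes=()):
    """Dynamically quantize only MatMul nodes, excluding the named nodes."""
    # Imported lazily so that defining this helper does not require onnxruntime.
    from onnxruntime.quantization import quantize_dynamic, QuantType

    quantize_dynamic(
        model_input,
        model_output,
        op_types_to_quantize=["MatMul"],    # leave all other op types in fp32
        weight_type=QuantType.QUInt8,
        nodes_to_exclude=list(skip_nodes),  # node names to keep unquantized
    )
    return model_output


# Example call (paths as in the example above; the node name is made up):
# quantize_matmul_only('path/to/the/model.onnx',
#                      'path/to/the/model.quant.onnx',
#                      skip_nodes=['MatMul__12'])
```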

QAT quantization

import onnx
from onnxruntime.quantization import quantize_qat, QuantType

model_fp32 = 'path/to/the/model.onnx'
model_quant = 'path/to/the/model.quant.onnx'
quantized_model = quantize_qat(model_fp32, model_quant)

Static quantization

Please refer to ./E2E_example_model for an example of static quantization.
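For orientation, a static quantization call might look like the sketch below. It assumes the data reader exposes a callable `get_next()` method and that the reader is passed as the third argument of quantize_static; the wrapper and the validation helper are hypothetical and are not part of the example directory.

```python
def check_reader(reader):
    """Fail early if the object does not look like a calibration data reader."""
    if not callable(getattr(reader, "get_next", None)):
        raise TypeError("calibration reader must define a callable get_next()")
    return reader


def run_static_quantization(model_input, model_output, reader):
    """Quantize model_input with calibration data and write model_output."""
    # Imported lazily so that check_reader can be used without onnxruntime.
    from onnxruntime.quantization import quantize_static

    quantize_static(model_input, model_output, check_reader(reader))
    return model_output
```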

Calibration support for Static Quantization

MinMax static calibration

This quantization tool also provides an API for generating a calibration table using the MinMax algorithm. As previously mentioned, users need to provide an implementation of CalibrationDataReader. data_reader.py is an example of a data reader implementation with both serial and batch processing. After calling the API, calibration tables are generated in three different formats with the filename calibration.* (FlatBuffers, Python dictionary and plain text). Note: In order to include all tensors from the model for better calibration, please run symbolic_shape_infer.py first. (see here)

Example

data_reader = YoloV3DataReader(calibration_dataset, model_path=augmented_model_path)
generate_calibration_table(model_path, augmented_model_path, data_reader, calibration_dataset=calibration_dataset, stride=1000, batch_size=20)

Please see E2E_example_model/e2e_user_yolov3_example.py for more details.
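The idea behind MinMax calibration can be sketched in a few lines: track the smallest and largest value seen for each tensor across all calibration batches. The code below is an illustration only, not the tool's implementation; the function names, tensor names, and the plain-text layout are assumptions.

```python
def update_minmax(table, tensor_name, values):
    """Merge the min/max of one batch of values into the running table."""
    lo, hi = min(values), max(values)
    if tensor_name in table:
        old_lo, old_hi = table[tensor_name]
        lo, hi = min(lo, old_lo), max(hi, old_hi)
    table[tensor_name] = (lo, hi)


def build_table(batches):
    """batches: iterable of dicts mapping tensor name -> flat list of floats."""
    table = {}
    for batch in batches:
        for name, values in batch.items():
            update_minmax(table, name, values)
    return table


def to_plain_text(table):
    """One 'name min max' line per tensor (hypothetical layout)."""
    return "\n".join(
        f"{name} {lo} {hi}" for name, (lo, hi) in sorted(table.items())
    )


# Made-up intermediate outputs from two calibration batches
batches = [
    {"conv1_out": [-0.5, 2.0], "relu1_out": [0.0, 2.0]},
    {"conv1_out": [-1.5, 1.0], "relu1_out": [0.0, 1.0]},
]
print(to_plain_text(build_table(batches)))
```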

Evaluation for quantization

COCO dataset evaluation

This tool integrates the COCO dataset API to evaluate model predictions. Please make sure to install the COCO API first (pip install pycocotools).

Example

dr = YoloV3DataReader(validation_dataset, model_path=model_path, start_index=i, size_limit=stride, batch_size=20, is_evaluation=True)
evaluator = YoloV3Evaluator(model_path, dr, providers=providers)
evaluator.predict()
results += evaluator.get_result()
...
evaluator.evaluate(results, annotations)

Please see E2E_example_model/e2e_user_yolov3_example.py for more details.
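For readers unfamiliar with the COCO API, the sketch below shows how detection results are typically shaped and scored with pycocotools. It illustrates the library's usage and makes no claim about how YoloV3Evaluator works internally; the helper names and the corner-format input boxes are assumptions.

```python
def to_coco_results(image_id, detections):
    """Convert (category_id, score, x1, y1, x2, y2) tuples to COCO result dicts."""
    results = []
    for category_id, score, x1, y1, x2, y2 in detections:
        results.append({
            "image_id": image_id,
            "category_id": category_id,
            "bbox": [x1, y1, x2 - x1, y2 - y1],  # COCO boxes are [x, y, width, height]
            "score": score,
        })
    return results


def evaluate_bbox(annotation_file, results):
    """Score bounding-box results against a COCO annotation file."""
    # Imported lazily so that to_coco_results works without pycocotools.
    from pycocotools.coco import COCO
    from pycocotools.cocoeval import COCOeval

    coco_gt = COCO(annotation_file)
    coco_dt = coco_gt.loadRes(results)
    coco_eval = COCOeval(coco_gt, coco_dt, "bbox")
    coco_eval.evaluate()
    coco_eval.accumulate()
    coco_eval.summarize()  # prints the standard AP/AR summary
    return coco_eval.stats


# A single made-up detection for image 7
print(to_coco_results(7, [(1, 0.9, 10.0, 20.0, 50.0, 80.0)]))
```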
