vai_q_onnx.quantize_static(
    model_input,
    model_output,
    calibration_data_reader,
    quant_format=vai_q_onnx.VitisQuantFormat.FixNeuron,
    calibrate_method=vai_q_onnx.PowerOfTwoMethod.MinMSE,
    input_nodes=[],
    output_nodes=[],
    op_types_to_quantize=None,
    per_channel=False,
    reduce_range=False,
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
    nodes_to_quantize=None,
    nodes_to_exclude=None,
    optimize_model=True,
    use_external_data_format=False,
    extra_options=None)
Arguments
- model_input
- File path of the model to quantize.
- model_output
- File path of the quantized model.
- calibration_data_reader
- A calibration data reader. It enumerates calibration data and generates
inputs for the original model. If you want to use random data for a quick
test, you can set calibration_data_reader to None.
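For example, a custom reader can wrap your calibration dataset. The following
is a minimal sketch; the class name, the input name "input", and the input
shape are illustrative assumptions and must match your model:

import numpy as np
from onnxruntime.quantization import CalibrationDataReader

class RandomDataReader(CalibrationDataReader):
    # Hypothetical reader that feeds random batches; in practice, return
    # pre-processed samples from your calibration dataset instead.
    def __init__(self, num_batches=10):
        self.data = iter(
            [{"input": np.random.rand(1, 3, 224, 224).astype(np.float32)}
             for _ in range(num_batches)]
        )

    def get_next(self):
        # Return a dict mapping input names to arrays, or None when exhausted.
        return next(self.data, None)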
- quant_format
- The quantization format to use:
  - QOperator: quantizes the model with quantized operators directly.
  - QDQ: quantizes the model by inserting QuantizeLinear/DeQuantizeLinear on
    the tensor. Supports only 8-bit quantization.
  - VitisQuantFormat.QDQ: quantizes the model by inserting
    VAIQuantizeLinear/VAIDeQuantizeLinear on the tensor. Supports more
    bit-widths and configurations.
  - VitisQuantFormat.FixNeuron: quantizes the model by inserting FixNeuron (a
    composition of QuantizeLinear and DeQuantizeLinear) on the tensor.
- calibrate_method
- For DPU devices, set calibrate_method to
vai_q_onnx.PowerOfTwoMethod.NonOverflow or
vai_q_onnx.PowerOfTwoMethod.MinMSE to apply power-of-two scale quantization.
PowerOfTwoMethod currently supports these two methods, and the default is
MinMSE.
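As a sketch, a typical DPU-targeted call combines power-of-two MinMSE
calibration with 8-bit signed types; the file paths are placeholders, and
RandomDataReader is the hypothetical reader sketched above:

import vai_q_onnx
from onnxruntime.quantization import QuantType

vai_q_onnx.quantize_static(
    "model.onnx",          # placeholder input path
    "model_quant.onnx",    # placeholder output path
    RandomDataReader(),    # hypothetical reader from the sketch above
    quant_format=vai_q_onnx.VitisQuantFormat.FixNeuron,
    calibrate_method=vai_q_onnx.PowerOfTwoMethod.MinMSE,
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
)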
- input_nodes
- A list of strings giving the names of the start nodes to be quantized.
Nodes before these start nodes are not optimized or quantized. For example,
this argument can be used to skip some pre-processing nodes or to leave the
first node unquantized. The default value is [].
- output_nodes
- A list of strings giving the names of the end nodes to be quantized. Nodes
after these end nodes are not optimized or quantized. For example, this
argument can be used to skip some post-processing nodes or to leave the last
node unquantized. The default value is [].
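For instance, the sketch below restricts quantization to the subgraph between
two node names; "conv1" and "fc_out" are hypothetical and should be replaced
with names from your own graph:

import vai_q_onnx

vai_q_onnx.quantize_static(
    "model.onnx",
    "model_quant.onnx",
    RandomDataReader(),  # hypothetical reader from the sketch above
    calibrate_method=vai_q_onnx.PowerOfTwoMethod.MinMSE,
    input_nodes=["conv1"],    # skip pre-processing nodes before conv1
    output_nodes=["fc_out"],  # skip post-processing nodes after fc_out
)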
- op_types_to_quantize
- Specifies the types of operators to quantize, such as ['Conv'] to quantize
Conv only. It quantizes all supported operators by default.
- per_channel
- Quantize weights per channel. For DPU, this must be set to False because
the DPU does not currently support per-channel quantization.
- reduce_range
- Quantize weights with 7 bits. The DPU does not support reduce_range, so
this must be set to False.
- activation_type
- Quantization data type of activations. For DPU, this must be set to
QuantType.QInt8.
- weight_type
- Quantization data type of weights. For DPU, this must be set to
QuantType.QInt8. For more details on data type selection, refer to
https://onnxruntime.ai/docs/performance/quantization.html.
- nodes_to_quantize
- List of node names to quantize. When this list is not None, only the nodes
in it are quantized.
- nodes_to_exclude
- List of node names to exclude. When this list is not None, the nodes in it
are excluded from quantization.
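For example, this sketch quantizes only Conv operators while keeping one
(hypothetical) accuracy-sensitive node in floating point:

import vai_q_onnx

vai_q_onnx.quantize_static(
    "model.onnx",
    "model_quant.onnx",
    RandomDataReader(),  # hypothetical reader from the sketch above
    op_types_to_quantize=["Conv"],  # quantize Conv operators only
    nodes_to_exclude=["Conv_0"],    # hypothetical node name kept in float
)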
- optimize_model
- Optimizes the model before quantization. This option is going to be
deprecated soon. It is not recommended because optimization changes the
computation graph, making debugging of quantization loss difficult.
- use_external_data_format
- Option used for quantizing large models (>2 GB). The default value is
False.
- extra_options
- A dictionary of key-value pairs for various options in different cases.
Currently supported keys:
- ActivationSymmetric
- Symmetrize calibration data for activations (the default value is False).
When using a PowerOfTwoMethod calibrate_method, ActivationSymmetric should
always be set to True.
- WeightSymmetric
- Symmetrize calibration data for weights (the default value is True). When
using a PowerOfTwoMethod calibrate_method, WeightSymmetric should always be
set to True.
- ForceQuantizeNoInputCheck
- By default, latent operators such as MaxPool and Transpose are not
quantized if their input is not already quantized. Setting this to True
forces such operators to always quantize their input and generate quantized
output. This behavior can still be disabled per node using nodes_to_exclude.
- MatMulConstBOnly
- The default value is False for static mode. If enabled, only MatMul
operators with a constant B input are quantized.
- AddQDQPairToWeight
- The default value is False, in which case the floating-point weight is
quantized and fed to a single inserted DeQuantizeLinear node. If True, the
weight remains in floating point and a QuantizeLinear/DeQuantizeLinear pair
is inserted on it. In a PowerOfTwoMethod calibrate_method, QuantizeLinear and
DeQuantizeLinear must always appear as a pair, so AddQDQPairToWeight should
always be set to True.
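Putting these notes together, a power-of-two (DPU-style) run typically sets
all three of the keys above; the following is a sketch with placeholder paths
and the hypothetical reader from earlier:

import vai_q_onnx

vai_q_onnx.quantize_static(
    "model.onnx",
    "model_quant.onnx",
    RandomDataReader(),  # hypothetical reader from the sketch above
    quant_format=vai_q_onnx.VitisQuantFormat.QDQ,
    calibrate_method=vai_q_onnx.PowerOfTwoMethod.MinMSE,
    extra_options={
        "ActivationSymmetric": True,  # required for power-of-two calibration
        "WeightSymmetric": True,      # required for power-of-two calibration
        "AddQDQPairToWeight": True,   # keep QuantizeLinear/DeQuantizeLinear paired
    },
)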