vai_q_onnx.quantize_static(
    model_input,
    model_output,
    calibration_data_reader,
    quant_format=vai_q_onnx.VitisQuantFormat.FixNeuron,
    calibrate_method=vai_q_onnx.PowerOfTwoMethod.MinMSE,
    input_nodes=[],
    output_nodes=[],
    op_types_to_quantize=None,
    per_channel=False,
    reduce_range=False,
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
    nodes_to_quantize=None,
    nodes_to_exclude=None,
    optimize_model=True,
    use_external_data_format=False,
    extra_options=None)
Arguments
- model_input
- File path of the model to quantize.
- model_output
- File path of the quantized model.
- calibration_data_reader
- A calibration data reader. It enumerates calibration data and generates
inputs for the original model. If you want to use random data for a quick
test, you can set calibration_data_reader to None.
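For example, a custom reader can wrap your calibration dataset. The following
is a minimal sketch; the class name, the input name "input", and the input
shape are illustrative assumptions and must match your model:

import numpy as np
from onnxruntime.quantization import CalibrationDataReader

class RandomDataReader(CalibrationDataReader):
    # Hypothetical reader that feeds random batches; in practice, return
    # pre-processed samples from your calibration dataset instead.
    def __init__(self, num_batches=10):
        self.data = iter(
            [{"input": np.random.rand(1, 3, 224, 224).astype(np.float32)}
             for _ in range(num_batches)]
        )

    def get_next(self):
        # Return a dict mapping input names to arrays, or None when exhausted.
        return next(self.data, None)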
- quant_format
- The quantization format to use:
  - QOperator: quantizes the model with quantized operators directly.
  - QDQ: quantizes the model by inserting QuantizeLinear/DeQuantizeLinear on
    the tensor. Supports only 8-bit quantization.
  - VitisQuantFormat.QDQ: quantizes the model by inserting
    VAIQuantizeLinear/VAIDeQuantizeLinear on the tensor. Supports more
    bit-widths and configurations.
  - VitisQuantFormat.FixNeuron: quantizes the model by inserting FixNeuron (a
    composition of QuantizeLinear and DeQuantizeLinear) on the tensor.
- calibrate_method
- For DPU devices, set calibrate_method to
vai_q_onnx.PowerOfTwoMethod.NonOverflow or
vai_q_onnx.PowerOfTwoMethod.MinMSE to apply power-of-two scale quantization.
PowerOfTwoMethod currently supports these two methods, and the default is
MinMSE.
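As a sketch, a typical DPU-targeted call combines power-of-two MinMSE
calibration with 8-bit signed types; the file paths are placeholders, and
RandomDataReader is the hypothetical reader sketched above:

import vai_q_onnx
from onnxruntime.quantization import QuantType

vai_q_onnx.quantize_static(
    "model.onnx",          # placeholder input path
    "model_quant.onnx",    # placeholder output path
    RandomDataReader(),    # hypothetical reader from the sketch above
    quant_format=vai_q_onnx.VitisQuantFormat.FixNeuron,
    calibrate_method=vai_q_onnx.PowerOfTwoMethod.MinMSE,
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
)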
- input_nodes
- A list of strings giving the names of the start nodes to be quantized.
Nodes before these start nodes are not optimized or quantized. For example,
this argument can be used to skip some pre-processing nodes or to leave the
first node unquantized. The default value is [].
- output_nodes
- A list of strings giving the names of the end nodes to be quantized. Nodes
after these end nodes are not optimized or quantized. For example, this
argument can be used to skip some post-processing nodes or to leave the last
node unquantized. The default value is [].
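For instance, the sketch below restricts quantization to the subgraph between
two node names; "conv1" and "fc_out" are hypothetical and should be replaced
with names from your own graph:

import vai_q_onnx

vai_q_onnx.quantize_static(
    "model.onnx",
    "model_quant.onnx",
    RandomDataReader(),  # hypothetical reader from the sketch above
    calibrate_method=vai_q_onnx.PowerOfTwoMethod.MinMSE,
    input_nodes=["conv1"],    # skip pre-processing nodes before conv1
    output_nodes=["fc_out"],  # skip post-processing nodes after fc_out
)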
- op_types_to_quantize
- Specifies the types of operators to quantize, such as ['Conv'] to quantize
Conv only. It quantizes all supported operators by default.
- per_channel
- Quantize weights per channel. For DPU, this must be set to False because
the DPU does not currently support per-channel quantization.
- reduce_range
- Quantize weights with 7 bits. The DPU does not support reduce_range, so
this must be set to False.
- activation_type
- Quantization data type of activations. For DPU, this must be set to
QuantType.QInt8.
- weight_type
- Quantization data type of weights. For DPU, this must be set to
QuantType.QInt8. For more details on data type selection, refer to
https://onnxruntime.ai/docs/performance/quantization.html.
- nodes_to_quantize
- List of node names to quantize. When this list is not None, only the nodes
in it are quantized.
- nodes_to_exclude
- List of node names to exclude. When this list is not None, the nodes in it
are excluded from quantization.
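For example, this sketch quantizes only Conv operators while keeping one
(hypothetical) accuracy-sensitive node in floating point:

import vai_q_onnx

vai_q_onnx.quantize_static(
    "model.onnx",
    "model_quant.onnx",
    RandomDataReader(),  # hypothetical reader from the sketch above
    op_types_to_quantize=["Conv"],  # quantize Conv operators only
    nodes_to_exclude=["Conv_0"],    # hypothetical node name kept in float
)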
- optimize_model
- Optimizes the model before quantization. This option is going to be
deprecated soon. It is not recommended because optimization changes the
computation graph, making debugging of quantization loss difficult.
- use_external_data_format
- Option used for quantizing large models (>2 GB). The default value is
False.
- extra_options
- A dictionary of key-value pairs for various options in different cases.
Currently supported keys:
- ActivationSymmetric
- Symmetrize calibration data for activations (the default value is False).
When using a PowerOfTwoMethod calibrate_method, ActivationSymmetric should
always be set to True.
- WeightSymmetric
- Symmetrize calibration data for weights (the default value is True). When
using a PowerOfTwoMethod calibrate_method, WeightSymmetric should always be
set to True.
- ForceQuantizeNoInputCheck
- By default, latent operators such as MaxPool and Transpose are not
quantized if their input is not already quantized. Setting this to True
forces such operators to always quantize their input and generate quantized
output. This behavior can still be disabled per node using nodes_to_exclude.
- MatMulConstBOnly
- The default value is False for static mode. If enabled, only MatMul
operators with a constant B input are quantized.
- AddQDQPairToWeight
- The default value is False, in which case the floating-point weight is
quantized and fed to a single inserted DeQuantizeLinear node. If True, the
weight remains in floating point and a QuantizeLinear/DeQuantizeLinear pair
is inserted on it. In a PowerOfTwoMethod calibrate_method, QuantizeLinear and
DeQuantizeLinear must always appear as a pair, so AddQDQPairToWeight should
always be set to True.
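Putting these notes together, a power-of-two (DPU-style) run typically sets
all three of the keys above; the following is a sketch with placeholder paths
and the hypothetical reader from earlier:

import vai_q_onnx

vai_q_onnx.quantize_static(
    "model.onnx",
    "model_quant.onnx",
    RandomDataReader(),  # hypothetical reader from the sketch above
    quant_format=vai_q_onnx.VitisQuantFormat.QDQ,
    calibrate_method=vai_q_onnx.PowerOfTwoMethod.MinMSE,
    extra_options={
        "ActivationSymmetric": True,  # required for power-of-two calibration
        "WeightSymmetric": True,      # required for power-of-two calibration
        "AddQDQPairToWeight": True,   # keep QuantizeLinear/DeQuantizeLinear paired
    },
)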