Quantization Strategy Configuration

Vitis AI User Guide (UG1414)

vai_q_pytorch supports a quantization configuration file in JSON format for specifying multiple quantization strategy configurations.

Usage

To activate the customized configuration, pass the configuration file to the torch_quantizer API:
config_file = "./pytorch_quantize_config.json"
quantizer = torch_quantizer(quant_mode=quant_mode,
                            module=model,
                            input_args=(input,),
                            device=device,
                            quant_config_file=config_file)
The ./example/ directory contains three example configuration files: int_config.json, bfloat16_config.json, and mix_precision_config.json. To quantize the model with one of them, pass the file through the --config_file option:
python resnet18_quant.py --quant_mode calib --config_file int_config.json
python resnet18_quant.py --quant_mode test --config_file int_config.json
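For context, the following is a condensed sketch of what a script like resnet18_quant.py does with these options. The model, dummy input, and single forward pass are illustrative stand-ins; the real example script loads pretrained weights and runs a full evaluation loop:
import torch
from torchvision.models import resnet18
from pytorch_nndct.apis import torch_quantizer

quant_mode = "calib"  # first run; use "test" on the second run
config_file = "./int_config.json"

model = resnet18().eval()
dummy_input = torch.randn(1, 3, 224, 224)

quantizer = torch_quantizer(quant_mode=quant_mode,
                            module=model,
                            input_args=(dummy_input,),
                            device=torch.device("cpu"),
                            quant_config_file=config_file)
quant_model = quantizer.quant_model

# calib: forward passes collect calibration statistics;
# test: forward passes evaluate the quantized model.
quant_model(dummy_input)

if quant_mode == "calib":
    quantizer.export_quant_config()  # writes the calibration results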
In the example configuration file, the model configuration in overall_quantize_config is set to the entropy calibration method and per-tensor quantization:
"overall_quantize_config": {
  ...
  "method": "entropy",
  ...
  "per_channel": false,
  ...
},
The tensor_quantize_config section sets the weight configuration to the maxmin calibration method and per-tensor quantization, meaning the weights use a different calibration method from the model-level configuration:
"tensor_quantize_config": {
  ...
  "weights": {
    ...
    "method": "maxmin",
    ...
    "per_channel": false,
    ...
  }
}
The layer_quantize_config list contains a single layer-level quantization setup. The setup is selected by layer type and applies per-channel weight quantization to torch.nn.Conv2d layers through a tensor-level override (shown in full under Layer Configurations below):
"layer_quantize_config": [
  {
    "layer_type": "torch.nn.Conv2d",
    ...
    "overall_quantize_config": {
      ...
      "per_channel": false,

Configurations

convert_relu6_to_relu
(Global quantizer setting) Whether to convert ReLU6 to ReLU. Options: True or False.
include_cle
(Global quantizer setting) Whether to use cross-layer equalization. Options: True or False.
include_bias_corr
(Global quantizer setting) Whether to use bias correction. Options: True or False.
target_device
(Global quantizer setting) Target device type for the quantized model. Options: DPU, CPU, GPU.
quantizable_data_type
(Global quantizer setting) Tensor types to be quantized in the model.
datatype
(Tensor quantization setting) Data type used in quantization. Options: int, bfloat16, float16, float32.
bit_width
(Tensor quantization setting) Bit width used in quantization. Only applicable when the data type is int.
method
(Tensor quantization setting) Method used to calibrate the quantization scale. Options: maxmin, percentile, entropy, mse, diffs. Only applicable when the data type is int.
round_mode
(Tensor quantization setting) Rounding method in the quantization process. Options: half_even, half_up, half_down, std_round. Only applicable when the data type is int.
symmetry
(Tensor quantization setting) Whether to use symmetric quantization. Options: True or False. Only applicable when the data type is int.
per_channel
(Tensor quantization setting) Whether to use per_channel quantization. Options: True or False. Only applicable when the data type is int.
signed
(Tensor quantization setting) Whether to use signed quantization. Options: True or False. Only applicable when the data type is int.
narrow_range
(Tensor quantization setting) Whether to use symmetric integer range for signed quantization. Options: True or False. Only applicable when the data type is int.
scale_type
(Tensor quantization setting) Scale type used in the quantization process. Options: float, poweroftwo. Only applicable when the data type is int.
calib_statistic_method
(Tensor quantization setting) Method for selecting the optimal quantization scale when multiple batch data have different scales. Options: modal, max, mean, median. Only applicable when the data type is int.
Hierarchical Configuration: Quantization configuration follows a hierarchical workflow.
  • When the configuration file is not provided in the torch_quantizer API, the default configuration, tailored for the DPU device and using the poweroftwo quantization method, is applied.
  • When a configuration file is provided, the model configuration, encompassing global quantizer settings and global tensor quantization settings, overrides default settings.
  • When only the model configuration is specified in the file, all tensors within the model will adopt the same configuration.
  • You can use the layer configuration to assign specific configuration parameters to specific layers.
Default Configurations:
The following are the details of the default configuration:
"convert_relu6_to_relu": false,
"include_cle": true,
"include_bias_corr": true,
"target_device": "DPU",
"quantizable_data_type": [
  "input", 
  "weights", 
  "bias", 
  "activation"],
"datatype": "int",
"bit_width": 8, 
"method": "diffs", 
"round_mode": "std_round", 
"symmetry": true, 
"per_channel": false, 
"signed": true, 
"narrow_range": false, 
"scale_type": "poweroftwo", 
"calib_statistic_method": "modal"
Model Configurations:
In the example configuration file int_config.json, all tensors in the model are assigned the same int8 quantization configurations. In such cases, the global quantization parameters must be specified under the overall_quantize_config keyword as follows:
  "convert_relu6_to_relu": false,
  "include_cle": false,
  "keep_first_last_layer_accuracy": false,
  "keep_add_layer_accuracy": false,
  "include_bias_corr": false,
  "target_device": "CPU",
  "quantizable_data_type": [
    "input",
    "weights",
    "bias",
    "activation"],
"overall_quantize_config": {
    "datatype": "int",
    "bit_width": 8, 
    "method": "maxmin", 
    "round_mode": "half_even", 
    "symmetry": true, 
    "per_channel": false, 
    "signed": true, 
    "narrow_range": false, 
    "scale_type": "float", 
    "calib_statistic_method": "max"
}
Similar to int_config.json, all tensors in the model are configured with the same bfloat16 quantization settings in bfloat16_config.json. In this case, the data type is the only parameter that needs to be specified in the global quantization configuration:
  "convert_relu6_to_relu": false,
  "convert_silu_to_hswish": false,
  "include_cle": false,
  "keep_first_last_layer_accuracy": false,
  "keep_add_layer_accuracy": false,
  "include_bias_corr": false,
  "target_device": "CPU",
  "quantizable_data_type": [
    "input",
    "weights",
    "bias",
    "activation"
  ],
  "overall_quantize_config": {
    "datatype": "bfloat16"
  }
Optionally, the quantization configuration of different tensors in the model can be set separately. These configurations must be placed under the tensor_quantize_config keyword. For example, in the configuration file mix_precision_config.json, the global quantization data type is bfloat16, while the data type of the bias is changed to float16. The remaining parameters stay consistent with the global settings:
"tensor_quantize_config": {

    "bias": {
        "datatype": "float16", 
    } 
}
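Combined with the global settings shown for bfloat16_config.json, the core of mix_precision_config.json plausibly reads as follows (a sketch; only the quantization-related keys are shown):
{
  "target_device": "CPU",
  "quantizable_data_type": ["input", "weights", "bias", "activation"],
  "overall_quantize_config": {
    "datatype": "bfloat16"
  },
  "tensor_quantize_config": {
    "bias": {
      "datatype": "float16"
    }
  }
}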
Layer Configurations:
Layer quantization configurations must be added to the layer_quantize_config list. Each layer is identified by two parameters, the layer type and the layer name. There are five notes to consider when performing layer configuration:
  • Each layer configuration must be in dictionary format.
  • In each layer configuration, the quantizable_data_type and overall_quantize_config parameters are required. In the overall_quantize_config parameter, all quantization parameters for this layer must be included.
  • If the setting is based on layer type, the layer_name parameter should be null.
  • If the setting is based on the layer name, you need to perform a calibration process on the model first. After calibration, pick the required layer name from the Python file generated in the quantize_result directory, and ensure that the layer_type parameter is null.
  • Similar to the model configuration, the quantization configuration for different tensors within a layer can be customized separately. These individual tensor configurations should be specified using the tensor_quantize_config keyword.
In the example configuration file, there are two layer configurations: one based on layer type and the other based on layer name. In the configuration based on layer type, the torch.nn.Conv2d layer is given specific quantization parameters: the per_channel parameter of the weights is set to true, and the method parameter of the activation is set to entropy:
{
  "layer_type": "torch.nn.Conv2d",
  "layer_name": null,
  "quantizable_data_type": [
    "weights",
    "bias",
    "activation"],
  "overall_quantize_config": {
    "bit_width": 8,
    "method": "maxmin",
    "round_mode": "half_even",
    "symmetry": true,
    "per_channel": false,
    "signed": true,
    "narrow_range": false,
    "scale_type": "float",
    "calib_statistic_method": "max"
  },
  "tensor_quantize_config": {
    "weights": {
      "per_channel": true
    },
    "activation": {
      "method": "entropy"
    }
  }
}

In the layer configuration based on layer name, the layer named ResNet::ResNet/Conv2d[conv1]/input.2 is given specific quantization parameters: the round_mode of the activation in this layer is set to half_up:

{
  "layer_type": null,
  "layer_name": "ResNet::ResNet/Conv2d[conv1]/input.2",
  "quantizable_data_type": [
    "weights",
    "bias",
    "activation"],
  "overall_quantize_config": {
    "bit_width": 8,
    "method": "maxmin",
    "round_mode": "half_even",
    "symmetry": true,
    "per_channel": false,
    "signed": true,
    "narrow_range": false,
    "scale_type": "float",
    "calib_statistic_method": "max"
  },
  "tensor_quantize_config": {
    "activation": {
      "round_mode": "half_up"
    }
  }
}
The layer name ResNet::ResNet/Conv2d[conv1]/input.2 is taken from the file quantize_result/ResNet.py, generated by the example code example/resnet18_quant.py:
  • Run the example code with the python resnet18_quant.py --quant_mode calib --subset_len 100 command. The quantize_result/ResNet.py file is generated.
  • In the file, the name of the first convolution layer is ResNet::ResNet/Conv2d[conv1]/input.2.
  • Copy the layer name to the quantization configuration file if this layer is set to a specific configuration.
import torch
import pytorch_nndct as py_nndct
class ResNet(torch.nn.Module):
  def __init__(self):
    super(ResNet, self).__init__()
    self.module_0 = py_nndct.nn.Input() #ResNet::input_0
    self.module_1 = py_nndct.nn.Conv2d(in_channels=3, out_channels=64, kernel_size=[7, 7], stride=[2, 2], padding=[3, 3], dilation=[1, 1], groups=1, bias=True) #ResNet::ResNet/Conv2d[conv1]/input.2
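If the model has many layers, picking names by eye is tedious. The following is a small sketch that lists every layer name, assuming the generated file keeps each name as a trailing comment on the module line, as in the excerpt above:
import re

# Print each layer name recorded as a trailing comment in the generated file.
with open("quantize_result/ResNet.py") as f:
    for line in f:
        match = re.search(r"#(\S+)\s*$", line)
        if match:
            print(match.group(1))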

Configuration Restrictions

Due to design constraints related to DPUs, when using integer quantization and deploying quantized models on the DPU, the quantization configuration should meet the following restrictions:
  • method: diffs or maxmin
  • round_mode: std_round for weights, bias, and input; half_up for activation
  • symmetry: true
  • per_channel: false
  • signed: true
  • narrow_range: true
  • scale_type: poweroftwo
  • calib_statistic_method: modal
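One plausible way to express these restrictions in a configuration file is shown below: the overall settings cover weights, bias, and input, and a tensor-level override switches the activation to half_up. This is a sketch, not a file shipped with the examples:
{
  "target_device": "DPU",
  "quantizable_data_type": ["input", "weights", "bias", "activation"],
  "overall_quantize_config": {
    "datatype": "int",
    "bit_width": 8,
    "method": "diffs",
    "round_mode": "std_round",
    "symmetry": true,
    "per_channel": false,
    "signed": true,
    "narrow_range": true,
    "scale_type": "poweroftwo",
    "calib_statistic_method": "modal"
  },
  "tensor_quantize_config": {
    "activation": {
      "round_mode": "half_up"
    }
  }
}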

For CPU and GPU devices, there are no similar restrictions. However, conflicts can still arise between settings: for instance, the calibration methods maxmin, percentile, mse, and entropy do not support the calibration statistic method modal, and if the symmetry mode is set to asymmetric, the calibration methods mse and entropy are unsupported. If a configuration conflict occurs, the quantization tool reports an error message.