Generally, there is a small accuracy loss after quantization, but for some
networks, such as MobileNets, the loss can be large. Fast finetuning uses the
AdaQuant algorithm to adjust the weights and quantization parameters layer by layer with the
unlabeled calibration dataset, which improves accuracy for some models. It takes longer than
normal PTQ, but much less time than QAT, because the calib_dataset
is smaller than the training dataset. Fast finetuning is
disabled by default. It can be enabled to improve accuracy if you encounter
accuracy issues. A recommended workflow is to first try PTQ without fast finetuning and
then try quantization with fast finetuning if the accuracy is not acceptable. QAT is
another way to improve accuracy, but it takes more time and requires the training
dataset. You can activate fast finetuning by setting include_fast_ft=True
during post-training quantization.
quantized_model = quantizer.quantize_model(
    calib_dataset=calib_dataset,
    calib_step=None,
    calib_batch_size=None,
    include_fast_ft=True,
    fast_ft_epochs=10)
Here:
- include_fast_ft indicates whether to do fast finetuning or not.
- fast_ft_epochs indicates the number of finetuning epochs for each layer.
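As a sketch of the recommended workflow, the snippet below runs plain PTQ first and re-quantizes with fast finetuning only if the measured accuracy falls short. The VitisQuantizer import path and the evaluate_accuracy helper are assumptions used for illustration; substitute the quantizer construction and evaluation code from your own flow.

# Sketch only: plain PTQ first, then fast finetuning if accuracy is not acceptable.
# The import path and evaluate_accuracy() are illustrative assumptions.
from tensorflow_model_optimization.quantization.keras import vitis_quantize

quantizer = vitis_quantize.VitisQuantizer(float_model)

# Step 1: normal post-training quantization without fast finetuning.
quantized_model = quantizer.quantize_model(calib_dataset=calib_dataset)

# Step 2: if accuracy is not acceptable, re-quantize with fast finetuning enabled.
if evaluate_accuracy(quantized_model) < target_accuracy:
    quantized_model = quantizer.quantize_model(
        calib_dataset=calib_dataset,
        include_fast_ft=True,
        fast_ft_epochs=10)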