The following are some tips for getting better training results:
- Load the pre-trained floating-point weights as initial values when starting quantization aware training whenever possible. Training from scratch with random initial values is possible, but it makes training more difficult and slower to converge (see the sketch below).
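A minimal sketch of this initialization, assuming a torchvision ResNet-18 and a hypothetical checkpoint path (both are illustrative, not from the original):

```python
import torch
import torchvision

# Build the floating-point model and load pre-trained weights as the
# starting point for quantization aware training.
float_model = torchvision.models.resnet18()
state_dict = torch.load("resnet18_float.pth", map_location="cpu")  # hypothetical checkpoint
float_model.load_state_dict(state_dict)
# float_model is then handed to the QAT processor (see the next example).
```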
- If pre-trained floating-point weights are loaded, use different initial learning rates and learning rate decay strategies for the network parameters and the quantizer parameters. In general, set a small learning rate for the network parameters and a larger one for the quantizer parameters, as in the following example:
```python
import torch

# Get the trainable quantization-aware model from the QAT processor.
model = qat_processor.trainable_model()

# Small learning rate for the network weights, larger one for the quantizer parameters.
param_groups = [
    {'params': model.quantizer_parameters(), 'lr': 1e-2, 'name': 'quantizer'},
    {'params': model.non_quantizer_parameters(), 'lr': 1e-5, 'name': 'weight'},
]
optimizer = torch.optim.Adam(param_groups)
```
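For the separate learning rate decay strategies mentioned above, here is a minimal sketch of per-group decay using `torch.optim.lr_scheduler.LambdaLR` with the `optimizer` from the previous example. The step-wise policy, decay factors, and interval are illustrative assumptions, not from the original:

```python
from torch.optim.lr_scheduler import LambdaLR

# One decay function per parameter group, in the same order as param_groups.
# Illustrative policy: decay the quantizer learning rate faster than the weights'.
scheduler = LambdaLR(optimizer, lr_lambda=[
    lambda epoch: 0.5 ** (epoch // 10),  # 'quantizer' group
    lambda epoch: 0.9 ** (epoch // 10),  # 'weight' group
])
# Call scheduler.step() once per epoch, after optimizer.step().
```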
- For the choice of optimizer, avoid torch.optim.SGD, as it may prevent the training from converging. Use torch.optim.Adam or torch.optim.RMSprop, or their variants, instead.
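Swapping in RMSprop only changes the optimizer construction; this sketch reuses the `param_groups` defined earlier (the momentum value is an illustrative assumption):

```python
# Alternative to Adam; the per-group learning rates from param_groups still apply.
optimizer = torch.optim.RMSprop(param_groups, momentum=0.9)
```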