In the original WeGO workflow, WeGO accepts only a quantized INT8 model as input, so a separate quantization step must be performed first by explicitly running the Vitis AI quantizer, which converts the float32 model into an INT8 model. This imposes extra work on users, such as switching conda environments between the quantizer and WeGO and understanding how the Vitis AI quantizer relates to WeGO. To improve ease of use and smooth the entire path from quantization to deployment, WeGO integrates the Vitis AI quantizer into its flow, enabling on-the-fly quantization when a float32 model is supplied as WeGO's input. Besides the original WeGO compilation API, a new quantization API is introduced in WeGO, and the quantizer details remain transparent to end users (a sketch of the resulting flow follows the list below). The quantization integration in WeGO is at an early stage and has the following limitations:
- The integration flow currently supports only PTQ (Post-Training Quantization). If the resulting model's accuracy is significantly lower than expected, fine-tuning or QAT (Quantization-Aware Training) must be used to improve accuracy by following the native Vitis AI quantization flow.
- Only CPUs are used for quantization in WeGO; GPUs are not currently supported. This may cause issues when quantizing large models, as the process can be time-consuming.
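
To make the end-to-end flow concrete, the following is a minimal sketch of how the integrated quantize-then-compile sequence might look from the user's side with the PyTorch front end. The quantization entry point `wego_torch.quantize` and the one-argument `wego_torch.compile` call shown here, along with their signatures, are illustrative assumptions rather than the authoritative WeGO API; refer to the WeGO examples shipped with your Vitis AI release for the exact calls and options.

```python
import torch
import torch.nn as nn
import wego_torch  # WeGO front end for PyTorch; API names below are illustrative

# A small float32 model standing in for a real pretrained network.
float_model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 10),
).eval()

# A stand-in calibration batch for post-training quantization (PTQ).
calib_batch = torch.randn(8, 3, 224, 224)

# Hypothetical quantization entry point: WeGO would run the Vitis AI quantizer
# internally (on CPU, per the limitation above), so the user neither switches
# conda environments nor invokes the quantizer directly.
quantized_model = wego_torch.quantize(float_model, calib_batch)

# The original compilation API then takes the INT8 result, as before.
wego_model = wego_torch.compile(quantized_model)
print(wego_model(calib_batch[:1]).shape)
```

The point of the integration is visible in the shape of this sketch: the float32-to-INT8 conversion becomes a single in-flow call next to compilation, rather than a separate quantizer invocation in a different conda environment.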