The pruning algorithm is designed to reduce the number of model parameters while minimizing accuracy loss. The process is iterative, as shown in the following figure. Pruning causes accuracy loss, and fine-tuning the remaining weights through training recovers it. The input to the first iteration is a trained, unpruned model, referred to as the baseline model. This model is pruned and then fine-tuned. The fine-tuned model from each iteration becomes the baseline for the next, where it is again pruned and fine-tuned. The process repeats until the desired sparsity is reached.
The iterative approach is required because a model cannot be pruned at a high pruning ratio in a single pass while maintaining accuracy. When too many parameters are removed in one iteration, the accuracy loss can be too steep for fine-tuning to recover.
With iterative pruning, higher pruning rates can be achieved without significant loss of model performance.
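To make the idea of reaching a sparsity target over several iterations concrete, the following minimal sketch splits an overall target into evenly spaced per-iteration targets. The function names and the even spacing are assumptions made for illustration; they are not part of any pruning tool's API, and a real schedule may be chosen differently.

```python
def sparsity_schedule(target, steps):
    """Return cumulative sparsity targets, one per iteration, evenly spaced.

    Hypothetical helper: `target` is the final fraction of weights removed
    (between 0 and 1), `steps` is the number of prune/fine-tune iterations.
    """
    if not 0.0 < target < 1.0:
        raise ValueError("target must be between 0 and 1 (exclusive)")
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return [target * (i + 1) / steps for i in range(steps)]


def per_step_fraction(schedule):
    """Fraction of the weights still present that each iteration removes."""
    fractions = []
    previous = 0.0
    for cumulative in schedule:
        fractions.append((cumulative - previous) / (1.0 - previous))
        previous = cumulative
    return fractions


schedule = sparsity_schedule(0.6, 3)      # cumulative targets per iteration
fractions = per_step_fraction(schedule)   # share of remaining weights removed
for i, (s, f) in enumerate(zip(schedule, fractions), start=1):
    print(f"iteration {i}: cumulative sparsity {s:.2f}, removes {f:.2f} of remaining weights")
```

Each iteration in this sketch removes only a modest share of the weights that remain, which is the property that gives fine-tuning a chance to recover accuracy before the next pruning step.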
The four primary stages in iterative pruning are as follows:
- Analysis
  - Perform a sensitivity analysis on the model to determine the optimal pruning strategy.
- Pruning
  - Reduce the number of computations in the input model.
- Fine-tuning
  - Retrain the pruned model to recover accuracy.
- Transformation
  - Generate a dense model with fewer weights.
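The four stages can be sketched on a toy layer as follows. This is an illustrative assumption, not the actual algorithm or API of any pruning tool: the layer is a plain list of channels, channels are scored by their L1 norm as a simple stand-in for a sensitivity analysis, and fine-tuning is a placeholder because no training data is involved. All function names are hypothetical.

```python
def channel_norms(layer):
    """Analysis (assumed proxy): score each channel by its L1 norm."""
    return [sum(abs(w) for w in channel) for channel in layer]


def prune_mask(layer, ratio):
    """Pruning: choose the lowest-scoring channels. True means keep."""
    norms = channel_norms(layer)
    n_prune = int(len(layer) * ratio)
    order = sorted(range(len(layer)), key=lambda i: norms[i])
    pruned = set(order[:n_prune])
    return [i not in pruned for i in range(len(layer))]


def apply_mask(layer, mask):
    """Zero out pruned channels; the layer keeps its original shape."""
    return [ch if keep else [0.0] * len(ch) for ch, keep in zip(layer, mask)]


def fine_tune(layer, mask):
    """Fine-tuning placeholder: a real run would retrain the kept weights."""
    return layer


def transform(layer, mask):
    """Transformation: drop pruned channels to get a smaller dense layer."""
    return [ch for ch, keep in zip(layer, mask) if keep]


# Toy layer with four channels of two weights each (made-up values).
layer = [[0.9, -0.8], [0.01, 0.02], [0.5, 0.4], [-0.03, 0.0]]

mask = prune_mask(layer, ratio=0.5)                 # analysis + pruning decision
sparse = fine_tune(apply_mask(layer, mask), mask)   # pruned, then "fine-tuned"
dense = transform(sparse, mask)                     # smaller dense layer

print(len(layer), "channels before,", len(dense), "channels after")
```

In this sketch the pruned model still has its original shape, with zeroed channels, until the transformation step removes them, which is why the final stage is what actually yields a dense model with fewer weights.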