Neural networks are typically over-parameterized with significant redundancy. Pruning is the process of eliminating redundant weights while keeping the accuracy loss as low as possible.
Industry research has led to several techniques that reduce the computational cost of neural network inference. These techniques include:
- Fine-grained pruning
- Coarse-grained pruning
- Neural Architecture Search (NAS)
The simplest form of pruning, fine-grained pruning, removes individual weights and produces sparse weight matrices (i.e., matrices in which many elements are zero). Exploiting this sparsity for acceleration requires specialized hardware and techniques for weight skipping and compression. Xilinx does not currently implement fine-grained pruning.
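As a concrete illustration (a minimal PyTorch sketch of element-wise magnitude pruning, not a Vitis AI API), the snippet below zeroes the smallest-magnitude weights of a layer, leaving a sparse weight matrix. The layer shape and the 80% sparsity target are arbitrary choices for the example.

```python
import torch
import torch.nn as nn

# Fine-grained (element-wise) magnitude pruning of a single layer.
# This only zeroes weights; realizing a speedup would still require
# hardware/runtime support for weight skipping and compression.
layer = nn.Linear(256, 128)
sparsity = 0.8  # illustrative target: zero out 80% of the weights

with torch.no_grad():
    w = layer.weight
    k = int(sparsity * w.numel())
    # Threshold chosen so the k smallest-magnitude weights fall below it.
    threshold = w.abs().flatten().kthvalue(k).values
    mask = (w.abs() > threshold).float()
    w.mul_(mask)  # zero the small weights in place

print(f"sparsity: {(layer.weight == 0).float().mean().item():.2%}")
```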
The Vitis™ AI pruner employs coarse-grained pruning, which eliminates neurons that do not contribute significantly to the accuracy of the network. For convolutional layers, the coarse-grained method prunes entire 3D kernels (output channels) and is therefore also known as channel pruning. Because the pruned model remains dense, inference acceleration can be achieved without specialized hardware. Pruning always reduces the accuracy of the original model; retraining (fine-tuning) adjusts the remaining weights to recover accuracy.
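The following is a simplified sketch of the channel-pruning idea in PyTorch (not the Vitis AI pruner API): the output channels of a convolution are ranked by L1 norm, and the layer is rebuilt with only the strongest channels, yielding a smaller but still dense layer. The layer shapes and the 50% keep ratio are illustrative; in a full network, the downstream layer's input channels and any batch normalization would have to be pruned consistently, which is what a pruning tool handles automatically.

```python
import torch
import torch.nn as nn

# Coarse-grained (channel) pruning of a single Conv2d, shown in isolation.
conv = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1)
keep_ratio = 0.5  # illustrative: keep the 50% of output channels with largest L1 norm

with torch.no_grad():
    # L1 norm of each 3D kernel (one per output channel).
    scores = conv.weight.abs().sum(dim=(1, 2, 3))
    n_keep = int(keep_ratio * conv.out_channels)
    keep = torch.topk(scores, n_keep).indices.sort().values

    # Rebuild a smaller, dense convolution from the retained channels.
    pruned = nn.Conv2d(conv.in_channels, n_keep, kernel_size=3, padding=1)
    pruned.weight.copy_(conv.weight[keep])
    pruned.bias.copy_(conv.bias[keep])

x = torch.randn(1, 64, 32, 32)
print(conv(x).shape, pruned(x).shape)  # (1, 128, 32, 32) vs (1, 64, 32, 32)
```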
Coarse-grained pruning works well on large models built from standard convolutions, such as ResNet and VGGNet. For models based on depthwise convolutions, such as MobileNet-v2, however, the accuracy of the pruned model drops dramatically even at small pruning rates.
In addition to pruning, Vitis AI provides a one-shot neural architecture search (NAS) based approach to reducing the computational cost of inference. This method involves a four-step process:
- Train
- Search
- Prune
- Fine-tune (optional)
In contrast to coarse-grained pruning, one-shot NAS implementations assemble multiple candidate "subnetworks" into a single, over-parameterized graph known as a Supernet. The training algorithm attempts to optimize all candidate networks simultaneously using supervised learning. When training completes, the candidate subnetworks are ranked by computational cost and accuracy, and the developer selects the candidate that best meets their requirements. The one-shot NAS method is effective at compressing models that use both depthwise and conventional convolutions, but it requires a long training time and a higher level of skill on the part of the developer.
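The following is a toy sketch of the one-shot flow (illustrative only, not the Vitis AI NAS tool; the SuperNet class, the candidate widths, the cost proxy, and the random placeholder data are all assumptions made for the example): a single shared convolution can run at several widths, training samples a random width each step so that all candidates are optimized on shared weights, and every candidate is then evaluated and ranked by cost and accuracy.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy supernet: one shared convolution that can run at a width of 8, 16, or
# 32 output channels, with a separate classifier head per candidate width.
class SuperNet(nn.Module):
    widths = (8, 16, 32)

    def __init__(self, num_classes=10):
        super().__init__()
        self.conv = nn.Conv2d(3, max(self.widths), kernel_size=3, padding=1)
        self.heads = nn.ModuleDict(
            {str(w): nn.Linear(w, num_classes) for w in self.widths}
        )

    def forward(self, x, width):
        # A candidate subnetwork uses only the first `width` output channels
        # of the shared convolution (simple weight sharing by slicing).
        out = F.conv2d(x, self.conv.weight[:width], self.conv.bias[:width], padding=1)
        out = F.adaptive_avg_pool2d(F.relu(out), 1).flatten(1)
        return self.heads[str(width)](out)

model = SuperNet()
opt = torch.optim.SGD(model.parameters(), lr=0.05)

# Train: each step optimizes a randomly sampled candidate, so all candidates
# are trained (approximately) simultaneously on the shared weights.
# Random tensors stand in for a real labeled dataset here.
for step in range(200):
    x = torch.randn(32, 3, 16, 16)
    y = torch.randint(0, 10, (32,))
    width = SuperNet.widths[torch.randint(len(SuperNet.widths), (1,)).item()]
    loss = F.cross_entropy(model(x, width), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Search/rank: evaluate every candidate and rank by a simple cost proxy.
results = []
with torch.no_grad():
    x = torch.randn(256, 3, 16, 16)
    y = torch.randint(0, 10, (256,))
    for width in SuperNet.widths:
        acc = (model(x, width).argmax(dim=1) == y).float().mean().item()
        cost = width * 3 * 3 * 3 + width  # conv parameters used by this candidate
        results.append((width, cost, acc))

for width, cost, acc in sorted(results, key=lambda r: r[1]):
    print(f"width={width:2d}  conv params={cost:4d}  accuracy={acc:.2f}")
```

From a ranking such as this, the developer would pick the candidate with the best cost/accuracy trade-off and optionally fine-tune it as a standalone model.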