Quantized Sparse Training: A Unified Trainable Framework for Joint Pruning and Quantization in DNNs

Abstract
Deep neural networks typically involve large numbers of parameters and computational operations. Pruning and quantization are widely used to reduce the complexity of deep models, and combining them can yield significantly higher compression ratios. However, their separate optimization processes and the difficulty of choosing hyperparameters hinder their joint application. In this study, we propose a novel compression framework, termed quantized sparse training, that prunes and quantizes networks jointly in a unified training process. We integrate pruning and quantization into a single gradient-based optimization procedure built on the straight-through estimator, which enables us to train, prune, and quantize a network from scratch simultaneously. Empirical results validate the superiority of the proposed methodology over recent state-of-the-art baselines in terms of both model size and accuracy. Specifically, quantized sparse training compresses VGG16 to a 135 KB model without any accuracy degradation, which is 40% of the model size achievable with the state-of-the-art pruning-and-quantization approach.
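To make the mechanism concrete, below is a minimal PyTorch sketch of the core idea the abstract describes: in the forward pass, weights pass through a pruning mask and a uniform quantizer, while the straight-through estimator lets gradients flow back to the full-precision weights, a learnable quantization step size, and a learnable pruning threshold, so all three are optimized in one training loop. The function names (`ste_round`, `ste_mask`, `quantize_sparse`), the LSQ-style learnable step, and the sigmoid surrogate used for the mask gradient are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def ste_round(x):
    # Straight-through estimator: round in the forward pass,
    # identity gradient in the backward pass.
    return x + (torch.round(x) - x).detach()


def ste_mask(w, threshold):
    # Binary pruning mask: 1 where |w| exceeds the threshold, 0 otherwise.
    # A sigmoid surrogate (an assumption here) provides the backward path,
    # so the threshold itself receives gradients.
    hard = (w.abs() > threshold).float()
    soft = torch.sigmoid(w.abs() - threshold)
    return soft + (hard - soft).detach()


def quantize_sparse(w, step, threshold, bits=4):
    # Prune and quantize jointly: mask small weights, then apply a
    # uniform quantizer with a learnable step size (LSQ-style assumption).
    qmax = 2 ** (bits - 1) - 1
    mask = ste_mask(w, threshold)
    w_q = ste_round(torch.clamp(w / step, -qmax - 1, qmax)) * step
    return w_q * mask


# Toy usage: weights, step size, and threshold are trained together
# from scratch in a single gradient-based loop.
w = torch.randn(64, 64, requires_grad=True)
step = torch.tensor(0.05, requires_grad=True)
threshold = torch.tensor(0.01, requires_grad=True)
opt = torch.optim.SGD([w, step, threshold], lr=0.01)

x, y = torch.randn(8, 64), torch.randn(8, 64)
for _ in range(100):
    opt.zero_grad()
    loss = F.mse_loss(x @ quantize_sparse(w, step, threshold), y)
    loss.backward()
    opt.step()
```

Note that a production implementation would additionally constrain the step size to stay positive (e.g., via a softplus reparameterization) and scale its gradient as in learned step size quantization; those details are omitted from this sketch.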