Abstract
The growing energy and performance costs of deep learning have driven the community to reduce the size of neural networks by selectively pruning components. Similarly to their biological counterparts, sparse networks generalize just as well as, and sometimes even better than, the original dense networks. Sparsity promises to reduce the memory footprint of regular networks to fit mobile devices, as well as shorten training time for ever-growing networks. In this paper, we survey prior work on sparsity in deep learning and provide an extensive tutorial of sparsification for both inference and training. We describe approaches to remove and add elements of neural networks, different training strategies to achieve model sparsity, and mechanisms to exploit sparsity in practice. Our work distills ideas from more than 300 research papers and provides guidance to practitioners who wish to utilize sparsity today, as well as to researchers whose goal is to push the frontier forward. We include the necessary background on mathematical methods in sparsification, describe phenomena such as early structure adaptation, the intricate relations between sparsity and the training process, and show techniques for achieving acceleration on real hardware. We also define a metric of pruned parameter efficiency that could serve as a baseline for comparison of different sparse networks. We close by speculating on how sparsity can improve future workloads and outline major open problems in the field.
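To make the idea of "selectively pruning components" concrete, here is a minimal magnitude-pruning sketch: the simplest sparsification baseline, which zeroes the smallest-magnitude weights of a tensor until a target sparsity is reached. The function name and NumPy implementation are our own illustration, not code from the survey.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Return a copy of `weights` with the smallest-magnitude entries
    zeroed so that at least `sparsity` (fraction of zeros) is reached."""
    k = int(sparsity * weights.size)  # number of entries to zero
    if k == 0:
        return weights.copy()
    magnitudes = np.abs(weights).ravel()
    # k-th smallest magnitude becomes the pruning threshold
    threshold = np.partition(magnitudes, k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))
W_sparse = magnitude_prune(W, 0.75)
print(f"sparsity: {np.mean(W_sparse == 0.0):.2f}")  # prints "sparsity: 0.75"
```

In practice this one-shot step is usually interleaved with retraining (iterative pruning), since removing weights without fine-tuning typically costs accuracy.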
- When available, we include an arXiv reference to promote open access.
- We maintain a public repository with the full bibliography of this paper to the benefit of the community at https://github.com/spcl/sparsity-in-deep-learning.
- Alessandro Achille, Matteo Rovere, and Stefano Soatto. 2019. Critical Learning Periods in Deep Neural Networks. In International Conference on Learning Representations (ICLR). arXiv:cs.LG/1711.08856
- Sher Afghan and Uwe Naumann. 2020. Interval Adjoint Significance Analysis for Neural Networks. In International Conference on Computational Science. 365-378.
- Alireza Aghasi, Afshin Abdi, Nam Nguyen, and Justin Romberg. 2017. Net-Trim: Convex Pruning of Deep Neural Networks with Performance Guarantee. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/1611.05162
- Subutai Ahmad and Luiz Scheinkman. 2019. How Can We Be So Dense? The Benefits of Using Highly Sparse Representations. (2019). arXiv:cs.LG/1903.11257
- Alham Fikri Aji and Kenneth Heafield. 2017. Sparse Communication for Distributed Gradient Descent. In Conference on Empirical Methods in Natural Language Processing (EMNLP). arXiv:cs.CL/1704.05021
- Jorge Albericio, Patrick Judd, Tayler Hetherington, Tor Aamodt, Natalie Enright Jerger, and Andreas Moshovos. 2016. Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing. In International Symposium on Computer Architecture (ISCA).
- Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. 2017. QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/1610.02132
- Dan Alistarh, Torsten Hoefler, Mikael Johansson, Nikola Konstantinov, Sarit Khirirat, and Cédric Renggli. 2018. The convergence of sparsified gradient methods. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/1809.10505
- Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. 2019. A Convergence Theory for Deep Learning via Over-Parameterization. In International Conference on Machine Learning (ICML). arXiv:cs.LG/1811.03962
- Amjad Almahairi, Nicolas Ballas, Tim Cooijmans, Yin Zheng, Hugo Larochelle, and Aaron Courville. 2016. Dynamic Capacity Networks. In International Conference on Machine Learning (ICML). arXiv:cs.LG/1511.07838
- Jose M. Alvarez and Mathieu Salzmann. 2017. Compression-aware Training of Deep Networks. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.CV/1711.02638
- Manoj Alwani, Han Chen, Michael Ferdman, and Peter Milder. 2016. Fused-layer CNN accelerators. In International Symposium on Microarchitecture (MICRO).
- Shun-ichi Amari. 1998. Natural Gradient Works Efficiently in Learning. Neural Computation 10, 2 (1998), 251-276.
- Sajid Anwar, Kyuyeon Hwang, and Wonyong Sung. 2017. Structured pruning of deep convolutional neural networks. ACM Journal on Emerging Technologies in Computing Systems (JETC) 13, 3 (2017), 1-18.
- Zahra Atashgahi, Ghada Sokar, Tim van der Lee, Elena Mocanu, Decebal Constantin Mocanu, Raymond Veldhuis, and Mykola Pechenizkiy. 2020. Quick and Robust Feature Selection: the Strength of Energy-efficient Sparse Training for Autoencoders. (2020). arXiv:cs.LG/2012.00560
- Kambiz Azarian, Yash Bhalgat, Jinwon Lee, and Tijmen Blankevoort. 2020. Learned Threshold Pruning. (2020). arXiv:cs.LG/2003.00075
- Jimmy Ba, Roger Grosse, and James Martens. 2016a. Distributed second-order optimization using Kronecker-factored approximations. In International Conference on Learning Representations (ICLR).
- Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016b. Layer normalization. (2016). arXiv:cs.LG/1607.06450
- Pierre Baldi and Peter J Sadowski. 2013. Understanding Dropout. In Advances in Neural Information Processing Systems (NeurIPS). https://papers.nips.cc/paper/2013/hash/71f6278d140af599e06ad9bf1ba03cb0-Abstract.html
- Brian R. Bartoldson, Ari S. Morcos, Adrian Barbu, and Gordon Erlebacher. 2020. The Generalization-Stability Tradeoff in Neural Network Pruning. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/1906.03728
- Debraj Basu, Deepesh Data, Can Karakus, and Suhas N Diggavi. 2020. Qsparse-local-SGD: Distributed SGD with quantization, sparsification, and local computations. IEEE Journal on Selected Areas in Information Theory 1, 1 (2020), 217-226. arXiv:stat.ML/1906.02367
- Cenk Baykal, Lucas Liebenwein, Igor Gilitschenski, Dan Feldman, and Daniela Rus. 2019. Data-dependent coresets for compressing neural networks with applications to generalization bounds. In International Conference on Learning Representations (ICLR). arXiv:cs.LG/1804.05345
- Amir Beck and Marc Teboulle. 2009. A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM J. Img. Sci. 2, 1 (March 2009), 183-202.
- Guillaume Bellec, David Kappel, Wolfgang Maass, and Robert Legenstein. 2018. Deep Rewiring: Training very sparse deep networks. In International Conference on Learning Representations (ICLR). arXiv:cs.NE/1711.05136
- Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The Long-Document Transformer. (2020). arXiv:cs.CL/2004.05150
- Tal Ben-Nun, Maciej Besta, Simon Huber, Alexandros Nikolaos Ziogas, Daniel Peter, and Torsten Hoefler. 2019. A Modular Benchmarking Infrastructure for High-Performance and Reproducible Deep Learning. In International Parallel and Distributed Processing Symposium (IPDPS). arXiv:cs.DC/1901.10183
- Tal Ben-Nun and Torsten Hoefler. 2018. Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis. ACM Computing Surveys (CSUR) 52, 4 (2018), 1-43. arXiv:cs.LG/1802.09941
- Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, and Doina Precup. 2016. Conditional Computation in Neural Networks for faster models. (2016). arXiv:cs.LG/1511.06297
- Yoshua Bengio, Nicholas Léonard, and Aaron Courville. 2013. Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation. (2013). arXiv:cs.LG/1308.3432
- Richard F Betzel, John D Medaglia, Lia Papadopoulos, Graham L Baum, Ruben Gur, Raquel Gur, David Roalf, Theodore D Satterthwaite, and Danielle S Bassett. 2017. The modular organization of human anatomical brain networks: Accounting for the cost of wiring. Network Neuroscience 1, 1 (2017), 42-68. arXiv:q-bio.NC/1608.01161
- Simone Bianco, Remi Cadene, Luigi Celona, and Paolo Napoletano. 2018. Benchmark Analysis of Representative Deep Neural Network Architectures. IEEE Access 6 (2018), 64270-64277. arXiv:cs.CV/1810.00736
- Davis Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, and John Guttag. 2020. What is the state of neural network pruning?. In Machine Learning and Systems (MLSys). arXiv:cs.LG/2003.03033
- Alfred Bourely, John Patrick Boueri, and Krzysztof Choromanski. 2017. Sparse Neural Networks Topologies. (2017). arXiv:cs.LG/1706.05683
- Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.CL/2005.14165
- Alon Brutzkus, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. 2018. SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data. In International Conference on Learning Representations (ICLR). arXiv:cs.LG/1710.10174
- P. Burrascano. 1993. A pruning technique maximizing generalization. In International Conference on Neural Networks.
- Miguel Á. Carreira-Perpinan and Yerlan Idelbayev. 2018. "Learning-Compression" Algorithms for Neural Net Pruning. In Conference on Computer Vision and Pattern Recognition (CVPR).
- Giovanna Castellano and Anna Maria Fanelli. 2000. Variable selection using neural-network models. Neurocomputing 31, 1-4 (2000), 1-13.
- Giovanna Castellano, Anna Maria Fanelli, and Marcello Pelillo. 1997. An iterative pruning algorithm for feedforward neural networks. IEEE Transactions on Neural Networks 8, 3 (1997), 519-531.
- Hema Chandrasekaran, Hung-Han Chen, and Michael T. Manry. 2000. Pruning of basis functions in nonlinear approximators. Neurocomputing 34, 1 (2000), 29-53.
- Soravit Changpinyo, Mark Sandler, and Andrey Zhmoginov. 2017. The Power of Sparsity in Convolutional Neural Networks. (2017). arXiv:cs.CV/1702.06257
- Shih-Kang Chao, Zhanyu Wang, Yue Xing, and Guang Cheng. 2020. Directional Pruning of Deep Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/2006.09358
- Yves Chauvin. 1989. A Back-Propagation Algorithm with Optimal Use of Hidden Units. In Advances in Neural Information Processing Systems (NeurIPS). https://proceedings.neurips.cc/paper/1988/hash/9fc3d7152ba9336a670e36d0ed79bc43-Abstract.html
- Kumar Chellapilla, Sidd Puri, and Patrice Simard. 2006. High performance convolutional neural networks for document processing. In Tenth International Workshop on Frontiers in Handwriting Recognition.
- Chia-Yu Chen, Jungwook Choi, Daniel Brand, Ankur Agrawal, Wei Zhang, and Kailash Gopalakrishnan. 2017. AdaComp: Adaptive residual gradient compression for data-parallel distributed training. In AAAI Conference on Artificial Intelligence (AAAI). arXiv:cs.LG/1712.02679
- Jianda Chen, Shangyu Chen, and Sinno Jialin Pan. 2020. Storage Efficient and Dynamic Flexible Runtime Channel Pruning via Deep Reinforcement Learning. In Advances in Neural Information Processing Systems (NeurIPS). https://proceedings.neurips.cc/paper/2020/hash/a914ecef9c12ffdb9bede64bb703d877-Abstract.html
- Tianlong Chen, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang, Zhangyang Wang, and Michael Carbin. 2020. The Lottery Ticket Hypothesis for Pretrained BERT Networks. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/2007.12223
- Yu-Hsin Chen, Tushar Krishna, Joel S. Emer, and Vivienne Sze. 2017. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks. IEEE Journal of Solid-State Circuits 52, 1 (2017), 127-138.
- Yu-Hsin Chen, Tien-Ju Yang, Joel Emer, and Vivienne Sze. 2019. Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 9, 2 (2019), 292-308. arXiv:cs.DC/1807.07928
- Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. 2020. A Survey of Model Compression and Acceleration for Deep Neural Networks. (2020). arXiv:cs.LG/1710.09282
- Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cuDNN: Efficient primitives for deep learning. (2014). arXiv:cs.NE/1410.0759
- Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating Long Sequences with Sparse Transformers. (2019). arXiv:cs.LG/1904.10509
- Minsu Cho, Ameya Joshi, and Chinmay Hegde. 2020. ESPN: Extremely Sparse Pruned Networks. (2020). arXiv:cs.LG/2006.15741
- Tejalal Choudhary, Vipul Mishra, Anurag Goswami, and Jagannathan Sarangapani. 2020. A comprehensive survey on model compression and acceleration. Artificial Intelligence Review (2020), 1-43.
- Tautvydas Cibas, Françoise Fogelman Soulié, Patrick Gallinari, and Sarunas Raudys. 1996. Variable selection with neural networks. Neurocomputing 12, 2 (1996), 223-248.
- Joseph Paul Cohen, Henry Z. Lo, and Wei Ding. 2017. RandomOut: Using a convolutional gradient norm to rescue convolutional filters. (2017). arXiv:cs.CV/1602.05931
- Maxwell D. Collins and Pushmeet Kohli. 2014. Memory Bounded Deep Convolutional Networks. (2014). arXiv:cs.CV/1412.1442
- Gonçalo M Correia, Vlad Niculae, and André FT Martins. 2019. Adaptively sparse transformers. In Conference on Empirical Methods in Natural Language Processing and the International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). arXiv:cs.CL/1909.00015
- Justin Cosentino, Federico Zaiter, Dan Pei, and Jun Zhu. 2019. The Search for Sparse, Robust Neural Networks. In NeurIPS Safety and Robustness in Decision Making Workshop. arXiv:cs.LG/1912.02386
- Baiyun Cui, Yingming Li, Ming Chen, and Zhongfei Zhang. 2019. Fine-tune BERT with Sparse Self-Attention Mechanism. In Conference on Empirical Methods in Natural Language Processing and the International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
- Bin Dai, Chen Zhu, and David Wipf. 2018b. Compressing Neural Networks using the Variational Information Bottleneck. In International Conference on Machine Learning (ICML). arXiv:cs.CV/1802.10399
- Xiaoliang Dai, Hongxu Yin, and Niraj K. Jha. 2018a. NeST: A Neural Network Synthesis Tool Based on a Grow-and-Prune Paradigm. IEEE Trans. Comput. 68, 10 (2018), 1487-1497. arXiv:cs.NE/1711.02017
- Stéphane d'Ascoli, Levent Sagun, Joan Bruna, and Giulio Biroli. 2020. Finding the Needle in the Haystack with Convolutions: on the benefits of architectural bias. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/1906.06766
- Shail Dave, Riyadh Baghdadi, Tony Nowatzki, Sasikanth Avancha, Aviral Shrivastava, and Baoxin Li. 2020. Hardware Acceleration of Sparse and Irregular Tensor Computations of ML Models: A Survey and Insights. (2020). arXiv:cs.AR/2007.00864
- Peter Davies, Vijaykrishna Gurunathan, Niusha Moshrefi, Saleh Ashkboos, and Dan Alistarh. 2021. New Bounds For Distributed Mean Estimation and Variance Reduction. In International Conference on Learning Representations (ICLR). arXiv:cs.LG/2002.09268
- Pau de Jorge, Amartya Sanyal, Harkirat S. Behl, Philip H. S. Torr, Gregory Rogez, and Puneet K. Dokania. 2021. Progressive Skeletonization: Trimming more fat from a network at initialization. In International Conference on Learning Representations (ICLR). arXiv:cs.CV/2006.09081
- Luisa De Vivo, Michele Bellesi, William Marshall, Eric A Bushong, Mark H Ellisman, Giulio Tononi, and Chiara Cirelli. 2017. Ultrastructural evidence for synaptic scaling across the wake/sleep cycle. Science 355, 6324 (2017), 507-510.
- Lei Deng, Guoqi Li, Song Han, Luping Shi, and Yuan Xie. 2020. Model Compression and Hardware Acceleration for Neural Networks: A Comprehensive Survey. Proc. IEEE 108, 4 (2020), 485-532.
- Misha Denil, Babak Shakibi, Laurent Dinh, Marc'Aurelio Ranzato, and Nando de Freitas. 2013. Predicting Parameters in Deep Learning. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/1306.0543
- Tim Dettmers and Luke Zettlemoyer. 2019. Sparse Networks from Scratch: Faster Training without Losing Performance. (2019). arXiv:cs.LG/1907.04840
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL). arXiv:cs.CL/1810.04805
- Sourya Dey, Kuan-Wen Huang, Peter A. Beerel, and Keith M. Chugg. 2019. Pre-Defined Sparse Neural Networks With Hardware Acceleration. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 9, 2 (2019), 332-345. arXiv:cs.LG/1812.01164
- Graham H Diering, Raja S Nirujogi, Richard H Roth, Paul F Worley, Akhilesh Pandey, and Richard L Huganir. 2017. Homer1a drives homeostatic scaling-down of excitatory synapses during sleep. Science 355, 6324 (2017), 511-515.
- Xiaohan Ding, Guiguang Ding, Yuchen Guo, and Jungong Han. 2019a. Centripetal SGD for Pruning Very Deep Convolutional Networks with Complicated Structure. In Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:cs.LG/1904.03837
- Xiaohan Ding, Guiguang Ding, Xiangxin Zhou, Yuchen Guo, Jungong Han, and Ji Liu. 2019b. Global Sparse Momentum SGD for Pruning Very Deep Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/1909.12778
- William B Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).
- Pedro Domingos. 2020. Every Model Learned by Gradient Descent Is Approximately a Kernel Machine. (2020). arXiv:cs.LG/2012.00152
- Xin Dong, Shangyu Chen, and Sinno Jialin Pan. 2017. Learning to Prune Deep Neural Networks via Layer-wise Optimal Brain Surgeon. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.NE/1705.07565
- Xiao Dong, Lei Liu, Guangli Li, Jiansong Li, Peng Zhao, Xueying Wang, and Xiaobing Feng. 2019. Exploiting the input sparsity to accelerate deep neural networks: poster. In Symposium on Principles and Practice of Parallel Programming (PPoPP).
- Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2021. An image is worth 16×16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR). arXiv:cs.CV/2010.11929
- Nikoli Dryden, Tim Moon, Sam Ade Jacobs, and Brian Van Essen. 2016. Communication quantization for data-parallel training of deep neural networks. In Workshop on Machine Learning in HPC Environments (MLHPC).
- Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. 2019. Gradient Descent Provably Optimizes Over-parameterized Neural Networks. In International Conference on Learning Representations (ICLR). arXiv:cs.LG/1810.02054
- Aritra Dutta, El Houcine Bergou, Ahmed M Abdelmoniem, Chen-Yu Ho, Atal Narayan Sahu, Marco Canini, and Panos Kalnis. 2020. On the discrepancy between the theoretical analysis and practical implementations of compressed communication for distributed deep learning. In AAAI Conference on Artificial Intelligence (AAAI). arXiv:cs.DC/1911.08250
- Erich Elsen, Marat Dukhan, Trevor Gale, and Karen Simonyan. 2020. Fast Sparse ConvNets. In Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:cs.CV/1911.09723
- Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. 2019. Neural Architecture Search: A Survey. Journal of Machine Learning Research 20, 55 (2019), 1-21. arXiv:stat.ML/1808.05377
- Andries P. Engelbrecht. 2001. A new pruning heuristic based on variance analysis of sensitivity information. IEEE Transactions on Neural Networks 12, 6 (2001), 1386-1399.
- Andries Petrus Engelbrecht and Ian Cloete. 1996. A sensitivity analysis algorithm for pruning feedforward neural networks. In International Conference on Neural Networks.
- Andries Petrus Engelbrecht, Ian Cloete, and Jacek M Zurada. 1995. Determining the significance of input parameters using sensitivity analysis. In International Workshop on Artificial Neural Networks.
- Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, and Erich Elsen. 2020a. Rigging the Lottery: Making All Tickets Winners. In International Conference on Machine Learning (ICML). arXiv:cs.LG/1911.11134
- Utku Evci, Yani A. Ioannou, Cem Keskin, and Yann Dauphin. 2020b. Gradient Flow in Sparse Neural Networks and How Lottery Tickets Win. (2020). arXiv:cs.LG/2010.03533
- Angela Fan, Edouard Grave, and Armand Joulin. 2020. Reducing transformer depth on demand with structured dropout. In International Conference on Learning Representations (ICLR). arXiv:cs.LG/1909.11556
- William Fedus, Barret Zoph, and Noam Shazeer. 2021. Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity. (2021). arXiv:cs.LG/2101.03961
- William Finnoff, Ferdinand Hergert, and Hans Georg Zimmermann. 1993. Improving model selection by nonconvergent methods. Neural Networks 6, 6 (1993), 771-783.
- L. Fletcher, V. Katkovnik, F. E. Steffens, and A. P. Engelbrecht. 1998. Optimizing the number of hidden nodes of a feedforward artificial neural network. In International Joint Conference on Neural Networks (IJCNN).
- Jonathan Frankle and Michael Carbin. 2019. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. In International Conference on Learning Representations (ICLR). arXiv:cs.LG/1803.03635
- Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, and Michael Carbin. 2020a. Linear Mode Connectivity and the Lottery Ticket Hypothesis. In International Conference on Machine Learning (ICML). arXiv:cs.LG/1912.05671
- Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, and Michael Carbin. 2020b. Stabilizing the Lottery Ticket Hypothesis. (2020). arXiv:cs.LG/1903.01611
- Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, and Michael Carbin. 2021. Pruning Neural Networks at Initialization: Why are We Missing the Mark?. In International Conference on Learning Representations (ICLR). arXiv:cs.LG/2009.08576
- Jonathan Frankle, David J. Schwab, and Ari S. Morcos. 2020. The Early Phase of Neural Network Training. In International Conference on Learning Representations (ICLR). arXiv:cs.LG/2002.10365
- Jerome Friedman, Trevor Hastie, and Robert Tibshirani. 2010. A note on the group lasso and a sparse group lasso. (2010). arXiv:math.ST/1001.0736
- Karl J. Friston. 2008. Hierarchical Models in the Brain. PLOS Computational Biology 4, 11 (2008), e1000211.
- Adam Gaier and David Ha. 2019. Weight Agnostic Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/1906.04358
- Yarin Gal and Zoubin Ghahramani. 2016. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In International Conference on Machine Learning (ICML). arXiv:stat.ML/1506.02142
- Yarin Gal, Jiri Hron, and Alex Kendall. 2017. Concrete Dropout. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 30. arXiv:stat.ML/1705.07832
- Trevor Gale, Erich Elsen, and Sara Hooker. 2019. The State of Sparsity in Deep Neural Networks. (2019). arXiv:cs.LG/1902.09574
- Trevor Gale, Matei Zaharia, Cliff Young, and Erich Elsen. 2020. Sparse GPU Kernels for Deep Learning. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC). arXiv:cs.LG/2006.10901
- Prakhar Ganesh, Yao Chen, Xin Lou, Mohammad Ali Khan, Yin Yang, Deming Chen, Marianne Winslett, Hassan Sajjad, and Preslav Nakov. 2020. Compressing large-scale transformer-based models: A case study on BERT. (2020). arXiv:cs.LG/2002.11985
- Dongdong Ge, Xiaoye Jiang, and Yinyu Ye. 2011. A note on the complexity of Lp minimization. Mathematical Programming 129, 2 (2011), 285-299.
- Georgios Georgiadis. 2019. Accelerating Convolutional Neural Networks via Activation Map Compression. In Conference on Computer Vision and Pattern Recognition (CVPR).
- Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. 2018. DropBlock: A regularization method for convolutional networks. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.CV/1810.12890
- Joydeep Ghosh and Kagan Tumer. 1994. Structural Adaptation and Generalization in Supervised Feed-Forward Networks. J. Artif. Neural Netw. 1, 4 (Nov. 1994), 431-458.
- Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics (AISTATS). http://proceedings.mlr.press/v9/glorot10a.html
- Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Deep sparse rectifier neural networks. In International Conference on Artificial Intelligence and Statistics (AISTATS). http://proceedings.mlr.press/v15/glorot11a.html
- Maximilian Golub, Guy Lemieux, and Mieszko Lis. 2019. Full deep neural network training on a pruned weight budget. In Machine Learning and Systems (MLSys). arXiv:cs.LG/1806.06949
- Aidan N. Gomez, Ivan Zhang, Siddhartha Rao Kamalakara, Divyam Madaan, Kevin Swersky, Yarin Gal, and Geoffrey E. Hinton. 2019. Learning Sparse Networks Using Targeted Dropout. (2019). arXiv:cs.LG/1905.13678
- Ashish Gondimalla, Noah Chesnut, Mithuna Thottethodi, and T. N. Vijaykumar. 2019. SparTen: A Sparse Tensor Accelerator for Convolutional Neural Networks. In International Symposium on Microarchitecture (MICRO).
- Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:stat.ML/1406.2661
- Soorya Gopalakrishnan, Zhinus Marzi, Upamanyu Madhow, and Ramtin Pedarsani. 2018. Combating Adversarial Attacks Using Sparse Representations. In International Conference on Learning Representations Workshop. arXiv:stat.ML/1803.03880
- Ariel Gordon, Elad Eban, Ofir Nachum, Bo Chen, Hao Wu, Tien-Ju Yang, and Edward Choi. 2018. MorphNet: Fast & simple resource-constrained structure learning of deep networks. In Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:cs.LG/1711.06798
- Mitchell A. Gordon, Kevin Duh, and Nicholas Andrews. 2020. Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning. In Proceedings of the 5th Workshop on Representation Learning for NLP. 143-155. arXiv:cs.CL/2002.08307
- Peter Grönquist, Chengyuan Yao, Tal Ben-Nun, Nikoli Dryden, Peter Dueben, Shigang Li, and Torsten Hoefler. 2020. Deep Learning for Post-Processing Ensemble Weather Forecasts. Philosophical Transactions of the Royal Society A 379, 2194 (2020), 20200092. arXiv:cs.LG/2005.08748
- William Gropp, Torsten Hoefler, Rajeev Thakur, and E. Lusk. 2014. Using Advanced MPI: Modern Features of the Message-Passing Interface. MIT Press.
- William Gropp, Torsten Hoefler, Rajeev Thakur, and Jesper Larsson Träff. 2011. Performance Expectations and Guidelines for MPI Derived Datatypes. In Recent Advances in the Message Passing Interface (EuroMPI'11), Vol. 6960. 150-159.
- Peter D Grünwald. 2007. The minimum description length principle. MIT Press.
- Denis Gudovskiy, Alec Hodgkinson, and Luca Rigazio. 2018. DNN Feature Map Compression using Learned Representation over GF(2). In European Conference on Computer Vision (ECCV). arXiv:cs.CV/1808.05285
- Luis Guerra, Bohan Zhuang, Ian Reid, and Tom Drummond. 2020. Automatic Pruning for Quantized Neural Networks. (2020). arXiv:cs.CV/2002.00523
- Shupeng Gui, Haotao Wang, Chen Yu, Haichuan Yang, Zhangyang Wang, and Ji Liu. 2019. Model compression with adversarial robustness: A unified optimization framework. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/1902.03538
- Demi Guo, Alexander M. Rush, and Yoon Kim. 2020. Parameter-Efficient Transfer Learning with Diff Pruning. (2020). arXiv:cs.CL/2012.07463
- Fu-Ming Guo, Sijia Liu, Finlay S Mungall, Xue Lin, and Yanzhi Wang. 2019a. Reweighted proximal pruning for large-scale language representation. (2019). arXiv:cs.LG/1909.12486
- Qipeng Guo, Xipeng Qiu, Pengfei Liu, Yunfan Shao, Xiangyang Xue, and Zheng Zhang. 2019b. Star-Transformer. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL). arXiv:cs.CL/1902.09113
- Yiwen Guo, Anbang Yao, and Yurong Chen. 2016. Dynamic Network Surgery for Efficient DNNs. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.NE/1608.04493
- Yiwen Guo, Chao Zhang, Changshui Zhang, and Yurong Chen. 2018. Sparse DNNs with improved adversarial robustness. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/1810.09619
- Manish Gupta and Puneet Agrawal. 2020. Compression of Deep Learning Models for Text: A Survey. (2020). arXiv:cs.CL/2008.05221
- Udit Gupta, Brandon Reagen, Lillian Pentecost, Marco Donato, Thierry Tambe, Alexander M. Rush, Gu-Yeon Wei, and David Brooks. 2019. MASR: A Modular Accelerator for Sparse RNNs. In International Conference on Parallel Architectures and Compilation Techniques (PACT). arXiv:eess.SP/1908.08976
- Masafumi Hagiwara. 1993. Removal of hidden units and weights for back propagation networks. In International Conference on Neural Networks.
- Masafumi Hagiwara. 1994. A simple and effective method for removal of hidden units and weights. Neurocomputing 6, 2 (1994), 207-218.
- Hong-Gui Han and Jun-Fei Qiao. 2013. A structure optimisation algorithm for feedforward neural network construction. Neurocomputing 99 (2013), 347-357.
- Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, Hong Luo, Song Yao, Yu Wang, Huazhong Yang, and William J. Dally. 2017. ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA. In International Symposium on Field-Programmable Gate Arrays (FPGA). arXiv:cs.CL/1612.00694
- Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally. 2016a. EIE: Efficient Inference Engine on Compressed Deep Neural Network. ACM SIGARCH Computer Architecture News 44, 3 (2016), 243-254. arXiv:cs.CV/1602.01528
- Song Han, Huizi Mao, and William J. Dally. 2016b. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. In International Conference on Learning Representations (ICLR). arXiv:cs.CV/1510.00149
- Song Han, Jeff Pool, Sharan Narang, Huizi Mao, Enhao Gong, Shijian Tang, Erich Elsen, Peter Vajda, Manohar Paluri, John Tran, Bryan Catanzaro, and William J. Dally. 2017. DSD: Dense-Sparse-Dense Training for Deep Neural Networks. In International Conference on Learning Representations (ICLR). arXiv:cs.CV/1607.04381
- Lars Kai Hansen and Morten With Pedersen. 1994. Controlled growth of cascade correlation nets. In Conference on Artificial Neural Networks.
- Stephen Hanson and Lorien Pratt. 1989. Comparing Biases for Minimal Network Construction with Back-Propagation. In Advances in Neural Information Processing Systems (NeurIPS). https://proceedings.neurips.cc/paper/1988/hash/1c9ac0159c94d8d0cbedc973445af2da-Abstract.html
- Babak Hassibi and David G. Stork. 1992. Second Order Derivatives for Network Pruning: Optimal Brain Surgeon. In Advances in Neural Information Processing Systems (NeurIPS). https://papers.nips.cc/paper/1992/hash/303ed4c69846ab36c2904d3ba8573050-Abstract.html
- Jeff Hawkins. 2017. Special report: Can we copy the brain? What intelligent machines need to learn from the Neocortex. IEEE Spectrum 54, 6 (2017), 34-71.
- Soufiane Hayou, Jean-Francois Ton, Arnaud Doucet, and Yee Whye Teh. 2021. Robust Pruning at Initialization. In International Conference on Learning Representations (ICLR). arXiv:stat.ML/2002.08797
- Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask R-CNN. In International Conference on Computer Vision (ICCV). arXiv:cs.CV/1703.06870
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In International Conference on Computer Vision (ICCV). arXiv:cs.CV/1502.01852
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:cs.CV/1512.03385
- Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. 2018. AMC: AutoML for Model Compression and Acceleration on Mobile Devices. In European Conference on Computer Vision (ECCV). arXiv:cs.CV/1802.03494
- Yang He, Ping Liu, Ziwei Wang, Zhilan Hu, and Yi Yang. 2019. Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration. In Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:cs.CV/1811.00250
- Yihui He, Xiangyu Zhang, and Jian Sun. 2017. Channel Pruning for Accelerating Very Deep Neural Networks. In International Conference on Computer Vision (ICCV). arXiv:cs.CV/1707.06168
- Donald O. Hebb. 1949. The organization of behavior: A neuropsychological theory. Wiley, New York.
- Kartik Hegde, Hadi Asghari-Moghaddam, Michael Pellauer, Neal Crago, Aamer Jaleel, Edgar Solomonik, Joel Emer, and Christopher W. Fletcher. 2019. ExTensor: An Accelerator for Sparse Tensor Algebra. In International Symposium on Microarchitecture (MICRO).
- Dan Hendrycks and Thomas Dietterich. 2019. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations (ICLR). arXiv:cs.LG/1903.12261
- Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. 2019. Natural adversarial examples. (2019). arXiv:cs.LG/1907.07174
- Suzana Herculano-Houzel, Bruno Mota, Peiyan Wong, and Jon H. Kaas. 2010. Connectivity-driven white matter scaling and folding in primate cerebral cortex. Proceedings of the National Academy of Sciences 107, 44 (2010), 19008-19013.
- Parker Hill, Animesh Jain, Mason Hill, Babak Zamirai, Chang-Hong Hsu, Michael A. Laurenzano, Scott Mahlke, Lingjia Tang, and Jason Mars. 2017. DeftNN: Addressing Bottlenecks for DNN Execution on GPUs via Synapse Vector Elimination and Near-compute Data Fission. In International Symposium on Microarchitecture (MICRO).
- Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the Knowledge in a Neural Network. In NeurIPS Deep Learning and Representation Learning Workshop. arXiv:stat.ML/1503.02531
- Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. (2012). arXiv:cs.NE/1207.0580
- Geoffrey E Hinton and Drew Van Camp. 1993. Keeping the neural networks simple by minimizing the description length of the weights. In Conference on Computational Learning Theory (COLT).
- Torsten Hoefler and Roberto Belli. 2015. Scientific Benchmarking of Parallel Computing Systems. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
- Sara Hooker, Aaron Courville, Gregory Clark, Yann Dauphin, and Andrea Frome. 2019. What Do Compressed Deep Neural Networks Forget? (2019). arXiv:cs.LG/1911.05248
- Sara Hooker, Nyalleng Moorosi, Gregory Clark, Samy Bengio, and Emily Denton. 2020. Characterising bias in compressed models. (2020). arXiv:cs.LG/2010.03058
- Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. (2017). arXiv:cs.CV/1704.04861
- Patrik O Hoyer. 2004. Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research 5, Nov (2004), 1457-1469.
- Hengyuan Hu, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang. 2016. Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures. (2016). arXiv:cs.NE/1607.03250
- Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. 2016. Deep Networks with Stochastic Depth. In European Conference on Computer Vision (ECCV). arXiv:cs.LG/1603.09382
- Zehao Huang and Naiyan Wang. 2018. Data-Driven Sparse Structure Selection for Deep Neural Networks. In European Conference on Computer Vision (ECCV). arXiv:cs.CV/1707.01213
- Ziyue Huang, Yilei Wang, Ke Yi, et al. 2019. Optimal Sparsity-Sensitive Bounds for Distributed Mean Estimation. In Advances in Neural Information Processing Systems (NeurIPS). https://papers.nips.cc/paper/2019/hash/5b970a1d9be0fd100063fd6cd688b73e-Abstract.html
- Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. 2016. Binarized Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS). https://papers.nips.cc/paper/2016/hash/d8330f857a17c53d217014ee776bfd50-Abstract.html
- Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer. 2016. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. (2016). arXiv:cs.CV/1602.07360
- Sergey Ioffe and Christian Szegedy. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In International Conference on Machine Learning (ICML). arXiv:cs.LG/1502.03167
- Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, and Torsten Hoefler. 2021. Data Movement Is All You Need: A Case Study on Optimizing Transformers. In Machine Learning and Systems (MLSys). arXiv:cs.LG/2007.00072
- Nikita Ivkin, Daniel Rothchild, Enayat Ullah, Ion Stoica, Raman Arora, et al. 2019. Communication-efficient distributed SGD with sketching. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/1903.04488
- Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. 1991. Adaptive mixtures of local experts. Neural Computation 3, 1 (1991), 79-87.
- Jan Niehues, Roldano Cattoni, Sebastian Stüker, Matteo Negri, Marco Turchi, Elizabeth Salesky, Ramon Sanabria, Loïc Barrault, Lucia Specia, and Marcello Federico. 2019. The IWSLT 2019 evaluation campaign. In 16th International Workshop on Spoken Language Translation (IWSLT 2019).
- Steven A Janowsky. 1989. Pruning versus clipping in neural networks. Physical Review A 39, 12 (1989), 6600.
- Siddhant Jayakumar, Razvan Pascanu, Jack Rae, Simon Osindero, and Erich Elsen. 2020. Top-KAST: Top-K Always Sparse Training. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/2106.03517
- Peng Jiang and Gagan Agrawal. 2018. A linear speedup analysis of distributed deep learning with sparse and quantized communication. In Advances in Neural Information Processing Systems (NeurIPS). https://proceedings.neurips.cc/paper/2018/hash/17326d10d511828f6b34fa6d751739e2-Abstract.html
- Sian Jin, Sheng Di, Xin Liang, Jiannan Tian, Dingwen Tao, and Franck Cappello. 2019. DeepSZ: A Novel Framework to Compress Deep Neural Networks by Using Error-Bounded Lossy Compression. In International Symposium on High-Performance Parallel and Distributed Computing (HPDC). arXiv:cs.CV/1901.09124
- Xiaojie Jin, Xiaotong Yuan, Jiashi Feng, and Shuicheng Yan. 2016. Training Skinny Deep Neural Networks with Iterative Hard Thresholding Methods. (2016). arXiv:cs.CV/1607.05423
- Sari Jones, Lars Nyberg, Johan Sandblom, Anna Stigsdotter Neely, Martin Ingvar, Karl Magnus Petersson, and Lars Bäckman. 2006. Cognitive and neural plasticity in aging: general and task-specific limitations. Neuroscience & Biobehavioral Reviews 30, 6 (2006), 864-871.
- Michael I Jordan and Robert A Jacobs. 1994. Hierarchical mixtures of experts and the EM algorithm. Neural Computation 6, 2 (1994), 181-214.
- Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. 2018. Efficient Neural Audio Synthesis. In International Conference on Machine Learning (ICML). arXiv:cs.SD/1802.08435
- Keisuke Kameyama and Yukio Kosugi. 1991. Automatic fusion and splitting of artificial neural elements in optimizing the network size. In IEEE International Conference on Systems, Man, and Cybernetics.
- Minsoo Kang and Bohyung Han. 2020. Operation-Aware Soft Channel Pruning using Differentiable Masks. In International Conference on Machine Learning (ICML). arXiv:cs.LG/2007.03938
- Partha P. Kanjilal, P. K. Dey, and D. N. Banerjee. 1993. Reduced-size neural networks through singular value decomposition and subset selection. Electronics Letters 29, 17 (1993), 1516-1518.
- Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling Laws for Neural Language Models. (2020). arXiv:cs.LG/2001.08361
- Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian U Stich, and Martin Jaggi. 2019. Error feedback fixes SignSGD and other gradient compression schemes. In International Conference on Machine Learning (ICML). arXiv:cs.LG/1901.09847
- Ehud D. Karnin. 1990. A simple procedure for pruning back-propagation trained neural networks. IEEE Transactions on Neural Networks 1, 2 (1990), 239-242.
- Jason N. D. Kerr, David Greenberg, and Fritjof Helmchen. 2005. Imaging input and output of neocortical networks in vivo. Proceedings of the National Academy of Sciences 102, 39 (2005), 14063-14068.
- Dongyoung Kim, Junwhan Ahn, and Sungjoo Yoo. 2018. ZeNA: Zero-Aware Neural Network Accelerator. IEEE Design Test 35, 1 (2018), 39-46.
- Diederik P Kingma, Tim Salimans, and Max Welling. 2015. Variational Dropout and the Local Reparameterization Trick. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:stat.ML/1506.02557
- Diederik P Kingma and Max Welling. 2014. Auto-encoding variational bayes. In International Conference on Learning Representations (ICLR). arXiv:cs.LG/1312.6114
- Maxim Kodryan, Artem Grachev, Dmitry Ignatov, and Dmitry Vetrov. 2019. Efficient Language Modeling with Automatic Relevance Determination in Recurrent Neural Networks. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019). 40-48.
- Jakub Konečný and Peter Richtárik. 2018. Randomized distributed mean estimation: Accuracy vs. communication. Frontiers in Applied Mathematics and Statistics 4 (2018), 62. arXiv:cs.DC/1611.07555
- Anders Krogh and John A. Hertz. 1991. A Simple Weight Decay Can Improve Generalization. In Advances in Neural Information Processing Systems (NeurIPS). https://papers.nips.cc/paper/1991/hash/8eefcfdf5990e441f0fb6f3fad709e21-Abstract.html
- David Krueger, Tegan Maharaj, János Kramár, Mohammad Pezeshki, Nicolas Ballas, Nan Rosemary Ke, Anirudh Goyal, Yoshua Bengio, Aaron Courville, and Chris Pal. 2017. Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations. International Conference on Learning Representations (ICLR) (2017). arXiv:cs.NE/1606.01305
- Souvik Kundu, Mahdi Nazemi, Peter A Beerel, and Massoud Pedram. 2021. DNR: A Tunable Robust Pruning Framework Through Dynamic Network Rewiring of DNNs. In Asia and South Pacific Design Automation Conference (ASP-DAC). arXiv:cs.CV/2011.03083
- Souvik Kundu, Mahdi Nazemi, Massoud Pedram, Keith M Chugg, and Peter A Beerel. 2020. Pre-defined sparsity for low-complexity convolutional neural networks. IEEE Trans. Comput. 69, 7 (2020), 1045-1058. arXiv:cs.CV/2001.10710
- Souvik Kundu and Sairam Sundaresan. 2021. AttentionLite: Towards Efficient Self-Attention Models for Vision. In International Conference on Acoustics, Speech and Signal Processing (ICASSP). arXiv:cs.CV/2101.05216
- H. T. Kung, Bradley McDanel, and Sai Qian Zhang. 2019. Packing Sparse Convolutional Neural Networks for Efficient Systolic Array Implementations: Column Combining Under Joint Optimization. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). arXiv:cs.LG/1811.04770
- Frederik Kunstner, Philipp Hennig, and Lukas Balles. 2019. Limitations of the empirical Fisher approximation for natural gradient descent. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/1905.12558
- Mark Kurtz, Justin Kopinsky, Rati Gelashvili, Alexander Matveev, John Carr, Michael Goin, William Leiserson, Sage Moore, Nir Shavit, and Dan Alistarh. 2020. Inducing and Exploiting Activation Sparsity for Fast Inference on Deep Neural Networks. In International Conference on Machine Learning (ICML). http://proceedings.mlr.press/v119/kurtz20a.html
- Aditya Kusupati, Vivek Ramanujan, Raghav Somani, Mitchell Wortsman, Prateek Jain, Sham Kakade, and Ali Farhadi. 2020. Soft Threshold Weight Reparameterization for Learnable Sparsity. In International Conference on Machine Learning (ICML). arXiv:cs.LG/2002.03231
- Andrey Kuzmin, Markus Nagel, Saurabh Pitre, Sandeep Pendyam, Tijmen Blankevoort, and Max Welling. 2019. Taxonomy and Evaluation of Structured Compression of Convolutional Neural Networks. (2019). arXiv:cs.LG/1912.09802
- Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics 7 (2019), 453-466.
- Guillaume Lample, Alexandre Sablayrolles, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2019. Large Memory Layers with Product Keys. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.CL/1907.05242
- Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. 2017. FractalNet: Ultra-Deep Neural Networks without Residuals. International Conference on Learning Representations (ICLR) (2017). arXiv:cs.CV/1605.07648
- Philippe Lauret, Eric Fock, and Thierry Alex Mara. 2006. A node pruning algorithm based on a Fourier amplitude sensitivity test method. IEEE Transactions on Neural Networks 17, 2 (2006), 273-293.
- Andrew Lavin and Scott Gray. 2016. Fast Algorithms for Convolutional Neural Networks. In Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:cs.NE/1509.09308
- Yann Le Cun, John S. Denker, and Sara A. Solla. 1990. Optimal Brain Damage. In Advances in Neural Information Processing Systems (NeurIPS).
- Vadim Lebedev and Victor Lempitsky. 2016. Fast ConvNets Using Group-wise Brain Damage. In Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:cs.CV/1506.02515
- Namhoon Lee, Thalaiyasingam Ajanthan, Stephen Gould, and Philip H. S. Torr. 2020. A Signal Propagation Perspective for Pruning Neural Networks at Initialization. In International Conference on Learning Representations (ICLR). arXiv:cs.LG/1906.06307
- Namhoon Lee, Thalaiyasingam Ajanthan, and Philip H. S. Torr. 2019. SNIP: Single-shot Network Pruning based on Connection Sensitivity. In International Conference on Learning Representations (ICLR). arXiv:cs.CV/1810.02340
- Dmitry Lepikhin, Hyouk Joong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2021. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. In International Conference on Learning Representations (ICLR). arXiv:cs.CL/2006.16668
- Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. 2018. Measuring the Intrinsic Dimension of Objective Landscapes. In International Conference on Learning Representations (ICLR). arXiv:cs.LG/1804.08838
- Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. 2017. Pruning Filters for Efficient ConvNets. In International Conference on Learning Representations (ICLR). arXiv:cs.CV/1608.08710
- Jiajun Li, Shuhao Jiang, Shijun Gong, Jingya Wu, Junchao Yan, Guihai Yan, and Xiaowei Li. 2019. SqueezeFlow: A Sparse CNN Accelerator Exploiting Concise Convolution Rules. IEEE Trans. Comput. 68, 11 (2019), 1663-1677.
- Xiaoya Li, Yuxian Meng, Mingxin Zhou, Qinghong Han, Fei Wu, and Jiwei Li. 2020a. SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.CL/2003.09833
- Yunqiang Li, Silvia Laura Pintea, and Jan van Gemert. 2020b. Less bits is more: How pruning deep binary networks increases weight capacity. (2020). https://openreview.net/forum?id=Hy8JM_Fvt5N
- Yuanzhi Li, Colin Wei, and Tengyu Ma. 2020d. Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/1907.04595
- Zhuohan Li, Eric Wallace, Sheng Shen, Kevin Lin, Kurt Keutzer, Dan Klein, and Joseph E. Gonzalez. 2020c. Train Big, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers. In International Conference on Machine Learning (ICML). arXiv:cs.CL/2002.11794
- Lucas Liebenwein, Cenk Baykal, Harry Lang, Dan Feldman, and Daniela Rus. 2020. Provable Filter Pruning for Efficient Neural Networks. In International Conference on Learning Representations (ICLR). arXiv:cs.LG/1911.07412
- Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2016. Continuous control with deep reinforcement learning. In International Conference on Machine Learning (ICML). arXiv:cs.LG/1509.02971
- Timothy P Lillicrap, Adam Santoro, Luke Marris, Colin J Akerman, and Geoffrey Hinton. 2020. Backpropagation and the brain. Nature Reviews Neuroscience (2020), 1-12.
- Hyeontaek Lim, David Andersen, and Michael Kaminsky. 2019. 3LC: Lightweight and Effective Traffic Compression for Distributed Machine Learning. In Machine Learning and Systems (MLSys). arXiv:cs.LG/1802.07389
- Ji Lin, Yongming Rao, Jiwen Lu, and Jie Zhou. 2017. Runtime Neural Pruning. In Advances in Neural Information Processing Systems (NeurIPS). https://papers.nips.cc/paper/2017/hash/a51fb975227d6640e4fe47854476d133-Abstract.html
- Tao Lin, Sebastian U. Stich, Luis Barba, Daniil Dmitriev, and Martin Jaggi. 2020. Dynamic Model Pruning with Feedback. In International Conference on Learning Representations (ICLR). arXiv:cs.LG/2006.07253
- Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J Dally. 2018. Deep gradient compression: Reducing the communication bandwidth for distributed training. In International Conference on Learning Representations (ICLR). arXiv:cs.CV/1712.01887
- Zi Lin, Jeremiah Zhe Liu, Zi Yang, Nan Hua, and Dan Roth. 2020. Pruning Redundant Mappings in Transformer Models via Spectral-Normalized Identity Prior. In Findings of the Association for Computational Linguistics: EMNLP 2020. arXiv:cs.CL/2010.01791
- Pierre Lison, Jörg Tiedemann, Milen Kouylekov, et al. 2019. OpenSubtitles2018: Statistical rescoring of sentence alignments in large, noisy parallel corpora. In Eleventh International Conference on Language Resources and Evaluation (LREC).
- Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Marianna Penksy. 2015. Sparse Convolutional Neural Networks. In Conference on Computer Vision and Pattern Recognition (CVPR).
- Lanlan Liu and Jia Deng. 2018. Dynamic Deep Neural Networks: Optimizing Accuracy-Efficiency Trade-offs by Selective Execution. In AAAI Conference on Artificial Intelligence (AAAI). arXiv:cs.LG/1701.00299
- Liu Liu, Lei Deng, Xing Hu, Maohua Zhu, Guoqi Li, Yufei Ding, and Yuan Xie. 2019. Dynamic Sparse Graph for Efficient Deep Learning. In International Conference on Learning Representations (ICLR). arXiv:cs.LG/1810.00859
- Tianlin Liu and Friedemann Zenke. 2020. Finding trainable sparse networks through Neural Tangent Transfer. In International Conference on Machine Learning (ICML). arXiv:cs.LG/2006.08228
- Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019a. RoBERTa: A robustly optimized BERT pretraining approach. (2019). arXiv:cs.CL/1907.11692
- Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. 2017. Learning Efficient Convolutional Networks through Network Slimming. In International Conference on Computer Vision (ICCV). arXiv:cs.CV/1708.06519
- Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. 2015. Deep learning face attributes in the wild. In International Conference on Computer Vision (ICCV). arXiv:cs.CV/1411.7766
- Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. 2019b. Rethinking the Value of Network Pruning. In International Conference on Learning Representations (ICLR). arXiv:1810.05270
- Ekaterina Lobacheva, Nadezhda Chirkova, and Dmitry Vetrov. 2018. Bayesian sparsification of gated recurrent neural networks. In NeurIPS Workshop on Compact Deep Neural Networks with Industrial Applications. arXiv:cs.LG/1812.05692
- Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR). arXiv:1711.05101
- Christos Louizos, Karen Ullrich, and Max Welling. 2017. Bayesian Compression for Deep Learning. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:stat.ML/1705.08665
- Christos Louizos, Max Welling, and Diederik P. Kingma. 2018. Learning Sparse Neural Networks through L0 Regularization. In International Conference on Learning Representations (ICLR). arXiv:stat.ML/1712.01312
- Jian-Hao Luo and Jianxin Wu. 2019. AutoPruner: An End-to-End Trainable Filter Pruning Method for Efficient Deep Model Inference. Pattern Recognition 107 (2019), 107461. arXiv:cs.CV/1805.08941
- Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. 2017. ThiNet: A Filter Level Pruning Method for Deep Neural Network Compression. In International Conference on Computer Vision (ICCV). arXiv:cs.CV/1707.06342
- Alexander Ly, Maarten Marsman, Josine Verhagen, Raoul Grasman, and Eric-Jan Wagenmakers. 2017. A Tutorial on Fisher Information. Journal of Mathematical Psychology 80 (2017), 40-55. arXiv:math.ST/1705.01064
- Sangkug Lym, Esha Choukse, Siavash Zangeneh, Wei Wen, Sujay Sanghavi, and Mattan Erez. 2019. PruneTrain: Fast Neural Network Training by Dynamic Sparse Model Reconfiguration. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC). arXiv:cs.LG/1901.09290
- Divyam Madaan, Jinwoo Shin, and Sung Ju Hwang. 2020. Adversarial Neural Pruning with Latent Vulnerability Suppression. In International Conference on Machine Learning (ICML). arXiv:cs.LG/1908.04355
- Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. 2017. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. International Conference on Learning Representations (ICLR) (2017). arXiv:cs.LG/1611.00712
- Alireza Makhzani and Brendan Frey. 2015. Winner-Take-All Autoencoders. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/1409.2752
- Eran Malach, Gilad Yehudai, Shai Shalev-Shwartz, and Ohad Shamir. 2020. Proving the Lottery Ticket Hypothesis: Pruning is All You Need. In International Conference on Machine Learning (ICML). arXiv:cs.LG/2002.00585
- Chaitanya Malaviya, Pedro Ferreira, and André FT Martins. 2018. Sparse and constrained attention for neural machine translation. In Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (ACL). arXiv:cs.CL/1805.08241
- Arun Mallya and Svetlana Lazebnik. 2018. PackNet: Adding Multiple Tasks to a Single Network by Iterative Pruning. In Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:cs.CV/1711.05769
- Franco Manessi, Alessandro Rozza, Simone Bianco, Paolo Napoletano, and Raimondo Schettini. 2018. Automated Pruning for Deep Neural Network Compression. In International Conference on Pattern Recognition (ICPR). arXiv:cs.CV/1712.01721
- Huizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, and William J. Dally. 2017. Exploring the Regularity of Sparse Structure in Convolutional Neural Networks. (2017). arXiv:cs.LG/1705.08922
- Zelda Mariet and Suvrit Sra. 2016. Diversity Networks: Neural Network Compression Using Determinantal Point Processes. In International Conference on Learning Representations (ICLR). arXiv:cs.LG/1511.05077
- James Martens and Roger Grosse. 2015. Optimizing Neural Networks with Kronecker-factored Approximate Curvature. In International Conference on Machine Learning (ICML). arXiv:cs.LG/1503.05671
- Andre Martins and Ramon Astudillo. 2016. From softmax to sparsemax: A sparse model of attention and multi-label classification. In International Conference on Machine Learning (ICML). arXiv:cs.CL/1602.02068
- Peter Mattson, Christine Cheng, Cody Coleman, Greg Diamos, Paulius Micikevicius, David Patterson, Hanlin Tang, Gu-Yeon Wei, Peter Bailis, Victor Bittorf, David Brooks, Dehao Chen, Debojyoti Dutta, Udit Gupta, Kim Hazelwood, Andrew Hock, Xinyuan Huang, Atsushi Ike, Bill Jia, Daniel Kang, David Kanter, Naveen Kumar, Jeffery Liao, Guokai Ma, Deepak Narayanan, Tayo Oguntebi, Gennady Pekhimenko, Lillian Pentecost, Vijay Janapa Reddi, Taylor Robie, Tom St. John, Tsuguchika Tabaru, Carole-Jean Wu, Lingjie Xu, Masafumi Yamazaki, Cliff Young, and Matei Zaharia. 2020. MLPerf Training Benchmark. In Machine Learning and Systems (MLSys). arXiv:cs.LG/1910.01500
- Sam McCandlish, Jared Kaplan, Dario Amodei, and OpenAI Dota Team. 2018. An Empirical Model of Large-Batch Training. (2018). arXiv:cs.LG/1812.06162
- J. S. McCarley, Rishav Chakravarti, and Avirup Sil. 2020. Structured Pruning of a BERT-based Question Answering Model. (2020). arXiv:cs.CL/1910.06360
- Dushyant Mehta, Kwang In Kim, and Christian Theobalt. 2019. On implicit filter level sparsity in convolutional neural networks. In Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:cs.LG/1811.12495
- Rahul Mehta. 2019. Sparse Transfer Learning via Winning Lottery Tickets. In NeurIPS Workshop on Learning Transferable Skills. arXiv:cs.LG/1905.07785
- Fanxu Meng, Hao Cheng, Ke Li, Huixiang Luo, Xiaowei Guo, Guangming Lu, and Xing Sun. 2020. Pruning Filter in Filter. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.CV/2009.14410
- Hrushikesh Mhaskar and Tomaso Poggio. 2016. Deep vs. shallow networks: An approximation theory perspective. Analysis and Applications 14, 06 (2016), 829-848. arXiv:cs.LG/1608.03287
- Paul Michel, Omer Levy, and Graham Neubig. 2019. Are Sixteen Heads Really Better than One?. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.CL/1905.10650
- Beren Millidge, Alexander Tschantz, and Christopher L. Buckley. 2020. Predictive Coding Approximates Backprop along Arbitrary Computation Graphs. (2020). arXiv:cs.LG/2006.04182
- Asit K. Mishra, Eriko Nurvitadhi, Jeffrey J. Cook, and Debbie Marr. 2018. WRPN: Wide Reduced-Precision Networks. In International Conference on Learning Representations (ICLR). arXiv:cs.CV/1709.01134
- Deepak Mittal, Shweta Bhardwaj, Mitesh M. Khapra, and Balaraman Ravindran. 2018. Recovering from Random Pruning: On the Plasticity of Deep Convolutional Neural Networks. In Winter Conference on Applications of Computer Vision (WACV). arXiv:cs.CV/1801.10447
- Decebal Constantin Mocanu, Elena Mocanu, Phuong H. Nguyen, Madeleine Gibescu, and Antonio Liotta. 2016. A topological insight into restricted Boltzmann machines. Machine Learning 104, 2-3 (Jul 2016), 243-270.
- Decebal Constantin Mocanu, Elena Mocanu, Peter Stone, Phuong H Nguyen, Madeleine Gibescu, and Antonio Liotta. 2018. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature communications 9, 1 (2018), 1-12. arXiv:cs.NE/1707.04780
- Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. 2017. Variational Dropout Sparsifies Deep Neural Networks. In International Conference on Machine Learning (ICML). arXiv:stat.ML/1701.05369
- Pavlo Molchanov, Arun Mallya, Stephen Tyree, Iuri Frosio, and Jan Kautz. 2019. Importance Estimation for Neural Network Pruning. In Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:cs.LG/1906.10771
- Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. 2017. Pruning Convolutional Neural Networks for Resource Efficient Inference. In International Conference on Learning Representations (ICLR). arXiv:cs.LG/1611.06440
- John E Moody. 1991. Note on generalization, regularization and architecture selection in nonlinear learning systems. In IEEE Workshop on Neural Networks for Signal Processing.
- Ari S. Morcos, Haonan Yu, Michela Paganini, and Yuandong Tian. 2019. One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:stat.ML/1906.02773
- Hesham Mostafa and Xin Wang. 2019. Parameter Efficient Training of Deep Convolutional Neural Networks by Dynamic Sparse Reparameterization. In International Conference on Machine Learning (ICML). arXiv:cs.LG/1902.05967
- Michael C Mozer and Paul Smolensky. 1988. Skeletonization: A technique for trimming the fat from a network via relevance assessment. Advances in Neural Information Processing Systems (NeurIPS) (1988). https://proceedings.neurips.cc/paper/1988/hash/07e1cd7dca89a1678042477183b7ac3f-Abstract.html
- Sayan Mukherjee, Partha Niyogi, Tomaso Poggio, and Ryan Rifkin. 2006. Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Advances in Computational Mathematics 25, 1-3 (2006), 161-193.
- Ben Mussay, Daniel Feldman, Samson Zhou, Vladimir Braverman, and Margarita Osadchy. 2020. Data-Independent Structured Pruning of Neural Networks via Coresets. In International Conference on Learning Representations (ICLR). arXiv:cs.LG/2008.08316
- Sharan Narang, Erich Elsen, Gregory Diamos, and Shubho Sengupta. 2017. Exploring Sparsity in Recurrent Neural Networks. In International Conference on Learning Representations (ICLR). arXiv:cs.LG/1704.05119
- Pramod L. Narasimha, Walter H. Delashmit, Michael T. Manry, Jiang Li, and Francisco Maldonado. 2008. An integrated growing-pruning method for feedforward network training. Neurocomputing 71, 13 (2008), 2831 - 2847.Google Scholar
- Kirill Neklyudov, Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. 2017. Structured Bayesian Pruning via Log-Normal Multiplicative Noise. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:stat.ML/1705.07283Google Scholar
- Behnam Neyshabur. 2020. Towards Learning Convolutions from Scratch. (2020). arXiv:cs.LG/2007.13657
- Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro. 2019. The Role of Over-Parametrization in Generalization of Neural Networks. In International Conference on Learning Representations (ICLR). arXiv:cs.LG/1805.12076
- Jiquan Ngiam, Zhenghao Chen, Daniel Chia, Pang W. Koh, Quoc V. Le, and Andrew Y. Ng. 2010. Tiled convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS). https://papers.nips.cc/paper/2010/hash/01f78be6f7cad02658508fe4616098a9-Abstract.html
- Vlad Niculae and Mathieu Blondel. 2017. A regularized framework for sparse and structured neural attention. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:stat.ML/1705.07704
- Nils J Nilsson. 2009. The quest for artificial intelligence: A history of ideas and achievements. Cambridge University Press.
- Yue Niu, Rajgopal Kannan, Ajitesh Srivastava, and Viktor Prasanna. 2020. Reuse Kernels or Activations? A Flexible Dataflow for Low-Latency Spectral CNN Acceleration. In International Symposium on Field-Programmable Gate Arrays (FPGA).
- Yue Niu, Hanqing Zeng, Ajitesh Srivastava, Kartik Lakhotia, Rajgopal Kannan, Yanzhi Wang, and Viktor Prasanna. 2019. SPEC2: SPECtral SParsE CNN Accelerator on FPGAs. (2019). arXiv:cs.CV/1910.11103
- Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. 2015. Learning Deconvolution Network for Semantic Segmentation. In International Conference on Computer Vision (ICCV). arXiv:cs.CV/1505.04366
- Steven J Nowlan and Geoffrey E Hinton. 1992. Simplifying neural networks by soft weight-sharing. Neural Computation 4, 4 (1992), 473-493.
- Nvidia. 2020. NVIDIA A100 Tensor Core GPU Architecture. (2020).
- Bruno A Olshausen and David J Field. 1996. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381, 6583 (1996), 607-609.
- Laurent Orseau, Marcus Hutter, and Omar Rivasplata. 2020. Logarithmic Pruning is All You Need. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/2006.12156
- Kazuki Osawa, Yohei Tsuji, Yuichiro Ueno, Akira Naruse, Rio Yokota, and Satoshi Matsuoka. 2019. Large-Scale Distributed Second-Order Optimization Using Kronecker-Factored Approximate Curvature for Deep Convolutional Neural Networks. In Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:cs.LG/1811.12019
- Wei Pan, Hao Dong, and Yike Guo. 2016. DropNeuron: Simplifying the Structure of Deep Neural Networks. (2016). arXiv:cs.CV/1606.07326
- Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel Emer, Stephen W. Keckler, and William J. Dally. 2017. SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks. ACM SIGARCH Computer Architecture News 45, 2 (2017), 27-40. arXiv:cs.NE/1708.04485
- Jongsoo Park, Sheng Li, Wei Wen, Ping Tak Peter Tang, Hai Li, Yiran Chen, and Pradeep Dubey. 2017. Faster CNNs with Direct Sparse Convolutions and Guided Pruning. In International Conference on Learning Representations (ICLR). arXiv:cs.CV/1608.01409
- Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. 2018. Image Transformer. In International Conference on Machine Learning (ICML). arXiv:cs.CV/1802.05751
- Morten Pedersen, Lars Hansen, and Jan Larsen. 1995. Pruning with generalization based weight saliencies: λOBD, λOBS. In Advances in Neural Information Processing Systems (NeurIPS). https://proceedings.neurips.cc/paper/1995/hash/3473decccb0509fb264818a7512a8b9b-Abstract.html
- Ankit Pensia, Shashank Rajput, Alliot Nagle, Harit Vishwakarma, and Dimitris Papailiopoulos. 2020. Optimal Lottery Tickets via SubsetSum: Logarithmic Over-Parameterization is Sufficient. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/2006.07990
- Bryan A. Plummer, Nikoli Dryden, Julius Frost, Torsten Hoefler, and Kate Saenko. 2020. Neural Parameter Allocation Search. (2020). arXiv:cs.LG/2006.10598
- Adam Polyak and Lior Wolf. 2015. Channel-level acceleration of deep face representations. IEEE Access 3 (2015), 2163-2175.
- Udo W. Pooch and Al Nieder. 1973. A Survey of Indexing Techniques for Sparse Matrices. ACM Comput. Surv. 5, 2 (June 1973), 109-133.
- Ameya Prabhu, Girish Varma, and Anoop Namboodiri. 2018. Deep Expander Networks: Efficient Deep Networks from Graph Theory. In European Conference on Computer Vision (ECCV). arXiv:cs.CV/1711.08757
- Sai Prasanna, Anna Rogers, and Anna Rumshisky. 2020. When BERT Plays the Lottery, All Tickets Are Winning. In Conference on Empirical Methods in Natural Language Processing (EMNLP). arXiv:cs.CL/2005.00561
- Lutz Prechelt. 1997. Connection pruning with static and adaptive pruning schedules. Neurocomputing 16, 1 (1997), 49-61.
- Eric Qin, Ananda Samajdar, Hyoukjun Kwon, Vineet Nadella, Sudarshan Srinivasan, Dipankar Das, Bharat Kaul, and Tushar Krishna. 2020. SIGMA: A Sparse and Irregular GEMM Accelerator with Flexible Interconnects for DNN Training. In International Symposium on High Performance Computer Architecture (HPCA).
- Md Aamir Raihan and Tor M. Aamodt. 2020. Sparse Weight Activation Training. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/2001.01969
- Adnan Siraj Rakin, Zhezhi He, Li Yang, Yanzhi Wang, Liqiang Wang, and Deliang Fan. 2020. Robust Sparse Regularization: Defending Adversarial Attacks Via Regularized Sparse Network. In Great Lakes Symposium on VLSI (GLSVLSI).
- Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, and Mohammad Rastegari. 2020. What's Hidden in a Randomly Weighted Neural Network?. In Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:cs.CV/1911.13299
- Carl Edward Rasmussen and Zoubin Ghahramani. 2000. Occam's razor. In Advances in Neural Information Processing Systems (NeurIPS). https://papers.nips.cc/paper/2000/hash/0950ca92a4dcf426067cfd2246bb5ff3-Abstract.html
- Brandon Reagen, Paul Whatmough, Robert Adolf, Saketh Rama, Hyunkwang Lee, Sae Kyu Lee, José Miguel Hernández-Lobato, Gu-Yeon Wei, and David Brooks. 2016. Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators. In International Symposium on Computer Architecture (ISCA).
- Russell Reed. 1993. Pruning algorithms-a survey. IEEE Transactions on Neural Networks 4, 5 (1993), 740-747.
- Alex Renda, Jonathan Frankle, and Michael Carbin. 2020. Comparing Rewinding and Fine-tuning in Neural Network Pruning. In International Conference on Learning Representations (ICLR). arXiv:cs.LG/2003.02389
- Cédric Renggli, Saleh Ashkboos, Mehdi Aghagolzadeh, Dan Alistarh, and Torsten Hoefler. 2019. SparCML: High-performance sparse communication for machine learning. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC). arXiv:cs.DC/1802.08021
- Albert Reuther, Peter Michaleas, Michael Jones, Vijay Gadepally, Siddharth Samsi, and Jeremy Kepner. 2020. Survey of Machine Learning Accelerators. In IEEE High Performance Extreme Computing Conference (HPEC). arXiv:cs.DC/2009.00993
- Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic backpropagation and variational inference in deep generative models. In International Conference on Machine Learning (ICML).
- Minsoo Rhu, Mike O'Connor, Niladrish Chatterjee, Jeff Pool, Youngeun Kwon, and Stephen W Keckler. 2018. Compressing DMA engine: Leveraging activation sparsity for training deep neural networks. In International Symposium on High Performance Computer Architecture (HPCA). arXiv:cs.LG/1705.01626
- Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2021. A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics 8 (2021), 842-866. arXiv:cs.CL/2002.12327
- Clemens Rosenbaum, Tim Klinger, and Matthew Riemer. 2017. Routing Networks: Adaptive Selection of Non-linear Functions for Multi-Task Learning. (2017). arXiv:cs.LG/1711.01239
- Stuart Russell and Peter Norvig. 2020. Artificial Intelligence: A Modern Approach (4th ed.). Prentice Hall Press.
- Tara N. Sainath, Brian Kingsbury, Vikas Sindhwani, Ebru Arisoy, and Bhuvana Ramabhadran. 2013. Low-rank matrix factorization for Deep Neural Network training with high-dimensional output targets. In International Conference on Acoustics, Speech and Signal Processing (ICASSP).
- Victor Sanh, Thomas Wolf, and Alexander M. Rush. 2020. Movement Pruning: Adaptive Sparsity by Fine-Tuning. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.CL/2005.07683
- Pedro Savarese, Hugo Silva, and Michael Maire. 2020. Winning the Lottery with Continuous Sparsification. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/1912.04427
- Simone Scardapane, Danilo Comminiello, Amir Hussain, and Aurelio Uncini. 2017. Group sparse regularization for deep neural networks. Neurocomputing 241 (2017), 81-89. arXiv:stat.ML/1607.00485
- Paul Scheffler, Florian Zaruba, Fabian Schuiki, Torsten Hoefler, and Luca Benini. 2020. Indirection Stream Semantic Register Architecture for Efficient Sparse-Dense Linear Algebra. (2020). arXiv:cs.AR/2011.08070
- Abigail See, Minh-Thang Luong, and Christopher D. Manning. 2016. Compression of Neural Machine Translation Models via Pruning. In SIGNLL Conference on Computational Natural Language Learning. arXiv:cs.AI/1606.09274
- Vikash Sehwag, Shiqi Wang, Prateek Mittal, and Suman Jana. 2020. HYDRA: Pruning Adversarially Robust Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.CV/2002.10509
- Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 2014. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association.
- Aditya Sharma, Nikolas Wolfe, and Bhiksha Raj. 2017. The Incredible Shrinking Neural Network: New Perspectives on Learning Representations Through The Lens of Pruning. (2017). arXiv:cs.NE/1701.04465
- Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. In International Conference on Learning Representations (ICLR). arXiv:cs.LG/1701.06538
- Shaohuai Shi, Qiang Wang, Kaiyong Zhao, Zhenheng Tang, Yuxin Wang, Xiang Huang, and Xiaowen Chu. 2019a. A distributed synchronous SGD algorithm with global Top-k sparsification for low bandwidth networks. In International Conference on Distributed Computing Systems Workshop on Networks. arXiv:cs.DC/1901.04359
- Shaohuai Shi, Kaiyong Zhao, Qiang Wang, Zhenheng Tang, and Xiaowen Chu. 2019b. A Convergence Analysis of Distributed SGD with Communication-Efficient Gradient Sparsification. In International Joint Conference on Artificial Intelligence.
- Reza Shokri and Vitaly Shmatikov. 2015. Privacy-preserving deep learning. In ACM SIGSAC Conference on Computer and Communications Security.
- Ravid Shwartz-Ziv and Naftali Tishby. 2017. Opening the Black Box of Deep Neural Networks via Information. (2017). arXiv:cs.LG/1703.00810
- Jocelyn Sietsma and Robert JF Dow. 1991. Creating artificial neural networks that generalize. Neural Networks 4, 1 (1991), 67-79.
- Jocelyn Sietsma and Robert J. F. Dow. 1988. Neural net pruning-why and how. In International Conference on Neural Networks.
- Laurent Sifre and Stéphane Mallat. 2014. Rigid-motion scattering for image classification. Ph.D. Dissertation. Ecole Polytechnique, CMAP.
- Sidak Pal Singh and Dan Alistarh. 2020. WoodFisher: Efficient Second-Order Approximation for Neural Network Compression. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/2004.14340
- Samarth Sinha, Zhengli Zhao, Anirudh Goyal, Colin A Raffel, and Augustus Odena. 2020. Top-k Training of GANs: Improving GAN Performance by Throwing Away Bad Samples. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:stat.ML/2002.06224
- Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V. Le. 2018. Don't Decay the Learning Rate, Increase the Batch Size. In International Conference on Learning Representations (ICLR). arXiv:cs.LG/1711.00489
- Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Conference on Empirical Methods in Natural Language Processing (EMNLP).
- Suraj Srinivas and R. Venkatesh Babu. 2015. Data-free parameter pruning for Deep Neural Networks. In British Machine Vision Conference (BMVC). arXiv:cs.CV/1507.06149
- Suraj Srinivas and R. Venkatesh Babu. 2016. Learning Neural Network Architectures using Backpropagation. In British Machine Vision Conference (BMVC). arXiv:cs.LG/1511.05497
- Suraj Srinivas, Akshayvarun Subramanya, and R. Venkatesh Babu. 2016. Training Sparse Neural Networks. In Conference on Computer Vision and Pattern Recognition Workshops. arXiv:cs.CV/1611.06694
- Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research 15, 56 (2014), 1929-1958. https://jmlr.org/papers/v15/srivastava14a.html
- Sebastian U Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. 2018. Sparsified SGD with memory. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/1809.07599
- Nikko Ström. 1997. Sparse connection and pruning in large dynamic artificial neural networks. In Fifth European Conference on Speech Communication and Technology.
- Nikko Ström. 2015. Scalable distributed DNN training using commodity GPU cloud computing. In Sixteenth Annual Conference of the International Speech Communication Association.
- Jingtong Su, Yihang Chen, Tianle Cai, Tianhao Wu, Ruiqi Gao, Liwei Wang, and Jason D. Lee. 2020. Sanity-Checking Pruning Methods: Random Tickets can Win the Jackpot. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/2009.11094
- Xavier Suau, Luca Zappella, and Nicholas Apostoloff. 2019. Filter Distillation for Network Compression. In Winter Conference on Applications of Computer Vision (WACV). arXiv:cs.CV/1807.10585
- Haobo Sun, Yingxia Shao, Jiawei Jiang, Bin Cui, Kai Lei, Yu Xu, and Jiang Wang. 2019. Sparse gradient compression for distributed SGD. In International Conference on Database Systems for Advanced Applications. 139-155.
- Xu Sun, Xuancheng Ren, Shuming Ma, and Houfeng Wang. 2017. meProp: Sparsified back propagation for accelerated deep learning with reduced overfitting. In International Conference on Machine Learning (ICML). arXiv:cs.LG/1706.06197
- Yi Sun, Xiaogang Wang, and Xiaoou Tang. 2015. Sparsifying Neural Network Connections for Face Recognition. In Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:cs.CV/1512.01891
- Ananda Theertha Suresh, Felix X. Yu, Sanjiv Kumar, and H Brendan McMahan. 2017. Distributed mean estimation with limited communication. In International Conference on Machine Learning (ICML). arXiv:cs.LG/1611.00429
- Kenji Suzuki, Isao Horiba, and Noboru Sugie. 2001. A simple neural network pruning algorithm with application to filter synthesis. In Neural Processing Letters. 43-53.
- Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S. Emer. 2017. Efficient Processing of Deep Neural Networks: A Tutorial and Survey. Proc. IEEE 105, 12 (2017), 2295-2329. arXiv:cs.CV/1703.09039
- Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going Deeper with Convolutions. In Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:cs.CV/1409.4842
- Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2016. Rethinking the Inception Architecture for Computer Vision. In Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:cs.CV/1512.00567
- S. Tamura, M. Tateishi, M. Matumoto, and S. Akita. 1993. Determination of the number of redundant hidden units in a three-layered feedforward neural network. In International Conference on Neural Networks.
- Chong Min John Tan and Mehul Motani. 2020. DropNet: Reducing Neural Network Complexity via Iterative Pruning. In International Conference on Machine Learning (ICML). http://proceedings.mlr.press/v119/tan20a.html
- Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V. Le. 2019. MnasNet: Platform-Aware Neural Architecture Search for Mobile. In Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:cs.CV/1807.11626
- Mingxing Tan and Quoc V. Le. 2020. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In International Conference on Machine Learning (ICML). arXiv:cs.LG/1905.11946
- Hidenori Tanaka, Daniel Kunin, Daniel L. K. Yamins, and Surya Ganguli. 2020. Pruning neural networks without any data by iteratively conserving synaptic flow. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/2006.05467
- Hanlin Tang, Chen Yu, Xiangru Lian, Tong Zhang, and Ji Liu. 2019. DoubleSqueeze: Parallel stochastic gradient descent with double-pass error-compensated compression. In International Conference on Machine Learning (ICML). arXiv:cs.DC/1905.05957
- Yehui Tang, Yunhe Wang, Yixing Xu, Dacheng Tao, Chunjing Xu, Chao Xu, and Chang Xu. 2020b. SCOP: Scientific Control for Reliable Neural Network Pruning. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.CV/2010.10732
- Zhenheng Tang, Shaohuai Shi, Xiaowen Chu, Wei Wang, and Bo Li. 2020a. Communication-efficient distributed deep learning: A comprehensive survey. (2020). arXiv:cs.DC/2003.06307
- Enzo Tartaglione, Skjalg Lepsøy, Attilio Fiandrotti, and Gianluca Francini. 2018. Learning Sparse Neural Networks via Sensitivity-Driven Regularization. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/1810.11764
- Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. 2021. Long Range Arena: A Benchmark for Efficient Transformers. In International Conference on Learning Representations (ICLR). arXiv:cs.LG/2011.04006
- Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2020. Efficient transformers: A survey. (2020). arXiv:cs.LG/2009.06732
- Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Annual Meeting of the Association for Computational Linguistics (ACL). arXiv:cs.CL/1905.05950
- Lucas Theis, Iryna Korshunova, Alykhan Tejani, and Ferenc Huszár. 2018. Faster gaze prediction with dense networks and Fisher pruning. (2018). arXiv:cs.CV/1801.05787
- Georg Thimm and Emile Fiesler. 1995. Evaluating Pruning Methods. In Proceedings of the International Symposium on Artificial Neural Networks.
- Robert Tibshirani. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58, 1 (1996), 267-288.
- Michael E Tipping. 2001. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research 1, Jun (2001), 211-244.
- Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, and Christopher Bregler. 2015. Efficient Object Localization Using Convolutional Networks. In Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:cs.CV/1411.4280
- Yusuke Tsuzuku, Hiroto Imachi, and Takuya Akiba. 2018. Variance-based gradient compression for efficient distributed deep learning. In International Conference on Learning Representations Workshops. arXiv:cs.LG/1802.06058
- Karen Ullrich, Edward Meeds, and Max Welling. 2017. Soft Weight-Sharing for Neural Network Compression. In International Conference on Learning Representations (ICLR). arXiv:stat.ML/1702.04008
- Didem Unat, Anshu Dubey, Torsten Hoefler, John Shalf, Mark Abraham, Mauro Bianco, Bradford L. Chamberlain, Romain Cledat, H. Carter Edwards, Hal Finkel, Karl Fuerlinger, Frank Hannig, Emmanuel Jeannot, Amir Kamil, Jeff Keasler, Paul H J Kelly, Vitus Leung, Hatem Ltaief, Naoya Maruyama, Chris J. Newburn, and Miquel Pericas. 2017. Trends in Data Locality Abstractions for HPC Systems. IEEE Transactions on Parallel and Distributed Systems (TPDS) 28, 10 (Oct. 2017).
- Mart van Baalen, Christos Louizos, Markus Nagel, Rana Ali Amjad, Ying Wang, Tijmen Blankevoort, and Max Welling. 2020. Bayesian Bits: Unifying Quantization and Pruning. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/2005.07093
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.CL/1706.03762
- Stijn Verdenius, Maarten Stol, and Patrick Forré. 2020. Pruning via Iterative Ranking of Sensitivity Statistics. (2020). arXiv:cs.LG/2006.00896
- Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. In Annual Meeting of the Association for Computational Linguistics (ACL). arXiv:cs.CL/1905.09418
- Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus. 2013. Regularization of Neural Networks using DropConnect. In International Conference on Machine Learning (ICML). http://proceedings.mlr.press/v28/wan13.html
- Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations (ICLR). arXiv:cs.CL/1804.07461
- Chaoqi Wang, Roger Grosse, Sanja Fidler, and Guodong Zhang. 2019. EigenDamage: Structured pruning in the Kronecker-factored eigenbasis. In International Conference on Machine Learning (ICML). arXiv:cs.LG/1905.05934
- Hongyi Wang, Scott Sievert, Shengchao Liu, Zachary Charles, Dimitris Papailiopoulos, and Stephen Wright. 2018. ATOMO: Communication-efficient learning via atomic sparsification. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:stat.ML/1806.04090
- Linnan Wang, Wei Wu, Junyu Zhang, Hang Liu, George Bosilca, Maurice Herlihy, and Rodrigo Fonseca. 2020b. FFT-based Gradient Sparsification for the Distributed Training of Deep Neural Networks. In International Symposium on High-Performance Parallel and Distributed Computing (HPDC).
- Ziheng Wang, Jeremy Wohlwend, and Tao Lei. 2020a. Structured pruning of large language models. In Conference on Empirical Methods in Natural Language Processing (EMNLP). arXiv:cs.CL/1910.04732
- Jianqiao Wangni, Jialei Wang, Ji Liu, and Tong Zhang. 2018. Gradient sparsification for communication-efficient distributed optimization. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/1710.09854
- Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. 2019. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics 7 (2019), 625-641. arXiv:cs.CL/1805.12471
- Bingzhen Wei, Xu Sun, Xuancheng Ren, and Jingjing Xu. 2017. Minimal Effort Back Propagation for Convolutional Neural Networks. (2017). arXiv:cs.LG/1709.05804
- Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. 2016. Learning Structured Sparsity in Deep Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.NE/1608.03665
- David White and Panos A. Ligomenides. 1993. GANNet: A Genetic Algorithm for Optimizing Topology and Weights in Neural Network Design. In Proceedings of the International Workshop on Artificial Neural Networks: New Trends in Neural Computation.
- D. Whitley and C. Bogart. 1990. The Evolution of Connectivity: Pruning Neural Networks Using Genetic Algorithms. In International Joint Conference on Neural Networks (IJCNN).
- Adina Williams, Nikita Nangia, and Samuel R Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL). arXiv:cs.CL/1704.05426
- Peter M. Williams. 1995. Bayesian Regularization and Pruning Using a Laplace Prior. Neural Computation 7, 1 (1995), 117-143.
- Mitchell Wortsman, Ali Farhadi, and Mohammad Rastegari. 2019. Discovering Neural Wirings. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/1906.00586
- Mitchell Wortsman, Vivek Ramanujan, Rosanne Liu, Aniruddha Kembhavi, Mohammad Rastegari, Jason Yosinski, and Ali Farhadi. 2020. Supermasks in Superposition. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/2006.14769
- Yuhuai Wu, Elman Mansimov, Roger B. Grosse, Shun Liao, and Jimmy Ba. 2017. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Advances in Neural Information Processing Systems (NeurIPS). 5285-5294. arXiv:cs.LG/1708.05144
- Xia Xiao, Zigeng Wang, and Sanguthevar Rajasekaran. 2019. AutoPrune: Automatic Network Pruning by Regularizing Auxiliary Parameters. In Advances in Neural Information Processing Systems (NeurIPS). https://papers.nips.cc/paper/2019/hash/4efc9e02abdab6b6166251918570a307-Abstract.html
- Jinhua Xu and Daniel WC Ho. 2006. A new training and pruning algorithm based on node dependence and Jacobian rank deficiency. Neurocomputing 70, 1-3 (2006), 544-558.
- Atsushi Yaguchi, Taiji Suzuki, Wataru Asano, Shuhei Nitta, Yukinobu Sakata, and Akiyuki Tanizawa. 2018. Adam induces implicit weight sparsity in rectifier neural networks. In International Conference on Machine Learning and Applications (ICMLA). arXiv:cs.LG/1812.08119
- Dingqing Yang, Amin Ghasemazar, Xiaowei Ren, Maximilian Golub, Guy Lemieux, and Mieszko Lis. 2020a. Procrustes: a Dataflow and Accelerator for Sparse Deep Neural Network Training. In International Symposium on Microarchitecture (MICRO). arXiv:cs.NE/2009.10976
- Huanrui Yang, Wei Wen, and Hai Li. 2020b. DeepHoyer: Learning Sparser Neural Network with Differentiable Scale-Invariant Sparsity Measures. In International Conference on Learning Representations (ICLR). arXiv:cs.LG/1908.09979
- Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze. 2017. Designing Energy-Efficient Convolutional Neural Networks using Energy-Aware Pruning. In Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:cs.CV/1611.05128
- Jianbo Ye, Xin Lu, Zhe Lin, and James Z Wang. 2018. Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers. In International Conference on Learning Representations (ICLR). arXiv:cs.LG/1802.00124
- Mao Ye, Chengyue Gong, Lizhen Nie, Denny Zhou, Adam Klivans, and Qiang Liu. 2020. Good Subnetworks Provably Exist: Pruning via Greedy Forward Selection. In International Conference on Machine Learning (ICML). arXiv:cs.LG/2003.01794
- Shaokai Ye, Kaidi Xu, Sijia Liu, Hao Cheng, Jan-Henrik Lambrechts, Huan Zhang, Aojun Zhou, Kaisheng Ma, Yanzhi Wang, and Xue Lin. 2019. Adversarial Robustness vs. Model Compression, or Both?. In International Conference on Computer Vision (ICCV). arXiv:cs.CV/1903.12561
- Penghang Yin, Jiancheng Lyu, Shuai Zhang, Stanley Osher, Yingyong Qi, and Jack Xin. 2019. Understanding Straight-Through Estimator in Training Activation Quantized Neural Nets. In International Conference on Learning Representations (ICLR). arXiv:cs.LG/1903.05662
- Haoran You, Chaojian Li, Pengfei Xu, Yonggan Fu, Yue Wang, Xiaohan Chen, Richard G. Baraniuk, Zhangyang Wang, and Yingyan Lin. 2020. Drawing early-bird tickets: Towards more efficient training of deep networks. In International Conference on Learning Representations (ICLR). arXiv:cs.LG/1909.11957
- Zhonghui You, Kun Yan, Jinmian Ye, Meng Ma, and Ping Wang. 2019. Gate Decorator: Global Filter Pruning Method for Accelerating Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.CV/1909.08174
- Dong Yu, Frank Seide, Gang Li, and Li Deng. 2012. Exploiting sparseness in deep neural networks for large vocabulary speech recognition. In International Conference on Acoustics, Speech and Signal Processing (ICASSP).
- Jiecao Yu, Andrew Lukefahr, David Palframan, Ganesh Dasika, Reetuparna Das, and Scott Mahlke. 2017. Scalpel: Customizing DNN pruning to the underlying hardware parallelism. ACM SIGARCH Computer Architecture News 45, 2 (2017), 548-560.
- Ruichi Yu, Ang Li, Chun-Fu Chen, Jui-Hsin Lai, Vlad I. Morariu, Xintong Han, Mingfei Gao, Ching-Yung Lin, and Larry S. Davis. 2018. NISP: Pruning Networks using Neuron Importance Score Propagation. In Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:cs.CV/1711.05908
- Xin Yu, Zhiding Yu, and Srikumar Ramalingam. 2018. Learning strict identity mappings in deep residual networks. In Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:cs.CV/1804.01661
- Ming Yuan and Yi Lin. 2006. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68, 1 (2006), 49-67.
- Chulhee Yun, Yin-Wen Chang, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, and Sanjiv Kumar. 2020. O(n) Connections are Expressive Enough: Universal Approximability of Sparse Transformers. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/2006.04862
- Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. 2020. Big Bird: Transformers for longer sequences. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/2007.14062
- Wenyuan Zeng and Raquel Urtasun. 2019. MLPrune: Multi-Layer Pruning for Automated Neural Network Compression. (2019). https://openreview.net/forum?id=r1g5b2RcKm
- Xiaoqin Zeng and Daniel S Yeung. 2006. Hidden neuron pruning of multilayer perceptrons using a quantified sensitivity measure. Neurocomputing 69, 7-9 (2006), 825-837.Google Scholar
- Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. 2017. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations (ICLR). arXiv:cs.LG/1611.03530Google Scholar
- Jiaqi Zhang, Xiangru Chen, Mingcong Song, and Tao Li. 2019. Eager Pruning: Algorithm and Architecture Support for Fast Training of Deep Neural Networks. In International Symposium on Computer Architecture (ISCA).Google Scholar
- Jie-Fang Zhang, Ching-En Lee, C. Liu, Y. Shao, Stephen W. Keckler, and Zhengya Zhang. 2019a. SNAP: A 1.67 21.55TOPS/W Sparse Neural Acceleration Processor for Unstructured Sparse Deep Neural Network Inference in 16nm CMOS. In Symposium on VLSI Circuits.Google Scholar
- Jeff (Jun) Zhang, Parul Raj, Shuayb Zarar, Amol Ambardekar, and Siddharth Garg. 2019b. CompAct: On-Chip ComPression of ActIvations for Low Power Systolic Array Based CNN Acceleration. ACM Trans. Embed. Comput. Syst. 18, 5s, Article 47 (Oct. 2019), 24 pages.Google Scholar
- Shijin Zhang, Zidong Du, Lei Zhang, Huiying Lan, Shaoi Liu, Ling Li, Qi Guo, Tianshi Chen, and Yunji Chen. 2016. Cambricon-X: An accelerator for sparse neural networks. In International Symposium on Microarchitecture (MICRO).Google Scholar
- Zhekai Zhang, Hanrui Wang, Song Han, and William J. Dally. 2020. SpArch: Efficient Architecture for Sparse Matrix Multiplication. In International Symposium on High Performance Computer Architecture (HPCA). arXiv:cs.AR/2002.08947Google Scholar
- Guangxiang Zhao, Junyang Lin, Zhiyuan Zhang, Xuancheng Ren, Qi Su, and Xu Sun. 2019a. Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection. (2019). arXiv:cs.CL/1912.11637Google Scholar
- Qibin Zhao, Masashi Sugiyama, Longhao Yuan, and Andrzej Cichocki. 2019b. Learning Efficient Tensor Representations with Ring Structure Networks. In International Conference on Acoustics, Speech and Signal Processing (ICASSP). arXiv:cs.NA/1705.08286Google Scholar
- Guian Zhou and Jennie Si. 1999. Subset-based training and pruning of sigmoid neural networks. Neural Networks 12, 1 (1999), 79-89.Google Scholar
- Hao Zhou, Jose M Alvarez, and Fatih Porikli. 2016. Less is more: Towards compact CNNs. In European Conference on Computer Vision (ECCV).Google Scholar
- Hattie Zhou, Janice Lan, Rosanne Liu, and Jason Yosinski. 2019. Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/1905.01067Google Scholar
- X. Zhou, Z. Du, Q. Guo, S. Liu, C. Liu, C. Wang, X. Zhou, L. Li, T. Chen, and Y. Chen. 2018. Cambricon-S: Addressing Irregularity in Sparse Neural Networks through A Cooperative Software/Hardware Approach. In International Symposium on Microarchitecture (MICRO).Google Scholar
- Jingyang Zhu, Jingbo Jiang, Xizi Chen, and Chi-Ying Tsui. 2018. SparseNN: An Energy-Efficient Neural Network Accelerator Exploiting Input and Output Sparsity. In Design, Automation & Test in Europe Conference & Exhibition (DATE). arXiv:cs.LG/1711.01263Google Scholar
- Jingyang Zhu, Zhiliang Qian, and Chi-Ying Tsui. 2016. LRADNN: High-throughput and energy-efficient Deep Neural Network accelerator using Low Rank Approximation. In Asia and South Pacific Design Automation Conference (ASP-DAC).Google Scholar
- Michael Zhu and Suyog Gupta. 2017. To prune, or not to prune: exploring the efficacy of pruning for model compression. (2017). arXiv:stat.ML/1710.01878Google Scholar
- Tao Zhuang, Zhixuan Zhang, Yuheng Huang, Xiaoyi Zeng, Kai Shuang, and Xiang Li. 2020. Neuron-level Structured Pruning using Polarization Regularizer. In Advances in Neural Information Processing Systems (NeurIPS). https://proceedings.neurips.cc/paper/2020/hash/703957b6dd9e3a7980e040bee50ded65-Abstract.htmlGoogle Scholar
- Zhuangwei Zhuang, Mingkui Tan, Bohan Zhuang, Jing Liu, Yong Guo, Qingyao Wu, Junzhou Huang, and Jinhui Zhu. 2018. Discrimination-aware Channel Pruning for Deep Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.CV/1810.11809Google Scholar