
Sparsity in deep learning: pruning and growth for efficient inference and training in neural networks

Published: 01 January 2021

Abstract

The growing energy and performance costs of deep learning have driven the community to reduce the size of neural networks by selectively pruning components. Similar to their biological counterparts, sparse networks generalize just as well as, and sometimes even better than, the original dense networks. Sparsity promises to reduce the memory footprint of regular networks to fit mobile devices, as well as to shorten training time for ever-growing networks. In this paper, we survey prior work on sparsity in deep learning and provide an extensive tutorial on sparsification for both inference and training. We describe approaches to remove and add elements of neural networks, different training strategies to achieve model sparsity, and mechanisms to exploit sparsity in practice. Our work distills ideas from more than 300 research papers and provides guidance to practitioners who wish to utilize sparsity today, as well as to researchers whose goal is to push the frontier forward. We include the necessary background on mathematical methods in sparsification, describe phenomena such as early structure adaptation and the intricate relations between sparsity and the training process, and show techniques for achieving acceleration on real hardware. We also define a metric of pruned parameter efficiency that could serve as a baseline for comparison of different sparse networks. We close by speculating on how sparsity can improve future workloads and outline major open problems in the field.
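As a minimal illustration of the kind of sparsification the survey covers, the sketch below (our own, not taken from the paper) performs one-shot global magnitude pruning of a PyTorch model: the smallest-magnitude weights across all layers are zeroed out to reach a target sparsity. The function name, the 90% target, and the restriction to linear and convolutional layers are illustrative assumptions; in practice such pruning is typically followed by fine-tuning or applied gradually during training.

```python
import torch
import torch.nn as nn

def global_magnitude_prune(model: nn.Module, sparsity: float = 0.9) -> None:
    # Collect the weight tensors of all linear and convolutional layers.
    weights = [m.weight for m in model.modules()
               if isinstance(m, (nn.Linear, nn.Conv2d))]
    # Pool all magnitudes to pick a single global threshold.
    scores = torch.cat([w.detach().abs().flatten() for w in weights])
    k = max(int(sparsity * scores.numel()), 1)
    threshold = torch.kthvalue(scores, k).values
    with torch.no_grad():
        for w in weights:
            mask = (w.abs() > threshold).to(w.dtype)
            w.mul_(mask)  # zero out the smallest-magnitude weights

# Usage: prune a small MLP to roughly 90% unstructured sparsity.
model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))
global_magnitude_prune(model, sparsity=0.9)
```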

  243. Ekaterina Lobacheva, Nadezhda Chirkova, and Dmitry Vetrov. 2018. Bayesian sparsification of gated recurrent neural networks. In NeurIPS Workshop on Compact Deep Neural Networks with Industrial Applications. arXiv:cs.LG/1812.05692Google ScholarGoogle Scholar
  244. Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR). arXiv:1711.05101Google ScholarGoogle Scholar
  245. Christos Louizos, Karen Ullrich, and Max Welling. 2017. Bayesian Compression for Deep Learning. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:stat.ML/1705.08665Google ScholarGoogle Scholar
  246. Christos Louizos, Max Welling, and Diederik P. Kingma. 2018. Learning Sparse Neural Networks through L0 Regularization. In International Conference on Learning Representations (ICLR). arXiv:stat.ML/1712.01312Google ScholarGoogle Scholar
  247. Jian-Hao Luo and Jianxin Wu. 2019. AutoPruner: An End-to-End Trainable Filter Pruning Method for Efficient Deep Model Inference. Pattern Recognition 107 (2019), 107461. arXiv:cs.CV/1805.08941Google ScholarGoogle Scholar
  248. Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. 2017. ThiNet: A Filter Level Pruning Method for Deep Neural Network Compression. In International Conference on Computer Vision (ICCV). arXiv:cs.CV/1707.06342Google ScholarGoogle Scholar
  249. Alexander Ly, Maarten Marsman, Josine Verhagen, Raoul Grasman, and Eric-Jan Wagenmakers. 2017. A Tutorial on Fisher Information. Journal of Mathematical Psychology 80 (2017), 40-55. arXiv:math.ST/1705.01064Google ScholarGoogle Scholar
  250. Sangkug Lym, Esha Choukse, Siavash Zangeneh, Wei Wen, Sujay Sanghavi, and Mattan Erez. 2019. PruneTrain: Fast Neural Network Training by Dynamic Sparse Model Reconfiguration. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC). arXiv:cs.LG/1901.09290Google ScholarGoogle Scholar
  251. Divyam Madaan, Jinwoo Shin, and Sung Ju Hwang. 2020. Adversarial Neural Pruning with Latent Vulnerability Suppression. In International Conference on Machine Learning (ICML). arXiv:cs.LG/1908.04355Google ScholarGoogle Scholar
  252. Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. 2017. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. International Conference on Learning Representations (ICLR) (2017). arXiv:cs.LG/1611.00712Google ScholarGoogle Scholar
  253. Alireza Makhzani and Brendan Frey. 2015. Winner-Take-All Autoencoders. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/1409.2752Google ScholarGoogle Scholar
  254. Eran Malach, Gilad Yehudai, Shai Shalev-Shwartz, and Ohad Shamir. 2020. Proving the Lottery Ticket Hypothesis: Pruning is All You Need. In International Conference on Machine Learning (ICML). arXiv:cs.LG/2002.00585Google ScholarGoogle Scholar
  255. Chaitanya Malaviya, Pedro Ferreira, and André FT Martins. 2018. Sparse and constrained attention for neural machine translation. In Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (ACL). arXiv:cs.CL/1805.08241Google ScholarGoogle Scholar
  256. Arun Mallya and Svetlana Lazebnik. 2018. PackNet: Adding Multiple Tasks to a Single Network by Iterative Pruning. In Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:cs.CV/1711.05769Google ScholarGoogle Scholar
  257. Franco Manessi, Alessandro Rozza, Simone Bianco, Paolo Napoletano, and Raimondo Schettini. 2018. Automated Pruning for Deep Neural Network Compression. In International Conference on Pattern Recognition (ICPR). arXiv:cs.CV/1712.01721Google ScholarGoogle Scholar
  258. Huizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, and William J. Dally. 2017. Exploring the Regularity of Sparse Structure in Convolutional Neural Networks. (2017). arXiv:cs.LG/1705.08922Google ScholarGoogle Scholar
  259. Zelda Mariet and Suvrit Sra. 2016. Diversity Networks: Neural Network Compression Using Determinantal Point Processes. In International Conference on Learning Representations (ICLR). arXiv:cs.LG/1511.05077Google ScholarGoogle Scholar
  260. James Martens and Roger Grosse. 2015. Optimizing Neural Networks with Kronecker-factored Approximate Curvature. In International Conference on Machine Learning (ICML). arXiv:cs.LG/1503.05671Google ScholarGoogle Scholar
  261. Andre Martins and Ramon Astudillo. 2016. From softmax to sparsemax: A sparse model of attention and multi-label classification. In International Conference on Machine Learning (ICML). arXiv:cs.CL/1602.02068Google ScholarGoogle Scholar
  262. Peter Mattson, Christine Cheng, Cody Coleman, Greg Diamos, Paulius Micikevicius, David Patterson, Hanlin Tang, Gu-Yeon Wei, Peter Bailis, Victor Bittorf, David Brooks, Dehao Chen, Debojyoti Dutta, Udit Gupta, Kim Hazelwood, Andrew Hock, Xinyuan Huang, Atsushi Ike, Bill Jia, Daniel Kang, David Kanter, Naveen Kumar, Jeffery Liao, Guokai Ma, Deepak Narayanan, Tayo Oguntebi, Gennady Pekhimenko, Lillian Pentecost, Vijay Janapa Reddi, Taylor Robie, Tom St. John, Tsuguchika Tabaru, Carole-Jean Wu, Lingjie Xu, Masafumi Yamazaki, Cliff Young, and Matei Zaharia. 2020. MLPerf Training Benchmark. In Machine Learning and Systems (MLSys). arXiv:cs.LG/1910.01500Google ScholarGoogle Scholar
  263. Sam McCandlish, Jared Kaplan, Dario Amodei, and OpenAI Dota Team. 2018. An Empirical Model of Large-Batch Training. (2018). arXiv:cs.LG/1812.06162Google ScholarGoogle Scholar
  264. J. S. McCarley, Rishav Chakravarti, and Avirup Sil. 2020. Structured Pruning of a BERT-based Question Answering Model. (2020). arXiv:cs.CL/1910.06360Google ScholarGoogle Scholar
  265. Dushyant Mehta, Kwang In Kim, and Christian Theobalt. 2019. On implicit filter level sparsity in convolutional neural networks. In Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:cs.LG/1811.12495Google ScholarGoogle Scholar
  266. Rahul Mehta. 2019. Sparse Transfer Learning via Winning Lottery Tickets. In NeurIPS Workshop on Learning Transferable Skills. arXiv:cs.LG/1905.07785Google ScholarGoogle Scholar
  267. Fanxu Meng, Hao Cheng, Ke Li, Huixiang Luo, Xiaowei Guo, Guangming Lu, and Xing Sun. 2020. Pruning Filter in Filter. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.CV/2009.14410Google ScholarGoogle Scholar
  268. Hrushikesh Mhaskar and Tomaso Poggio. 2016. Deep vs. shallow networks : An approximation theory perspective. Analysis and Applications 14, 06 (2016), 829-848. arXiv:cs.LG/1608.03287Google ScholarGoogle Scholar
  269. Paul Michel, Omer Levy, and Graham Neubig. 2019. Are Sixteen Heads Really Better than One?. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.CL/1905.10650Google ScholarGoogle Scholar
  270. Beren Millidge, Alexander Tschantz, and Christopher L. Buckley. 2020. Predictive Coding Approximates Backprop along Arbitrary Computation Graphs. (2020). arXiv:cs.LG/2006.04182Google ScholarGoogle Scholar
  271. Asit K. Mishra, Eriko Nurvitadhi, Jeffrey J. Cook, and Debbie Marr. 2018. WRPN: Wide Reduced-Precision Networks. In International Conference on Learning Representations (ICLR). arXiv:cs.CV/1709.01134Google ScholarGoogle Scholar
  272. Deepak Mittal, Shweta Bhardwaj, Mitesh M. Khapra, and Balaraman Ravindran. 2018. Recovering from Random Pruning: On the Plasticity of Deep Convolutional Neural Networks. In Winter Conference on Applications of Computer Vision (WACV). arXiv:cs.CV/1801.10447Google ScholarGoogle Scholar
  273. Decebal Constantin Mocanu, Elena Mocanu, Phuong H. Nguyen, Madeleine Gibescu, and Antonio Liotta. 2016. A topological insight into restricted Boltzmann machines. Machine Learning 104, 2-3 (Jul 2016), 243270.Google ScholarGoogle Scholar
  274. Decebal Constantin Mocanu, Elena Mocanu, Peter Stone, Phuong H Nguyen, Madeleine Gibescu, and Antonio Liotta. 2018. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature communications 9, 1 (2018), 1-12. arXiv:cs.NE/1707.04780Google ScholarGoogle Scholar
  275. Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. 2017. Variational Dropout Sparsifies Deep Neural Networks. In International Conference on Machine Learning (ICML). arXiv:stat.ML/1701.05369Google ScholarGoogle Scholar
  276. Pavlo Molchanov, Arun Mallya, Stephen Tyree, Iuri Frosio, and Jan Kautz. 2019. Importance Estimation for Neural Network Pruning. In Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:cs.LG/1906.10771Google ScholarGoogle Scholar
  277. Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. 2017. Pruning Convolutional Neural Networks for Resource Efficient Inference. In International Conference on Learning Representations (ICLR). arXiv:cs.LG/1611.06440Google ScholarGoogle Scholar
  278. John E Moody. 1991. Note on generalization, regularization and architecture selection in nonlinear learning systems. In IEEE Workshop on Neural Networks for Signal Processing.Google ScholarGoogle Scholar
  279. Ari S. Morcos, Haonan Yu, Michela Paganini, and Yuandong Tian. 2019. One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:stat.ML/1906.02773Google ScholarGoogle Scholar
  280. Hesham Mostafa and Xin Wang. 2019. Parameter Efficient Training of Deep Convolutional Neural Networks by Dynamic Sparse Reparameterization. In International Conference on Machine Learning (ICML). arXiv:cs.LG/1902.05967Google ScholarGoogle Scholar
  281. Michael C Mozer and Paul Smolensky. 1988. Skeletonization: A technique for trimming the fat from a network via relevance assessment. Advances in Neural Information Processing Systems (NeurIPS) (1988). https://proceedings.neurips.cc/paper/1988/hash/07e1cd7dca89a1678042477183b7ac3f-Abstract.htmlGoogle ScholarGoogle Scholar
  282. Sayan Mukherjee, Partha Niyogi, Tomaso Poggio, and Ryan Rifkin. 2006. Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Advances in Computational Mathematics 25, 1-3 (2006), 161-193.Google ScholarGoogle Scholar
  283. Ben Mussay, Daniel Feldman, Samson Zhou, Vladimir Braverman, and Margarita Osadchy. 2020. Data-Independent Structured Pruning of Neural Networks via Coresets. In International Conference on Learning Representations (ICLR). arXiv:cs.LG/2008.08316Google ScholarGoogle Scholar
  284. Sharan Narang, Erich Elsen, Gregory Diamos, and Shubho Sengupta. 2017. Exploring Sparsity in Recurrent Neural Networks. In International Conference on Learning Representations (ICLR). arXiv:cs.LG/1704.05119Google ScholarGoogle Scholar
  285. Pramod L. Narasimha, Walter H. Delashmit, Michael T. Manry, Jiang Li, and Francisco Maldonado. 2008. An integrated growing-pruning method for feedforward network training. Neurocomputing 71, 13 (2008), 2831 - 2847.Google ScholarGoogle Scholar
  286. Kirill Neklyudov, Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. 2017. Structured Bayesian Pruning via Log-Normal Multiplicative Noise. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:stat.ML/1705.07283Google ScholarGoogle Scholar
  287. Behnam Neyshabur. 2020. Towards Learning Convolutions from Scratch. In Towards Learning Convolutions from Scratch. arXiv:cs.LG/2007.13657Google ScholarGoogle Scholar
  288. Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro. 2019. The Role of Over-Parametrization in Generalization of Neural Networks. In International Conference on Learning Representations (ICLR). arXiv:cs.LG/1805.12076Google ScholarGoogle Scholar
  289. Jiquan Ngiam, Zhenghao Chen, Daniel Chia, Pang W. Koh, Quoc V. Le, and Andrew Y. Ng. 2010. Tiled convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS). https://papers.nips.cc/paper/2010/hash/01f78be6f7cad02658508fe4616098a9-Abstract.htmlGoogle ScholarGoogle Scholar
  290. Vlad Niculae and Mathieu Blondel. 2017. A regularized framework for sparse and structured neural attention. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:stat.ML/1705.07704Google ScholarGoogle Scholar
  291. Nils J Nilsson. 2009. The quest for artificial intelligence: A history of ideas and achievements. Cambridge University Press.Google ScholarGoogle Scholar
  292. Yue Niu, Rajgopal Kannan, Ajitesh Srivastava, and Viktor Prasanna. 2020. Reuse Kernels or Activations? A Flexible Dataow for Low-Latency Spectral CNN Acceleration. In International Symposium on Field-Programmable Gate Arrays (FPGA).Google ScholarGoogle Scholar
  293. Yue Niu, Hanqing Zeng, Ajitesh Srivastava, Kartik Lakhotia, Rajgopal Kannan, Yanzhi Wang, and Viktor Prasanna. 2019. SPEC2: SPECtral SParsE CNN Accelerator on FPGAs. (2019). arXiv:cs.CV/1910.11103Google ScholarGoogle Scholar
  294. Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. 2015. Learning Deconvolution Network for Semantic Segmentation. In International Conference on Computer Vision (ICCV). arXiv:cs.CV/1505.04366Google ScholarGoogle Scholar
  295. Steven J Nowlan and Geoffrey E Hinton. 1992. Simplifying neural networks by soft weight-sharing. Neural Computation 4, 4 (1992), 473-493.Google ScholarGoogle Scholar
  296. Nvidia. 2020. NVIDIA A100 Tensor Core GPU Architecture. (2020).Google ScholarGoogle Scholar
  297. Bruno A Olshausen and David J Field. 1996. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381, 6583 (1996), 607-609.Google ScholarGoogle Scholar
  298. Laurent Orseau, Marcus Hutter, and Omar Rivasplata. 2020. Logarithmic Pruning is All You Need. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/2006.12156Google ScholarGoogle Scholar
  299. Kazuki Osawa, Yohei Tsuji, Yuichiro Ueno, Akira Naruse, Rio Yokota, and Satoshi Matsuoka. 2019. Large-Scale Distributed Second-Order Optimization Using Kronecker-Factored Approximate Curvature for Deep Convolutional Neural Networks. In Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:cs.LG/1811.12019Google ScholarGoogle Scholar
  300. Wei Pan, Hao Dong, and Yike Guo. 2016. DropNeuron: Simplifying the Structure of Deep Neural Networks. (2016). arXiv:cs.CV/1606.07326Google ScholarGoogle Scholar
  301. Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel Emer, Stephen W. Keckler, and William J. Dally. 2017. SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks. ACM SIGARCH Computer Architecture News 45, 2 (2017), 27-40. arXiv:cs.NE/1708.04485Google ScholarGoogle Scholar
  302. Jongsoo Park, Sheng Li, Wei Wen, Ping Tak Peter Tang, Hai Li, Yiran Chen, and Pradeep Dubey. 2017. Faster CNNs with Direct Sparse Convolutions and Guided Pruning. In International Conference on Learning Representations (ICLR). arXiv:cs.CV/1608.01409Google ScholarGoogle Scholar
  303. Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. 2018. Image Transformer. In International Conference on Machine Learning (ICML). arXiv:cs.CV/1802.05751Google ScholarGoogle Scholar
  304. Morten Pedersen, Lars Hansen, and Jan Larsen. 1995. Pruning with generalization based weight saliencies: λOBD, λOBS. In Advances in Neural Information Processing Systems (NeurIPS). https://proceedings.neurips.cc/paper/1995/hash/3473decccb0509fb264818a7512a8b9b-Abstract.htmlGoogle ScholarGoogle Scholar
  305. Ankit Pensia, Shashank Rajput, Alliot Nagle, Harit Vishwakarma, and Dimitris Papailiopoulos. 2020. Optimal Lottery Tickets via SubsetSum: Logarithmic Over-Parameterization is Sufficient. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/2006.07990Google ScholarGoogle Scholar
  306. Bryan A. Plummer, Nikoli Dryden, Julius Frost, Torsten Hoeer, and Kate Saenko. 2020. Neural Parameter Allocation Search. (2020). arXiv:cs.LG/2006.10598Google ScholarGoogle Scholar
  307. Adam Polyak and Lior Wolf. 2015. Channel-level acceleration of deep face representations. IEEE Access 3 (2015), 2163-2175.Google ScholarGoogle Scholar
  308. Udo W. Pooch and Al Nieder. 1973. A Survey of Indexing Techniques for Sparse Matrices. ACM Comput. Surv. 5, 2 (June 1973), 109-133.Google ScholarGoogle Scholar
  309. Ameya Prabhu, Girish Varma, and Anoop Namboodiri. 2018. Deep Expander Networks: Efficient Deep Networks from Graph Theory. In European Conference on Computer Vision (ECCV). arXiv:cs.CV/1711.08757Google ScholarGoogle Scholar
  310. Sai Prasanna, Anna Rogers, and Anna Rumshisky. 2020. When BERT Plays the Lottery, All Tickets Are Winning. In Conference on Empirical Methods in Natural Language Processing (EMNLP). arXiv:cs.CL/2005.00561Google ScholarGoogle Scholar
  311. Lutz Prechelt. 1997. Connection pruning with static and adaptive pruning schedules. Neurocomputing 16, 1 (1997), 49 - 61.Google ScholarGoogle Scholar
  312. Eric Qin, Ananda Samajdar, Hyoukjun Kwon, Vineet Nadella, Sudarshan Srinivasan, Dipankar Das, Bharat Kaul, and Tushar Krishna. 2020. SIGMA: A Sparse and Irregular GEMM Accelerator with Flexible Interconnects for DNN Training. In International Symposium on High Performance Computer Architecture (HPCA).Google ScholarGoogle Scholar
  313. Md Aamir Raihan and Tor M. Aamodt. 2020. Sparse Weight Activation Training. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/2001.01969Google ScholarGoogle Scholar
  314. Adnan Siraj Rakin, Zhezhi He, Li Yang, Yanzhi Wang, Liqiang Wang, and Deliang Fan. 2020. Robust Sparse Regularization: Defending Adversarial Attacks Via Regularized Sparse Network. In Great Lakes Symposium on VLSI (GLSVLSI).Google ScholarGoogle Scholar
  315. Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, and Mohammad Rastegari. 2020. What's Hidden in a Randomly Weighted Neural Network?. In Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:cs.CV/1911.13299Google ScholarGoogle Scholar
  316. Carl Edward Rasmussen and Zoubin Ghahramani. 2000. Occam's razor. In Advances in Neural Information Processing Systems (NeurIPS). https://papers.nips.cc/paper/2000/hash/0950ca92a4dcf426067cfd2246bb5ff3-Abstract.htmlGoogle ScholarGoogle Scholar
  317. Brandon Reagen, Paul Whatmough, Robert Adolf, Saketh Rama, Hyunkwang Lee, Sae Kyu Lee, José Miguel Hernández-Lobato, Gu-Yeon Wei, and David Brooks. 2016. Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators. In International Symposium on Computer Architecture (ISCA).Google ScholarGoogle Scholar
  318. Russell Reed. 1993. Pruning algorithms-a survey. IEEE Transactions on Neural Networks 4, 5 (1993), 740-747.Google ScholarGoogle Scholar
  319. Alex Renda, Jonathan Frankle, and Michael Carbin. 2020. Comparing Rewinding and Fine-tuning in Neural Network Pruning. In International Conference on Learning Representations (ICLR). arXiv:cs.LG/2003.02389Google ScholarGoogle Scholar
  320. Cèdric Renggli, Saleh Ashkboos, Mehdi Aghagolzadeh, Dan Alistarh, and Torsten Hoeer. 2019. SparCML: High-performance sparse communication for machine learning. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC). arXiv:cs.DC/1802.08021Google ScholarGoogle Scholar
  321. Albert Reuther, Peter Michaleas, Michael Jones, Vijay Gadepally, Siddharth Samsi, and Jeremy Kepner. 2020. Survey of Machine Learning Accelerators. In IEEE High Performance Extreme Computing Conference (HPEC). arXiv:cs.DC/2009.00993Google ScholarGoogle Scholar
  322. Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic backpropagation and variational inference in deep generative models. In International Conference on Machine Learning (ICML).Google ScholarGoogle Scholar
  323. Minsoo Rhu, Mike O'Connor, Niladrish Chatterjee, Jeff Pool, Youngeun Kwon, and Stephen W Keckler. 2018. Compressing DMA engine: Leveraging activation sparsity for training deep neural networks. In International Symposium on High Performance Computer Architecture (HPCA). arXiv:cs.LG/1705.01626Google ScholarGoogle Scholar
  324. Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2021. A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics 8 (2021), 842-866. arXiv:cs.CL/2002.12327Google ScholarGoogle Scholar
  325. Clemens Rosenbaum, Tim Klinger, and Matthew Riemer. 2017. Routing Networks: Adaptive Selection of Non-linear Functions for Multi-Task Learning. (2017). arXiv:cs.LG/1711.01239Google ScholarGoogle Scholar
  326. Stuart Russell and Peter Norvig. 2020. Artificial Intelligence: A Modern Approach (4th ed.). Prentice Hall Press.Google ScholarGoogle Scholar
  327. Tara N. Sainath, Brian Kingsbury, Vikas Sindhwani, Ebru Arisoy, and Bhuvana Ramabhadran. 2013. Low-rank matrix factorization for Deep Neural Network training with high-dimensional output targets. In International Conference on Acoustics, Speech and Signal Processing (ICASSP).Google ScholarGoogle Scholar
  328. Victor Sanh, Thomas Wolf, and Alexander M. Rush. 2020. Movement Pruning: Adaptive Sparsity by Fine-Tuning. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.CL/2005.07683Google ScholarGoogle Scholar
  329. Pedro Savarese, Hugo Silva, and Michael Maire. 2020. Winning the Lottery with Continuous Sparsification. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/1912.04427Google ScholarGoogle Scholar
  330. Simone Scardapane, Danilo Comminiello, Amir Hussain, and Aurelio Uncini. 2017. Group sparse regularization for deep neural networks. Neurocomputing 241 (2017), 81 - 89. arXiv:stat.ML/1607.00485Google ScholarGoogle Scholar
  331. Paul Scheffler, Florian Zaruba, Fabian Schuiki, Torsten Hoeer, and Luca Benini. 2020. Indirection Stream Semantic Register Architecture for Efficient Sparse-Dense Linear Algebra. (2020). arXiv:cs.AR/2011.08070Google ScholarGoogle Scholar
  332. Abigail See, Minh-Thang Luong, and Christopher D. Manning. 2016. Compression of Neural Machine Translation Models via Pruning. In SIGNLL Conference on Computational Natural Language Learning. arXiv:cs.AI/1606.09274Google ScholarGoogle Scholar
  333. Vikash Sehwag, Shiqi Wang, Prateek Mittal, and Suman Jana. 2020. HYDRA: Pruning Adversarially Robust Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.CV/2002.10509Google ScholarGoogle Scholar
  334. Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 2014. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association.Google ScholarGoogle Scholar
  335. Aditya Sharma, Nikolas Wolfe, and Bhiksha Raj. 2017. The Incredible Shrinking Neural Network: New Perspectives on Learning Representations Through The Lens of Pruning. (2017). arXiv:cs.NE/1701.04465Google ScholarGoogle Scholar
  336. Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. In International Conference on Learning Representations (ICLR). arXiv:cs.LG/1701.06538Google ScholarGoogle Scholar
  337. Shaohuai Shi, Qiang Wang, Kaiyong Zhao, Zhenheng Tang, Yuxin Wang, Xiang Huang, and Xiaowen Chu. 2019a. A distributed synchronous SGD algorithm with global Top-k sparsification for low bandwidth networks. In International Conference on Distributed Computing Systems Workshop on Networks. arXiv:cs.DC/1901.04359Google ScholarGoogle Scholar
  338. Shaohuai Shi, Kaiyong Zhao, Qiang Wang, Zhenheng Tang, and Xiaowen Chu. 2019b. A Convergence Analysis of Distributed SGD with Communication-Efficient Gradient Sparsification. In International Joint Conference on Artificial Intelligence.Google ScholarGoogle Scholar
  339. Reza Shokri and Vitaly Shmatikov. 2015. Privacy-preserving deep learning. In ACM SIGSAC Conference on Computer and Communications Security.Google ScholarGoogle Scholar
  340. Ravid Shwartz-Ziv and Naftali Tishby. 2017. Opening the Black Box of Deep Neural Networks via Information. (2017). arXiv:cs.LG/1703.00810Google ScholarGoogle Scholar
  341. Jocelyn Sietsma and Robert JF Dow. 1991. Creating artificial neural networks that generalize. Neural Networks 4, 1 (1991), 67-79.Google ScholarGoogle Scholar
  342. Jocelyn Sietsma and Robert J. F. Dow. 1988. Neural net pruning-why and how. In International Conference on Neural Networks.Google ScholarGoogle Scholar
  343. Laurent Sifre and Stéphane Mallat. 2014. Rigid-motion scattering for image classification. Ph.D. Dissertation. Ecole Polytechnique, CMAP.Google ScholarGoogle Scholar
  344. Sidak Pal Singh and Dan Alistarh. 2020. WoodFisher: Efficient Second-Order Approximation for Neural Network Compression. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/2004.14340Google ScholarGoogle Scholar
  345. Samarth Sinha, Zhengli Zhao, Anirudh Goyal, Colin A Raffel, and Augustus Odena. 2020. Top-k Training of GANs: Improving GAN Performance by Throwing Away Bad Samples. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:stat.ML/2002.06224Google ScholarGoogle Scholar
  346. Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V. Le. 2018. Don't Decay the Learning Rate, Increase the Batch Size. In International Conference on Learning Representations (ICLR). arXiv:cs.LG/1711.00489Google ScholarGoogle Scholar
  347. Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Conference on Empirical Methods in Natural Language Processing (EMNLP).Google ScholarGoogle Scholar
  348. Suraj Srinivas and R. Venkatesh Babu. 2015. Data-free parameter pruning for Deep Neural Networks. In British Machine Vision Conference (BMVC). arXiv:cs.CV/1507.06149Google ScholarGoogle Scholar
  349. Suraj Srinivas and R. Venkatesh Babu. 2016. Learning Neural Network Architectures using Backpropagation. In British Machine Vision Conference (BMVC). arXiv:cs.LG/1511.05497Google ScholarGoogle Scholar
  350. Suraj Srinivas, Akshayvarun Subramanya, and R. Venkatesh Babu. 2016. Training Sparse Neural Networks. In Conference on Computer Vision and Pattern Recognition Workshops. arXiv:cs.CV/1611.06694Google ScholarGoogle Scholar
  351. Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research 15, 56 (2014), 1929-1958. https://jmlr.org/papers/v15/srivastava14a.htmlGoogle ScholarGoogle Scholar
  352. Sebastian U Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. 2018. Sparsified SGD with memory. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/1809.07599Google ScholarGoogle Scholar
  353. Nikko Ström. 1997. Sparse connection and pruning in large dynamic artificial neural networks. In Fifth European Conference on Speech Communication and Technology.Google ScholarGoogle Scholar
  354. Nikko Strom. 2015. Scalable distributed DNN training using commodity GPU cloud computing. In Sixteenth Annual Conference of the International Speech Communication Association.Google ScholarGoogle Scholar
  355. Jingtong Su, Yihang Chen, Tianle Cai, Tianhao Wu, Ruiqi Gao, Liwei Wang, and Jason D. Lee. 2020. Sanity-Checking Pruning Methods: Random Tickets can Win the Jackpot. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/2009.11094Google ScholarGoogle Scholar
  356. Xavier Suau, Luca Zappella, and Nicholas Apostoloff. 2019. Filter Distillation for Network Compression. In Winter Conference on Applications of Computer Vision (WACV). arXiv:cs.CV/1807.10585Google ScholarGoogle Scholar
  357. Haobo Sun, Yingxia Shao, Jiawei Jiang, Bin Cui, Kai Lei, Yu Xu, and Jiang Wang. 2019. Sparse gradient compression for distributed SGD. In International Conference on Database Systems for Advanced Applications. 139-155.Google ScholarGoogle Scholar
  358. Xu Sun, Xuancheng Ren, Shuming Ma, and Houfeng Wang. 2017. meProp: Sparsified back propagation for accelerated deep learning with reduced overfitting. In International Conference on Machine Learning (ICML). arXiv:cs.LG/1706.06197Google ScholarGoogle Scholar
  359. Yi Sun, Xiaogang Wang, and Xiaoou Tang. 2015. Sparsifying Neural Network Connections for Face Recognition. In Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:cs.CV/1512.01891Google ScholarGoogle Scholar
  360. Ananda Theertha Suresh, X Yu Felix, Sanjiv Kumar, and H Brendan McMahan. 2017. Distributed mean estimation with limited communication. In International Conference on Machine Learning (ICML). arXiv:cs.LG/1611.00429Google ScholarGoogle Scholar
  361. Kenji Suzuki, Isao Horiba, and Noboru Sugie. 2001. A simple neural network pruning algorithm with application to filter synthesis. In Neural Processing Letters. 43-53.Google ScholarGoogle Scholar
  362. Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S. Emer. 2017. Efficient Processing of Deep Neural Networks: A Tutorial and Survey. Proc. IEEE 105, 12 (2017), 2295-2329. arXiv:cs.CV/1703.09039Google ScholarGoogle Scholar
  363. Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going Deeper with Convolutions. In Computer Vision and Pattern Recognition (CVPR). arXiv:cs.CV/1409.4842Google ScholarGoogle Scholar
  364. Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2016. Rethinking the Inception Architecture for Computer Vision. In Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:cs.CV/1512.00567Google ScholarGoogle Scholar
  365. S. Tamura, M. Tateishi, M. Matumoto, and S. Akita. 1993. Determination of the number of redundant hidden units in a three-layered feedforward neural network. In International Conference on Neural Networks.Google ScholarGoogle Scholar
  366. Chong Min John Tan and Mehul Motani. 2020. DropNet: Reducing Neural Network Complexity via Iterative Pruning. In International Conference on Machine Learning (ICML). http://proceedings.mlr.press/v119/tan20a.htmlGoogle ScholarGoogle Scholar
  367. Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V. Le. 2019. MnasNet: Platform-Aware Neural Architecture Search for Mobile. In Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:cs.CV/1807.11626Google ScholarGoogle Scholar
  368. Mingxing Tan and Quoc V. Le. 2020. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In International Conference on Machine Learning (ICML). arXiv:cs.LG/1905.11946Google ScholarGoogle Scholar
  369. Hidenori Tanaka, Daniel Kunin, Daniel L. K. Yamins, and Surya Ganguli. 2020. Pruning neural networks without any data by iteratively conserving synaptic ow. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/2006.05467Google ScholarGoogle Scholar
  370. Hanlin Tang, Chen Yu, Xiangru Lian, Tong Zhang, and Ji Liu. 2019. DoubleSqueeze: Parallel stochastic gradient descent with double-pass error-compensated compression. In International Conference on Machine Learning (ICML). arXiv:cs.DC/1905.05957Google ScholarGoogle Scholar
  371. Yehui Tang, Yunhe Wang, Yixing Xu, Dacheng Tao, Chunjing Xu, Chao Xu, and Chang Xu. 2020b. SCOP: Scientific Control for Reliable Neural Network Pruning. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.CV/2010.10732Google ScholarGoogle Scholar
  372. Zhenheng Tang, Shaohuai Shi, Xiaowen Chu, Wei Wang, and Bo Li. 2020a. Communication-efficient distributed deep learning: A comprehensive survey. (2020). arXiv:cs.DC/2003.06307Google ScholarGoogle Scholar
  373. Enzo Tartaglione, Skjalg Lepsøy, Attilio Fiandrotti, and Gianluca Francini. 2018. Learning Sparse Neural Networks via Sensitivity-Driven Regularization. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/1810.11764Google ScholarGoogle Scholar
  374. Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. 2021. Long Range Arena: A Benchmark for Efficient Transformers. In International Conference on Learning Representations (ICLR). arXiv:cs.LG/2011.04006Google ScholarGoogle Scholar
  375. Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2020. Efficient transformers: A survey. (2020). arXiv:cs.LG/2009.06732Google ScholarGoogle Scholar
  376. Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Annual Meeting of the Association for Computational Linguistics (ACL). arXiv:cs.CL/1905.05950Google ScholarGoogle Scholar
  377. Lucas Theis, Iryna Korshunova, Alykhan Tejani, and Ferenc Huszár. 2018. Faster gaze prediction with dense networks and Fisher pruning. (2018). arXiv:cs.CV/1801.05787Google ScholarGoogle Scholar
  378. Georg Thimm and Emile Fiesler. 1995. Evaluating Pruning Methods. In Proceedings of the International Symposium on Artificial Neural Networks.Google ScholarGoogle Scholar
  379. Robert Tibshirani. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58, 1 (1996), 267-288.Google ScholarGoogle Scholar
  380. Michael E Tipping. 2001. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research 1, Jun (2001), 211-244.Google ScholarGoogle Scholar
  381. Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, and Christopher Bregler. 2015. Efficient Object Localization Using Convolutional Networks. In Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:cs.CV/1411.4280Google ScholarGoogle Scholar
  382. Yusuke Tsuzuku, Hiroto Imachi, and Takuya Akiba. 2018. Variance-based gradient compression for efficient distributed deep learning. In International Conference on Learning Representations Workshops. arXiv:cs.LG/1802.06058Google ScholarGoogle Scholar
  383. Karen Ullrich, Edward Meeds, and Max Welling. 2017. Soft Weight-Sharing for Neural Network Compression. In International Conference on Learning Representations (ICLR). arXiv:stat.ML/1702.04008Google ScholarGoogle Scholar
  384. Didem Unat, Anshu Dubey, Torsten Hoeer, John Shalf, Mark Abraham, Mauro Bianco, Bradford L. Chamberlain, Romain Cledat, H. Carter Edwards, Hal Finkel, Karl Fuerlinger, Frank Hannig, Emmanuel Jeannot, Amir Kamil, Jeff Keasler, Paul H J Kelly, Vitus Leung, Hatem Ltaief, Naoya Maruyama, Chris J. Newburn, and Miquel Pericas. 2017. Trends in Data Locality Abstractions for HPC Systems. IEEE Transactions on Parallel and Distributed Systems (TPDS) 28, 10 (Oct. 2017).Google ScholarGoogle Scholar
  385. Mart van Baalen, Christos Louizos, Markus Nagel, Rana Ali Amjad, Ying Wang, Tijmen Blankevoort, and Max Welling. 2020. Bayesian Bits: Unifying Quantization and Pruning. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/2005.07093Google ScholarGoogle Scholar
  386. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.CL/1706.03762Google ScholarGoogle Scholar
  387. Stijn Verdenius, Maarten Stol, and Patrick Forré. 2020. Pruning via Iterative Ranking of Sensitivity Statistics. (2020). arXiv:cs.LG/2006.00896Google ScholarGoogle Scholar
  388. Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. In Annual Meeting of the Association for Computational Linguistics (ACL). arXiv:cs.CL/1905.09418Google ScholarGoogle Scholar
  389. Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. 2013. Regularization of Neural Networks using DropConnect. In International Conference on Machine Learning (ICML). http://proceedings.mlr.press/v28/wan13.htmlGoogle ScholarGoogle Scholar
  390. Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations (ICLR). arXiv:cs.CL/1804.07461Google ScholarGoogle Scholar
  391. Chaoqi Wang, Roger Grosse, Sanja Fidler, and Guodong Zhang. 2019. EigenDamage: Structured pruning in the kronecker-factored eigenbasis. In International Conference on Machine Learning (ICML). arXiv:cs.LG/1905.05934Google ScholarGoogle Scholar
  392. Hongyi Wang, Scott Sievert, Shengchao Liu, Zachary Charles, Dimitris Papailiopoulos, and Stephen Wright. 2018. ATOMO: Communication-efficient learning via atomic sparsification. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:stat.ML/1806.04090Google ScholarGoogle Scholar
  393. Linnan Wang, Wei Wu, Junyu Zhang, Hang Liu, George Bosilca, Maurice Herlihy, and Rodrigo Fonseca. 2020b. FFT-based Gradient Sparsification for the Distributed Training of Deep Neural Networks. In International Symposium on High-Performance Parallel and Distributed Computing (HPDC).Google ScholarGoogle Scholar
  394. Ziheng Wang, Jeremy Wohlwend, and Tao Lei. 2020a. Structured pruning of large language models. In Conference on Empirical Methods in Natural Language Processing (EMNLP). arXiv:cs.CL/1910.04732Google ScholarGoogle Scholar
  395. Jianqiao Wangni, Jialei Wang, Ji Liu, and Tong Zhang. 2018. Gradient sparsification for communication-efficient distributed optimization. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/1710.09854Google ScholarGoogle Scholar
  396. Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. 2019. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics 7 (2019), 625-641. arXiv:cs.CL/1805.12471Google ScholarGoogle Scholar
  397. Bingzhen Wei, Xu Sun, Xuancheng Ren, and Jingjing Xu. 2017. Minimal Effort Back Propagation for Convolutional Neural Networks. (2017). arXiv:cs.LG/1709.05804Google ScholarGoogle Scholar
  398. Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. 2016. Learning Structured Sparsity in Deep Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.NE/1608.03665Google ScholarGoogle Scholar
  399. David White and Panos A. Ligomenides. 1993. GANNet: A Genetic Algorithm for Optimizing Topology and Weights in Neural Network Design. In Proceedings of the International Workshop on Artificial Neural Networks: New Trends in Neural Computation.Google ScholarGoogle Scholar
  400. D. Whitley and C. Bogart. 1990. The Evolution of Connectivity: Pruning Neural Networks Using Genetic Algorithms. In International Joint Conference on Neural Networks (IJCNN).Google ScholarGoogle Scholar
  401. Adina Williams, Nikita Nangia, and Samuel R Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL). arXiv:cs.CL/1704.05426Google ScholarGoogle Scholar
  402. Peter M. Williams. 1995. Bayesian Regularization and Pruning Using a Laplace Prior. Neural Computation 7, 1 (1995), 117-143.Google ScholarGoogle Scholar
  403. Mitchell Wortsman, Ali Farhadi, and Mohammad Rastegari. 2019. Discovering Neural Wirings. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/1906.00586Google ScholarGoogle Scholar
  404. Mitchell Wortsman, Vivek Ramanujan, Rosanne Liu, Aniruddha Kembhavi, Mohammad Rastegari, Jason Yosinski, and Ali Farhadi. 2020. Supermasks in Superposition. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/2006.14769Google ScholarGoogle Scholar
  405. Yuhuai Wu, Elman Mansimov, Roger B. Grosse, Shun Liao, and Jimmy Ba. 2017. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Advances in Neural Information Processing Systems (NeurIPS). 5285-5294. arXiv:cs.LG/1708.05144Google ScholarGoogle Scholar
  406. Xia Xiao, Zigeng Wang, and Sanguthevar Rajasekaran. 2019. AutoPrune: Automatic Network Pruning by Regularizing Auxiliary Parameters. In Advances in Neural Information Processing Systems (NeurIPS). https://papers.nips.cc/paper/2019/hash/4efc9e02abdab6b6166251918570a307-Abstract.htmlGoogle ScholarGoogle Scholar
  407. Jinhua Xu and Daniel WC Ho. 2006. A new training and pruning algorithm based on node dependence and Jacobian rank deficiency. Neurocomputing 70, 1-3 (2006), 544-558.Google ScholarGoogle Scholar
  408. Atsushi Yaguchi, Taiji Suzuki, Wataru Asano, Shuhei Nitta, Yukinobu Sakata, and Akiyuki Tanizawa. 2018. Adam induces implicit weight sparsity in rectifier neural networks. In International Conference on Machine Learning and Applications (ICMLA). arXiv:cs.LG/1812.08119Google ScholarGoogle Scholar
  409. Dingqing Yang, Amin Ghasemazar, Xiaowei Ren, Maximilian Golub, Guy Lemieux, and Mieszko Lis. 2020a. Procrustes: a Dataow and Accelerator for Sparse Deep Neural Network Training. In International Symposium on Microarchitecture (MICRO). arXiv:cs.NE/2009.10976Google ScholarGoogle Scholar
  410. Huanrui Yang, Wei Wen, and Hai Li. 2020b. DeepHoyer: Learning Sparser Neural Network with Differentiable Scale-Invariant Sparsity Measures. In International Conference on Learning Representations (ICLR). arXiv:cs.LG/1908.09979Google ScholarGoogle Scholar
  411. Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze. 2017. Designing Energy-Efficient Convolutional Neural Networks using Energy-Aware Pruning. In Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:cs.CV/1611.05128Google ScholarGoogle Scholar
  412. Jianbo Ye, Xin Lu, Zhe Lin, and James Z Wang. 2018. Rethinking the smaller-normless-informative assumption in channel pruning of convolution layers. In International Conference on Learning Representations (ICLR). arXiv:cs.LG/1802.00124Google ScholarGoogle Scholar
  413. Mao Ye, Chengyue Gong, Lizhen Nie, Denny Zhou, Adam Klivans, and Qiang Liu. 2020. Good Subnetworks Provably Exist: Pruning via Greedy Forward Selection. In International Conference on Machine Learning (ICML). arXiv:cs.LG/2003.01794Google ScholarGoogle Scholar
  414. Shaokai Ye, Kaidi Xu, Sijia Liu, Hao Cheng, Jan-Henrik Lambrechts, Huan Zhang, Aojun Zhou, Kaisheng Ma, Yanzhi Wang, and Xue Lin. 2019. Adversarial Robustness vs. Model Compression, or Both?. In International Conference on Computer Vision (ICCV). arXiv:cs.CV/1903.12561Google ScholarGoogle Scholar
  415. Penghang Yin, Jiancheng Lyu, Shuai Zhang, Stanley Osher, Yingyong Qi, and Jack Xin. 2019. Understanding Straight-Through Estimator in Training Activation Quantized Neural Nets. In International Conference on Learning Representations (ICLR). arXiv:cs.LG/1903.05662Google ScholarGoogle Scholar
  416. Haoran You, Chaojian Li, Pengfei Xu, Yonggan Fu, Yue Wang, Xiaohan Chen, Richard G. Baraniuk, Zhangyang Wang, and Yingyan Lin. 2020. Drawing early-bird tickets: Towards more efficient training of deep networks. In International Conference on Learning Representations (ICLR). arXiv:cs.LG/1909.11957Google ScholarGoogle Scholar
  417. Zhonghui You, Kun Yan, Jinmian Ye, Meng Ma, and Ping Wang. 2019. Gate Decorator: Global Filter Pruning Method for Accelerating Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.CV/1909.08174Google ScholarGoogle Scholar
  418. Dong Yu, Frank Seide, Gang Li, and Li Deng. 2012. Exploiting sparseness in deep neural networks for large vocabulary speech recognition. In International Conference on Acoustics, Speech and Signal Processing (ICASSP).Google ScholarGoogle Scholar
  419. Jiecao Yu, Andrew Lukefahr, David Palframan, Ganesh Dasika, Reetuparna Das, and Scott Mahlke. 2017. Scalpel: Customizing DNN pruning to the underlying hardware parallelism. ACM SIGARCH Computer Architecture News 45, 2 (2017), 548-560.Google ScholarGoogle Scholar
  420. Ruichi Yu, Ang Li, Chun-Fu Chen, Jui-Hsin Lai, Vlad I. Morariu, Xintong Han, Mingfei Gao, Ching-Yung Lin, and Larry S. Davis. 2018. NISP: Pruning Networks using Neuron Importance Score Propagation. In Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:cs.CV/1711.05908Google ScholarGoogle Scholar
  421. Xin Yu, Zhiding Yu, and Srikumar Ramalingam. 2018. Learning strict identity mappings in deep residual networks. In Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:cs.CV/1804.01661Google ScholarGoogle Scholar
  422. Ming Yuan and Yi Lin. 2006. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68, 1 (2006), 49-67.Google ScholarGoogle Scholar
  423. Chulhee Yun, Yin-Wen Chang, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, and Sanjiv Kumar. 2020. O(n) Connections are Expressive Enough: Universal Approximability of Sparse Transformers. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/2006.04862Google ScholarGoogle Scholar
  424. Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. 2020. Big bird: Transformers for longer sequences. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/2007.14062Google ScholarGoogle Scholar
  425. Wenyuan Zeng and Raquel Urtasun. 2019. MLPrune: Multi-Layer Pruning for Automated Neural Network Compression. (2019). https://openreview.net/forum?id=r1g5b2RcKmGoogle ScholarGoogle Scholar
  426. Xiaoqin Zeng and Daniel S Yeung. 2006. Hidden neuron pruning of multilayer perceptrons using a quantified sensitivity measure. Neurocomputing 69, 7-9 (2006), 825-837.Google ScholarGoogle Scholar
  427. Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. 2017. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations (ICLR). arXiv:cs.LG/1611.03530Google ScholarGoogle Scholar
  428. Jiaqi Zhang, Xiangru Chen, Mingcong Song, and Tao Li. 2019. Eager Pruning: Algorithm and Architecture Support for Fast Training of Deep Neural Networks. In International Symposium on Computer Architecture (ISCA).Google ScholarGoogle Scholar
  429. Jie-Fang Zhang, Ching-En Lee, C. Liu, Y. Shao, Stephen W. Keckler, and Zhengya Zhang. 2019a. SNAP: A 1.67 21.55TOPS/W Sparse Neural Acceleration Processor for Unstructured Sparse Deep Neural Network Inference in 16nm CMOS. In Symposium on VLSI Circuits.Google ScholarGoogle Scholar
  430. Jeff (Jun) Zhang, Parul Raj, Shuayb Zarar, Amol Ambardekar, and Siddharth Garg. 2019b. CompAct: On-Chip ComPression of ActIvations for Low Power Systolic Array Based CNN Acceleration. ACM Trans. Embed. Comput. Syst. 18, 5s, Article 47 (Oct. 2019), 24 pages.Google ScholarGoogle Scholar
  431. Shijin Zhang, Zidong Du, Lei Zhang, Huiying Lan, Shaoi Liu, Ling Li, Qi Guo, Tianshi Chen, and Yunji Chen. 2016. Cambricon-X: An accelerator for sparse neural networks. In International Symposium on Microarchitecture (MICRO).Google ScholarGoogle Scholar
  432. Zhekai Zhang, Hanrui Wang, Song Han, and William J. Dally. 2020. SpArch: Efficient Architecture for Sparse Matrix Multiplication. In International Symposium on High Performance Computer Architecture (HPCA). arXiv:cs.AR/2002.08947Google ScholarGoogle Scholar
  433. Guangxiang Zhao, Junyang Lin, Zhiyuan Zhang, Xuancheng Ren, Qi Su, and Xu Sun. 2019a. Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection. (2019). arXiv:cs.CL/1912.11637Google ScholarGoogle Scholar
  434. Qibin Zhao, Masashi Sugiyama, Longhao Yuan, and Andrzej Cichocki. 2019b. Learning Efficient Tensor Representations with Ring Structure Networks. In International Conference on Acoustics, Speech and Signal Processing (ICASSP). arXiv:cs.NA/1705.08286Google ScholarGoogle Scholar
  435. Guian Zhou and Jennie Si. 1999. Subset-based training and pruning of sigmoid neural networks. Neural Networks 12, 1 (1999), 79-89.Google ScholarGoogle Scholar
  436. Hao Zhou, Jose M Alvarez, and Fatih Porikli. 2016. Less is more: Towards compact CNNs. In European Conference on Computer Vision (ECCV).Google ScholarGoogle Scholar
  437. Hattie Zhou, Janice Lan, Rosanne Liu, and Jason Yosinski. 2019. Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.LG/1905.01067Google ScholarGoogle Scholar
  438. X. Zhou, Z. Du, Q. Guo, S. Liu, C. Liu, C. Wang, X. Zhou, L. Li, T. Chen, and Y. Chen. 2018. Cambricon-S: Addressing Irregularity in Sparse Neural Networks through A Cooperative Software/Hardware Approach. In International Symposium on Microarchitecture (MICRO).Google ScholarGoogle Scholar
  439. Jingyang Zhu, Jingbo Jiang, Xizi Chen, and Chi-Ying Tsui. 2018. SparseNN: An Energy-Efficient Neural Network Accelerator Exploiting Input and Output Sparsity. In Design, Automation & Test in Europe Conference & Exhibition (DATE). arXiv:cs.LG/1711.01263Google ScholarGoogle Scholar
  440. Jingyang Zhu, Zhiliang Qian, and Chi-Ying Tsui. 2016. LRADNN: High-throughput and energy-efficient Deep Neural Network accelerator using Low Rank Approximation. In Asia and South Pacific Design Automation Conference (ASP-DAC).Google ScholarGoogle Scholar
  441. Michael Zhu and Suyog Gupta. 2017. To prune, or not to prune: exploring the efficacy of pruning for model compression. (2017). arXiv:stat.ML/1710.01878Google ScholarGoogle Scholar
  442. Tao Zhuang, Zhixuan Zhang, Yuheng Huang, Xiaoyi Zeng, Kai Shuang, and Xiang Li. 2020. Neuron-level Structured Pruning using Polarization Regularizer. In Advances in Neural Information Processing Systems (NeurIPS). https://proceedings.neurips.cc/paper/2020/hash/703957b6dd9e3a7980e040bee50ded65-Abstract.htmlGoogle ScholarGoogle Scholar
  443. Zhuangwei Zhuang, Mingkui Tan, Bohan Zhuang, Jing Liu, Yong Guo, Qingyao Wu, Junzhou Huang, and Jinhui Zhu. 2018. Discrimination-aware Channel Pruning for Deep Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:cs.CV/1810.11809Google ScholarGoogle Scholar

        • Published in

          The Journal of Machine Learning Research, Volume 22, Issue 1 (January 2021), 13310 pages.
          ISSN: 1532-4435; EISSN: 1533-7928.
          Copyright © 2021. Publisher: JMLR.org.

          Publication History

          • Received: 1 April 2021
          • Revised: 1 June 2021
          • Accepted: 1 September 2021
          • Published: 1 January 2021
