Abstract
Convolutional Neural Networks (CNNs) are a class of Artificial Neural Networks (ANNs) that are highly effective at the pattern recognition tasks underlying difficult AI problems in a variety of domains, such as speech recognition, object recognition, and natural language processing. CNNs are, however, computationally intensive to train. This paper presents the first characterization of the performance optimization opportunities for training CNNs on CPUs. Our characterization includes insights based on the structure of the network itself (i.e., the intrinsic arithmetic intensity of the convolution and its scalability under parallelism) as well as dynamic properties of its execution (i.e., the sparsity of the computation).
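To make the notion of arithmetic intensity concrete, the sketch below computes the FLOPs-per-byte ratio of a single convolution layer under an idealized traffic model in which every input, weight, and output value is moved exactly once; the layer dimensions are illustrative values, not taken from the paper.

```cpp
#include <cstdio>

// Illustrative sketch: arithmetic intensity (FLOPs per byte) of one
// convolution layer, assuming each input, weight, and output value is
// touched exactly once (a perfect-reuse lower bound on memory traffic).
int main() {
    // Layer shape (made-up example values, not from the paper).
    const long Cin = 96, Cout = 256, K = 5;   // channels and kernel size
    const long H = 27, W = 27;                // output spatial dimensions

    // One multiply and one add per (output pixel, input channel, kernel tap).
    long flops = 2 * Cout * H * W * Cin * K * K;

    // Ideal traffic: read inputs and weights once, write outputs once (fp32).
    long bytes = 4 * (Cin * (H + K - 1) * (W + K - 1)   // input
                      + Cout * Cin * K * K              // weights
                      + Cout * H * W);                  // output

    printf("FLOPs: %ld, bytes: %ld, intensity: %.1f FLOPs/byte\n",
           flops, bytes, (double)flops / bytes);
    return 0;
}
```

Because the multiply-accumulate count grows with Cin * K^2 per output element while the ideal traffic grows only linearly in the tensor sizes, convolutions typically have high intrinsic arithmetic intensity, which is what makes CPU-side optimization worthwhile.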
Given this characterization, we present spg-CNN, an automatic framework for optimizing CNN training on CPUs. It comprises a computation scheduler for efficient parallel execution and two code generators: one that optimizes for sparsity (a sketch of the idea follows), and one that optimizes for spatial reuse in convolutions.
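As a minimal sketch of the sparsity idea (this is not spg-CNN's actual generated code), the kernel below uses a scatter formulation of 2D convolution: each nonzero input activation is pushed into the output windows it touches, so activations zeroed out by, e.g., ReLU skip their multiply-accumulates entirely.

```cpp
#include <vector>

// Minimal sketch of a sparsity-aware direct convolution (single channel,
// stride 1, "valid" padding). Zero inputs contribute nothing, so their
// entire K*K multiply-accumulate window is skipped at runtime.
void conv2d_sparse(const std::vector<float>& in, int H, int W,
                   const std::vector<float>& kernel, int K,
                   std::vector<float>& out) {
    const int Ho = H - K + 1, Wo = W - K + 1;
    out.assign(static_cast<size_t>(Ho) * Wo, 0.0f);
    for (int y = 0; y < H; ++y) {
        for (int x = 0; x < W; ++x) {
            float v = in[y * W + x];
            if (v == 0.0f) continue;      // dynamic sparsity: skip the MACs
            // Scatter: input (y, x) contributes to every output (oy, ox)
            // whose kernel window covers it.
            for (int ky = 0; ky < K; ++ky) {
                for (int kx = 0; kx < K; ++kx) {
                    int oy = y - ky, ox = x - kx;
                    if (oy >= 0 && oy < Ho && ox >= 0 && ox < Wo)
                        out[oy * Wo + ox] += v * kernel[ky * K + kx];
                }
            }
        }
    }
}
```

A generated kernel would additionally block and vectorize these loops; the point here is only that dynamic sparsity lets the innermost work be elided at runtime rather than at compile time.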
We evaluate spg-CNN using convolutions from a variety of real-world benchmarks, and show that spg-CNN can train CNNs up to an order of magnitude faster than state-of-the-art approaches.