Optimizing CNNs on Multicores for Scalability, Performance and Goodput

Published: 04 April 2017

Abstract

Convolutional Neural Networks (CNN) are a class of Artificial Neural Networks (ANN) that are highly efficient at the pattern recognition tasks that underlie difficult AI problems in a variety of domains, such as speech recognition, object recognition, and natural language processing. CNNs are, however, computationally intensive to train. This paper presents the first characterization of the performance optimization opportunities for training CNNs on CPUs. Our characterization includes insights based on the structure of the network itself (i.e., intrinsic arithmetic intensity of the convolution and its scalability under parallelism) as well as dynamic properties of its execution (i.e., sparsity of the computation).
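
The paper's own analysis is not reproduced on this page. As a rough illustration of what "intrinsic arithmetic intensity" means for a single convolution layer, the following C sketch counts floating-point operations per byte of data that must move through memory; the layer dimensions and the assumption that each operand is touched only once are hypothetical, not taken from the paper.

#include <stdio.h>

/* Rough arithmetic-intensity estimate for a direct convolution layer.
 * FLOPs: one multiply and one add per (output pixel, output channel,
 * input channel, kernel element) tuple. Bytes: inputs, weights, and
 * outputs each read or written once (a best-case locality assumption). */
int main(void) {
    const double Hi = 224, Wi = 224, Ci = 64;     /* input feature map (hypothetical) */
    const double K  = 3,   Co = 64;               /* 3x3 kernel, 64 output channels   */
    const double Ho = Hi - K + 1, Wo = Wi - K + 1;

    double flops = 2.0 * Ho * Wo * Co * Ci * K * K;
    double bytes = 4.0 * (Hi * Wi * Ci            /* input activations  */
                        + K * K * Ci * Co         /* weights            */
                        + Ho * Wo * Co);          /* output activations */

    printf("arithmetic intensity ~ %.1f FLOPs/byte\n", flops / bytes);
    return 0;
}

A ratio well above a CPU's compute-to-bandwidth ratio suggests the layer is compute-bound, which is what makes the scheduling and code-generation opportunities described next worthwhile.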

Given this characterization, we present an automatic framework called spg-CNN for optimizing CNN training on CPUs. It comprises a computation scheduler for efficient parallel execution and two code generators: one that optimizes for sparsity, and the other that optimizes for spatial reuse in convolutions.
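
spg-CNN's generated code is not shown on this page. Purely to illustrate the kind of dynamic sparsity a code generator can exploit, the following C sketch computes a 1-D convolution while skipping zero-valued input activations (common after ReLU); the function name, signature, and loop structure are illustrative assumptions, not spg-CNN's output.

/* Minimal sketch of a sparsity-aware 1-D convolution.
 * Iterating over inputs rather than outputs lets a zero activation
 * skip its entire contribution; a code generator can specialize this
 * further per layer. Output length is n - k + 1. */
void conv1d_sparse(const float *in, int n,        /* input, length n  */
                   const float *w, int k,         /* kernel, length k */
                   float *out)
{
    for (int i = 0; i + k <= n; ++i)
        out[i] = 0.0f;

    for (int j = 0; j < n; ++j) {
        if (in[j] == 0.0f)
            continue;                             /* exploit activation sparsity */
        for (int t = 0; t < k; ++t) {
            int i = j - t;                        /* output position this input feeds */
            if (i >= 0 && i + k <= n)
                out[i] += in[j] * w[t];
        }
    }
}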

We evaluate spg-CNN using convolutions from a variety of real-world benchmarks, and show that spg-CNN can train CNNs faster than state-of-the-art approaches by an order of magnitude.
