Abstract
Deep neural networks (DNNs) have recently achieved extraordinary results in domains like computer vision and speech recognition. An essential element of this success has been the introduction of high performance computing (HPC) techniques in the critical step of training the neural network. This paper describes the implementation and analysis of a network-agnostic and convergence-invariant coarse-grain parallelization of the DNN training algorithm. The coarse-grain parallelization is achieved by exploiting batch-level parallelism. This strategy is independent of the support of specialized and optimized libraries; therefore, the optimization is immediately available for accelerating DNN training. The proposal is compatible with multi-GPU execution without altering the convergence rate of the algorithm. The parallelization has been implemented in Caffe, a state-of-the-art DNN framework. The paper describes the code transformations required for the parallelization, and we also identify the performance-limiting factors of the approach. We show competitive performance results for two state-of-the-art computer vision datasets, MNIST and CIFAR-10. In particular, on a 16-core Xeon E5-2667v2 at 3.30GHz we observe speedups of 8× over the sequential execution, at performance levels similar to those obtained by the GPU-optimized Caffe version on an NVIDIA K40 GPU.
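The core idea behind batch-level parallelism can be illustrated with a short sketch. This is not the paper's Caffe implementation; it is a minimal, hypothetical example in Python/NumPy for one gradient step of a linear least-squares model. The minibatch is split across worker threads, each worker computes a partial gradient on its shard, and the partial gradients are summed. Because the summed gradient equals the sequentially computed one, the parallel scheme does not change what the optimizer sees, which is what makes this kind of parallelization convergence-invariant.

```python
# Sketch of batch-level (coarse-grain) parallelism for one SGD step.
# Hypothetical linear model with squared loss; not the paper's code.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def partial_grad(w, X, y):
    # Gradient of 0.5 * ||Xw - y||^2 restricted to this shard of the batch.
    return X.T @ (X @ w - y)

def parallel_grad(w, X, y, workers=4):
    # Split the minibatch indices into contiguous shards, one per worker,
    # compute each partial gradient concurrently, then reduce by summation.
    shards = np.array_split(np.arange(len(y)), workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda idx: partial_grad(w, X[idx], y[idx]), shards)
    return sum(parts)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))      # minibatch of 64 samples, 8 features
y = rng.normal(size=64)
w = np.zeros(8)

g_seq = partial_grad(w, X, y)     # sequential reference gradient
g_par = parallel_grad(w, X, y)    # coarse-grain parallel gradient
assert np.allclose(g_seq, g_par)  # identical gradient => identical convergence
```

The same reduction pattern generalizes to any layer whose per-sample gradients are additive across the batch, which is why the approach is network-agnostic and needs no support from specialized libraries.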
Coarse grain parallelization of deep neural networks. In PPoPP '16: Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.