Abstract
The availability of large data sets such as ImageNet and massively parallel computation support in modern HPC devices such as NVIDIA GPUs have fueled a renewed interest in Deep Learning (DL) algorithms. This has triggered the development of DL frameworks such as Caffe, Torch, TensorFlow, and CNTK. However, most DL frameworks have been limited to a single node. To scale out DL frameworks and bring HPC capabilities to the DL arena, we propose S-Caffe: a scalable and distributed Caffe adaptation for modern multi-GPU clusters. Based on an in-depth analysis of the new requirements brought forward by DL frameworks and the limitations of current communication runtimes, we present a co-design of the Caffe framework and the MVAPICH2-GDR MPI runtime. Using this co-design methodology, we modify Caffe's workflow to maximize the overlap of computation and communication with multi-stage data propagation and gradient aggregation schemes. We bring DL-awareness to the MPI runtime by proposing a hierarchical reduction design that benefits from CUDA-Aware features and provides up to a 133x speedup over OpenMPI and a 2.6x speedup over MVAPICH2 for 160 GPUs. S-Caffe successfully scales up to 160 K-80 GPUs for GoogLeNet (ImageNet) with a speedup of 2.5x over 32 GPUs. To the best of our knowledge, this is the first framework that scales up to 160 GPUs. Furthermore, even for single-node training, S-Caffe shows improvements of 14% and 9% over NVIDIA's optimized Caffe for 8 and 16 GPUs, respectively. In addition, S-Caffe achieves up to 1395 samples per second for the AlexNet model, which is comparable to the performance of Microsoft CNTK.
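The hierarchical reduction mentioned above can be illustrated with a simplified, framework-agnostic sketch. This is not S-Caffe's actual implementation (which operates on GPU buffers via CUDA-Aware MPI inside MVAPICH2-GDR); the function name, the node grouping, and the use of plain Python lists in place of device memory are illustrative assumptions only.

```python
# A minimal two-level (hierarchical) gradient reduction, simulated in pure Python.
# Step 1: ranks on the same node reduce their gradients to a node leader.
# Step 2: node leaders reduce across nodes.
# Step 3: the aggregated gradient is broadcast back to every rank.
# In a CUDA-Aware MPI runtime, each step would be a collective over GPU buffers.

def hierarchical_allreduce(grads_per_rank, ranks_per_node):
    """grads_per_rank: one equal-length gradient vector per rank (hypothetical layout)."""
    n = len(grads_per_rank)
    assert n % ranks_per_node == 0, "ranks must divide evenly into nodes"

    # Step 1: intra-node reduction to each node's leader.
    leader_sums = []
    for node_start in range(0, n, ranks_per_node):
        node_grads = grads_per_rank[node_start:node_start + ranks_per_node]
        leader_sums.append([sum(vals) for vals in zip(*node_grads)])

    # Step 2: inter-node reduction among the (fewer) node leaders only.
    total = [sum(vals) for vals in zip(*leader_sums)]

    # Step 3: broadcast the fully aggregated gradient back to all ranks.
    return [list(total) for _ in range(n)]

# Example: 4 ranks spread over 2 nodes, 3-element gradients.
grads = [[1, 2, 3], [1, 2, 3], [10, 20, 30], [10, 20, 30]]
print(hierarchical_allreduce(grads, ranks_per_node=2))
# each rank receives [22, 44, 66]
```

The two-level structure is what makes the design attractive at scale: the expensive inter-node step involves only one leader per node, while the intra-node step can exploit fast local interconnects between GPUs on the same host.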
S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters. In PPoPP '17: Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.