
S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters

Published: 26 January 2017

Abstract

The availability of large data sets like ImageNet and of massively parallel computation in modern HPC devices like NVIDIA GPUs has fueled renewed interest in Deep Learning (DL) algorithms, triggering the development of DL frameworks like Caffe, Torch, TensorFlow, and CNTK. However, most DL frameworks have been limited to a single node. To scale out DL frameworks and bring HPC capabilities to the DL arena, we propose S-Caffe, a scalable and distributed Caffe adaptation for modern multi-GPU clusters. Based on an in-depth analysis of the new requirements brought forward by DL frameworks and the limitations of current communication runtimes, we present a co-design of the Caffe framework and the MVAPICH2-GDR MPI runtime. Using this co-design methodology, we modify Caffe's workflow to maximize the overlap of computation and communication through multi-stage data propagation and gradient aggregation schemes. We bring DL-awareness to the MPI runtime by proposing a hierarchical reduction design that benefits from CUDA-aware features and provides up to a 133x speedup over OpenMPI and a 2.6x speedup over MVAPICH2 on 160 GPUs. S-Caffe successfully scales to 160 K-80 GPUs for GoogLeNet (ImageNet) with a speedup of 2.5x over 32 GPUs. To the best of our knowledge, this is the first framework that scales up to 160 GPUs. Furthermore, even for single-node training, S-Caffe shows improvements of 14% and 9% over NVIDIA's optimized Caffe for 8 and 16 GPUs, respectively. In addition, S-Caffe achieves up to 1395 samples per second for the AlexNet model, which is comparable to the performance of Microsoft CNTK.
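The hierarchical reduction idea described in the abstract (reduce gradients within each node first, then across node leaders, then broadcast the result) can be illustrated with a minimal sketch in plain Python. This is only a model of the two-level reduction pattern; it does not reflect S-Caffe's actual MVAPICH2-GDR implementation, and the function and parameter names (`hierarchical_reduce`, `gpus_per_node`) are hypothetical.

```python
def hierarchical_reduce(grads, gpus_per_node):
    """Two-level sum-reduction of per-GPU gradient vectors.

    Level 1: each node reduces the gradients of its local GPUs
    (in a real runtime this would use fast intra-node, CUDA-aware paths).
    Level 2: node leaders reduce their partial sums across nodes
    (the slower inter-node network carries far less traffic this way).
    Finally, the global sum is broadcast so every GPU holds the result.
    """
    # Group the per-GPU gradient vectors by node.
    nodes = [grads[i:i + gpus_per_node]
             for i in range(0, len(grads), gpus_per_node)]

    # Level 1: intra-node partial sums, one per node leader.
    partial = [[sum(layer) for layer in zip(*node)] for node in nodes]

    # Level 2: inter-node reduction among the node leaders.
    global_sum = [sum(layer) for layer in zip(*partial)]

    # Broadcast: every GPU receives a copy of the fully reduced gradients.
    return [list(global_sum) for _ in grads]

# Example: 4 GPUs spread over 2 nodes, each holding a 3-element "gradient".
grads = [[1, 2, 3], [1, 2, 3], [2, 0, 1], [0, 1, 1]]
out = hierarchical_reduce(grads, gpus_per_node=2)
# Every GPU ends up with the element-wise global sum [4, 5, 8].
```

The design point this sketch captures is that only one partial sum per node crosses the inter-node network, rather than one message per GPU, which is what makes the hierarchical scheme scale on multi-GPU clusters.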



Published in: ACM SIGPLAN Notices, Volume 52, Issue 8 (PPoPP '17), August 2017, 442 pages. ISSN: 0362-1340, EISSN: 1558-1160. DOI: 10.1145/3155284.

Also in: PPoPP '17: Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, January 2017, 476 pages. ISBN: 9781450344937. DOI: 10.1145/3018743.

Copyright © 2017 ACM. Publisher: Association for Computing Machinery, New York, NY, United States.
