DOI: 10.5555/2999134.2999271
Article

Large scale distributed deep networks

Published: 03 December 2012

ABSTRACT

Recent work in unsupervised feature learning and deep learning has shown that being able to train large models can dramatically improve performance. In this paper, we consider the problem of training a deep network with billions of parameters using tens of thousands of CPU cores. We have developed a software framework called DistBelief that can utilize computing clusters with thousands of machines to train large models. Within this framework, we have developed two algorithms for large-scale distributed training: (i) Downpour SGD, an asynchronous stochastic gradient descent procedure supporting a large number of model replicas, and (ii) Sandblaster, a framework that supports a variety of distributed batch optimization procedures, including a distributed implementation of L-BFGS. Downpour SGD and Sandblaster L-BFGS both increase the scale and speed of deep network training. We have successfully used our system to train a deep network 30x larger than previously reported in the literature, and achieve state-of-the-art performance on ImageNet, a visual object recognition task with 16 million images and 21k categories. We show that these same techniques dramatically accelerate the training of a more modestly sized deep network for a commercial speech recognition service. Although we focus on and report performance of these methods as applied to training large neural networks, the underlying algorithms are applicable to any gradient-based machine learning algorithm.
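To make the Downpour SGD pattern described above concrete, the sketch below shows the asynchronous parameter-server idea on a single machine: several model replicas each train on their own data shard, periodically fetching the current parameters from a shared server and pushing their (possibly stale) gradients back without waiting for one another. This is only a minimal illustration under stated assumptions, not the DistBelief implementation: the ParameterServer class, the replica function, the fetch_every/push_every schedule, and the toy linear-regression problem are all hypothetical stand-ins, and Python threads replace DistBelief's sharded parameter server and model parallelism across machines.

```python
import threading
import numpy as np

class ParameterServer:
    """Holds the global parameters; replicas fetch and push asynchronously.
    (Illustrative stand-in for a sharded, multi-machine parameter server.)"""
    def __init__(self, dim, lr=0.05):
        self.w = np.zeros(dim)
        self.lr = lr
        self.lock = threading.Lock()

    def fetch(self):
        with self.lock:
            return self.w.copy()

    def push(self, grad):
        # Apply the (possibly stale) gradient as soon as it arrives.
        with self.lock:
            self.w -= self.lr * grad

def replica(ps, data_shard, labels, steps, fetch_every=5, push_every=1):
    """One model replica: SGD on its own data shard, syncing with the server."""
    w = ps.fetch()
    for t in range(steps):
        if t % fetch_every == 0:
            w = ps.fetch()                      # refresh local parameter copy
        i = np.random.randint(len(data_shard))  # a minibatch of size 1 here
        x, y = data_shard[i], labels[i]
        grad = (x @ w - y) * x                  # gradient of squared error
        w -= ps.lr * grad                       # local step
        if t % push_every == 0:
            ps.push(grad)                       # send gradient to the server

# Toy linear-regression problem split across 4 replicas (assumed for illustration).
rng = np.random.default_rng(0)
true_w = rng.normal(size=10)
X = rng.normal(size=(4000, 10))
y = X @ true_w
ps = ParameterServer(dim=10)
threads = [threading.Thread(target=replica, args=(ps, X[idx], y[idx], 2000))
           for idx in np.array_split(np.arange(len(X)), 4)]
for th in threads: th.start()
for th in threads: th.join()
print("parameter error:", np.linalg.norm(ps.fetch() - true_w))
```

Because pushes are applied as they arrive, replicas may compute gradients against slightly outdated parameters; tolerating this staleness in exchange for removing synchronization barriers is the central trade-off of the asynchronous approach.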


Published in

NIPS'12: Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1
December 2012, 3328 pages
Publisher: Curran Associates Inc., Red Hook, NY, United States