Research Article • Open Access

A domain-specific supercomputer for training deep neural networks

Published: 18 June 2020

Abstract

Google's TPU supercomputers train deep neural networks 50x faster than general-purpose supercomputers running a high-performance computing benchmark.



        • Published in

          Communications of the ACM, Volume 63, Issue 7 (July 2020), 102 pages
          ISSN: 0001-0782
          EISSN: 1557-7317
          DOI: 10.1145/3407166

          Copyright © 2020 Owner/Author

          Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 18 June 2020
          • Online First: 18 June 2020


          Qualifiers

          • research-article
          • Popular
          • Refereed
