Abstract
Google's TPU supercomputers train deep neural networks 50x faster than general-purpose supercomputers running a high-performance computing benchmark.