DOI: 10.1145/3567955.3567959 · ACM ASPLOS Conference Proceedings
Research Article · Open Access

Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models

Published: 21 December 2022

ABSTRACT

Large deep learning models have shown great potential, achieving state-of-the-art results in many tasks. However, running these models on a single accelerator (GPU or TPU) is challenging because their size exceeds the limited on-device memory. Intra-layer model parallelism addresses this issue by partitioning individual layers or operators across multiple devices in a distributed accelerator cluster. However, the data communication generated by intra-layer model parallelism can account for a significant fraction of the overall execution time and severely hurt computational efficiency. As a concrete illustration of this pattern (a hedged sketch, not part of the original abstract), see the example below.
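The sketch below shows one common intra-layer model parallelism pattern, expressed with JAX's shard_map: a matrix multiplication whose left operand is sharded across devices on its rows and whose right operand is sharded on its columns, so a blocking all-gather must finish before the dependent matmul can start. The mesh axis name, shapes, and function names are illustrative assumptions, not taken from the paper.

```python
# Hedged sketch, not taken from the paper: a 1D tensor-parallel matmul in JAX.
# The left operand is sharded on its rows and the right operand on its columns,
# so each device must all-gather the full lhs before the dependent matmul can
# start, leaving the collective's latency fully exposed on the critical path.
from functools import partial

import jax
import jax.numpy as jnp
import numpy as np
from jax.experimental.shard_map import shard_map
from jax.sharding import Mesh, PartitionSpec as P

mesh = Mesh(np.array(jax.devices()), axis_names=("x",))

@partial(shard_map, mesh=mesh,
         in_specs=(P("x", None), P(None, "x")),  # lhs sharded on rows, rhs on columns
         out_specs=P(None, "x"))                 # output sharded on columns
def baseline_matmul(lhs_shard, rhs_shard):
    # Blocking collective: no compute can begin until the whole lhs arrives,
    # so communication and the dependent computation are serialized.
    lhs = jax.lax.all_gather(lhs_shard, "x", axis=0, tiled=True)
    return lhs @ rhs_shard

num = len(jax.devices())
lhs = jnp.ones((8 * num, 128), jnp.float32)   # global lhs, shape [M, K]
rhs = jnp.ones((128, 64 * num), jnp.float32)  # global rhs, shape [K, N]
out = baseline_matmul(lhs, rhs)               # global result, shape [M, N]
```

In this baseline, the all-gather sits directly on the critical path of the matmul that depends on it, which is the communication overhead the paper targets.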

Because intra-layer model parallelism is critical for enabling large deep learning models, this paper proposes a novel technique that effectively reduces its data communication overhead by overlapping communication with computation. With the proposed technique, an identified communication collective is decomposed, together with its dependent computation operation, into a sequence of finer-grained operations. By creating more overlapping opportunities and executing the newly created, finer-grained communication and computation operations in parallel, the technique hides the data transfer latency and achieves better system utilization. Evaluated on TPU v4 Pods with several types of large models having 10 billion to 1 trillion parameters, the proposed technique improves system throughput by 1.14x to 1.38x. The highest peak FLOPS utilization achieved is 72% on 1024 TPU chips with a 500-billion-parameter large language model. A hedged sketch of the decomposition idea follows.
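To make the decomposition idea concrete, here is a hedged JAX re-expression of it (the author's own sketch under the assumptions of the previous example, not the paper's XLA implementation): the blocking all-gather followed by a matmul is rewritten as a loop of finer-grained steps in which each step multiplies the lhs chunk already resident on the device while a collective permute forwards that chunk to the next device, so each step's communication has no data dependence on that step's computation and the two can be overlapped.

```python
# Hedged sketch of decomposing all-gather + matmul into finer-grained steps
# (reuses the mesh, partition specs, and shapes assumed in the previous sketch).
from functools import partial

import jax
import jax.numpy as jnp
import numpy as np
from jax.experimental.shard_map import shard_map
from jax.sharding import Mesh, PartitionSpec as P

mesh = Mesh(np.array(jax.devices()), axis_names=("x",))
num = mesh.shape["x"]

@partial(shard_map, mesh=mesh,
         in_specs=(P("x", None), P(None, "x")),
         out_specs=P(None, "x"))
def decomposed_matmul(lhs_shard, rhs_shard):
    idx = jax.lax.axis_index("x")
    rows = lhs_shard.shape[0]                        # rows per lhs shard
    out = jnp.zeros((rows * num, rhs_shard.shape[1]), lhs_shard.dtype)
    ring = [(i, (i + 1) % num) for i in range(num)]  # send each chunk to the next device

    chunk = lhs_shard
    for step in range(num):
        # After `step` permutes, this device holds the chunk that originated
        # on device (idx - step) mod num, i.e. that slice of the global lhs.
        src = (idx - step) % num
        partial_out = chunk @ rhs_shard              # finer-grained computation
        out = jax.lax.dynamic_update_slice(out, partial_out, (src * rows, 0))
        if step + 1 < num:
            # Finer-grained communication: forward the chunk around the ring.
            # It has no data dependence on this step's matmul, so the compiler
            # and runtime are free to overlap the transfer with the compute.
            chunk = jax.lax.ppermute(chunk, "x", ring)
    return out
```

Under these assumptions, decomposed_matmul produces the same global [M, N] result as the blocking baseline; the difference is that each per-chunk transfer can be hidden behind a per-chunk matmul, which is the kind of overlap the paper's decomposition exposes to the scheduler.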
