Research article · EuroMPI Conference Proceedings · https://doi.org/10.1145/3416315.3416316

Using Advanced Vector Extensions AVX-512 for MPI Reductions

Published: 07 October 2020

ABSTRACT

As the scale of high-performance computing (HPC) systems continues to grow, researchers devote themselves to exploring increasing levels of parallelism to achieve optimal performance. The design of modern CPUs, including their hierarchical memory and SIMD/vectorization capabilities, governs the efficiency of algorithms. The recent introduction of wide vector instruction set extensions (AVX and SVE) has made vectorization critically important for increasing efficiency and closing the gap to peak performance.

In this paper, we propose an implementation of the predefined MPI reduction operations that utilizes AVX, AVX2, and AVX-512 intrinsics to provide vector-based reductions and improve their time-to-solution. With these optimizations, we achieve higher efficiency for the local computations, which directly benefits the overall cost of collective reductions. The evaluation of the resulting software stack under different scenarios demonstrates that the solution is at once generic and efficient. Experiments conducted on an Intel Xeon Gold cluster show that our AVX-512-optimized reduction operations achieve a 10X performance improvement over the Open MPI defaults for local MPI reductions.
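To make the technique concrete, below is a minimal sketch of what an AVX-512 vectorized local reduction for MPI_SUM over float buffers might look like. The function name, signature, and scalar tail handling are illustrative assumptions, not the authors' actual Open MPI code; the paper's implementation additionally dispatches on AVX/AVX2/AVX-512 availability and covers the full set of predefined operations and datatypes. Compile with -mavx512f on a supporting CPU.

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical local reduction kernel: out[i] += in[i] for i in [0, count).
 * Each AVX-512 register processes 16 single-precision floats at a time. */
static void local_reduce_sum_f32(const float *in, float *out, size_t count)
{
    size_t i = 0;
    /* Main vector loop: 512-bit unaligned loads, add, store back. */
    for (; i + 16 <= count; i += 16) {
        __m512 a = _mm512_loadu_ps(in + i);
        __m512 b = _mm512_loadu_ps(out + i);
        _mm512_storeu_ps(out + i, _mm512_add_ps(a, b));
    }
    /* Scalar tail for the remaining count % 16 elements. */
    for (; i < count; ++i)
        out[i] += in[i];
}
```

The same pattern generalizes to the other predefined reductions by swapping the combining intrinsic (e.g., _mm512_max_ps for MPI_MAX, _mm512_mul_ps for MPI_PROD) and the load/store width for each datatype; a masked load/store (_mm512_maskz_loadu_ps and _mm512_mask_storeu_ps) could replace the scalar tail on AVX-512.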

