ABSTRACT
As the scale of high-performance computing (HPC) systems continues to grow, researchers are exploring ever-increasing levels of parallelism in pursuit of optimal performance. The design of modern CPUs, with their hierarchical memories and SIMD/vectorization capabilities, governs the efficiency of algorithms. The recent introduction of wide vector instruction-set extensions (AVX and SVE) has made vectorization critically important for increasing efficiency and closing the gap to peak performance.
In this paper, we propose an implementation of the predefined MPI reduction operations that uses AVX, AVX2, and AVX-512 intrinsics to provide vectorized reductions and improve the time-to-solution of these operations. With these optimizations, we achieve higher efficiency for the local computation, which directly benefits the overall cost of collective reductions. An evaluation of the resulting software stack under different scenarios demonstrates that the solution is both generic and efficient. Experiments conducted on an Intel Xeon Gold cluster show that our AVX-512-optimized reduction operations achieve a 10x performance improvement over the Open MPI default for local MPI reductions.
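To make the approach concrete, the following is a minimal sketch, assuming AVX-512F support and a C toolchain, of the kind of vectorized local computation involved; the function name and structure are illustrative, not the paper's actual implementation. It performs the element-wise accumulation carried out by the local phase of an MPI_SUM reduction over MPI_FLOAT buffers:

```c
/* Illustrative sketch (not the authors' code) of a vectorized
 * MPI_SUM local reduction on float buffers.
 * Compile with, e.g., gcc -O2 -mavx512f. */
#include <immintrin.h>
#include <stddef.h>

/* Element-wise out[i] += in[i], as performed locally by
 * MPI_Reduce/MPI_Allreduce with op = MPI_SUM, datatype = MPI_FLOAT. */
static void sum_float_avx512(const float *in, float *out, size_t count)
{
    size_t i = 0;
    /* Process 16 floats (one 512-bit register) per iteration. */
    for (; i + 16 <= count; i += 16) {
        __m512 a = _mm512_loadu_ps(in + i);
        __m512 b = _mm512_loadu_ps(out + i);
        _mm512_storeu_ps(out + i, _mm512_add_ps(a, b));
    }
    /* Scalar tail for the remaining (count mod 16) elements;
     * AVX-512 mask registers could handle this without a scalar loop. */
    for (; i < count; i++)
        out[i] += in[i];
}
```

The same pattern generalizes to the other predefined operations (MPI_MAX, MPI_PROD, bitwise ops, etc.) and datatypes by substituting the corresponding intrinsics, which is what makes an intrinsics-based implementation of the full set of predefined reductions tractable.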