Abstract
In the many-core era, the performance of MPI collectives depends increasingly on the intra-node communication component. However, intra-node communication algorithms are generally inherited from their inter-node counterparts and ignore cache complexity. We propose cache-oblivious algorithms for MPI all-to-all operations, in which data blocks are copied into the receive buffers in Morton order to exploit data locality. Experimental results on different many-core architectures show that our cache-oblivious implementations significantly outperform both naive shared-heap implementations and highly optimized MPI libraries.
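The abstract does not spell out the copy schedule, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a single-threaded simulation in which the send and receive buffers of all processes live in one address space (modeled as Python lists), and it visits the (source, destination) block pairs along a Morton (Z-order) curve. The names `morton_decode`, `morton_pairs`, and `alltoall_morton` are hypothetical and introduced for illustration.

```python
def morton_decode(z):
    """Split the interleaved bits of z into (row, col).

    Even-numbered bits of z form col, odd-numbered bits form row.
    """
    row = col = 0
    bit = 0
    while z:
        col |= (z & 1) << bit
        row |= ((z >> 1) & 1) << bit
        z >>= 2
        bit += 1
    return row, col


def morton_pairs(p):
    """Yield every (src, dst) pair with 0 <= src, dst < p in Morton order."""
    side = 1
    while side < p:
        side *= 2  # round up to a power of two so the Z-curve covers the grid
    for z in range(side * side):
        src, dst = morton_decode(z)
        if src < p and dst < p:  # skip padding cells outside the p x p grid
            yield src, dst


def alltoall_morton(send, block):
    """Simulate an all-to-all exchange, copying blocks in Morton order.

    send[s] is the send buffer of process s: p consecutive blocks of
    `block` elements, where block d is destined for process d.
    Returns recv, where recv[d] holds block s at offset s * block.
    """
    p = len(send)
    recv = [[None] * (p * block) for _ in range(p)]
    for src, dst in morton_pairs(p):
        # Copy the block that src sends to dst into dst's receive buffer.
        recv[dst][src * block:(src + 1) * block] = \
            send[src][dst * block:(dst + 1) * block]
    return recv
```

Visiting the pairs along the Z-curve means that consecutive copies tend to touch neighboring source and destination blocks at every scale, which is the locality property a cache-oblivious schedule relies on without knowing cache sizes. In a real MPI library the copies would be divided among the processes on the node rather than executed by one loop; how that division is done is not stated in the abstract and is not modeled here.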
POSTER: Cache-Oblivious MPI All-to-All Communications on Many-Core Architectures
PPoPP '17: Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming