Abstract
Known algorithms for two important collective communication operations, allgather and reduce-scatter, are minimal-communication algorithms; no process sends or receives more than the minimum amount of data. This property, combined with the data-ordering semantics of the operations, limits the flexibility and performance of these algorithms. Our novel non-minimal, topology-aware algorithms deliver far better performance at the cost of a very small amount of redundant communication. We develop novel algorithms for Clos networks and for single- and multi-ported torus networks. Tests on a 32k-node Blue Gene/P show allgather speedups of up to 6x and reduce-scatter speedups of over 11x compared to the native IBM algorithms. Broadcast, reduce, and allreduce can be composed from allgather or reduce-scatter and other collective operations; our techniques therefore also improve the performance of those operations.
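The composition mentioned above (e.g., allreduce = reduce-scatter followed by allgather) can be sketched in simulation. The code below is an illustrative model, not the paper's topology-aware algorithms: processes are list indices, "communication" is array copying, the allgather uses the classic ring schedule, and the reduce-scatter is collapsed to direct summation for brevity.

```python
# Illustrative sketch (NOT the paper's non-minimal algorithms): simulated
# collectives over P model "processes", each represented by a list index.

def ring_allgather(blocks):
    """Ring allgather: each of P processes starts with one block; after
    P-1 steps every process holds all P blocks in rank order."""
    P = len(blocks)
    bufs = [[None] * P for _ in range(P)]
    for r in range(P):
        bufs[r][r] = blocks[r]            # each rank contributes its own block
    for step in range(P - 1):
        for r in range(P):
            # At each step, rank r receives from its left neighbor the
            # block that neighbor obtained (or contributed) step hops ago.
            idx = (r - 1 - step) % P
            bufs[r][idx] = blocks[idx]
    return bufs

def reduce_scatter(vectors):
    """Reduce-scatter with sum: each process contributes a length-P
    vector; process r ends up owning the sum of element r across all
    processes. (Summed directly here rather than via a ring schedule.)"""
    P = len(vectors)
    return [sum(vectors[p][r] for p in range(P)) for r in range(P)]

def allreduce(vectors):
    """Allreduce composed as reduce-scatter + allgather, the standard
    decomposition the abstract refers to."""
    partial = reduce_scatter(vectors)     # each rank owns one reduced block
    return ring_allgather(partial)        # replicate reduced blocks everywhere
```

With four processes each contributing `[1, 2, 3, 4]`, every process ends up holding `[4, 8, 12, 16]`; the bandwidth cost of the composition is what the paper's non-minimal variants trade against.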
- J. H. Ahn, N. Binkert, A. Davis, M. McLaren, and R. S. Schreiber. HyperX: Topology, routing, and packaging of efficient large-scale networks. In Proc. of the Conf. on High Performance Computing Networking, Storage and Analysis, SC '09, pages 41:1--41:11, New York, NY, USA, 2009. ACM.
- M. Barnett, D. G. Payne, and R. A. van de Geijn. Optimal broadcasting in mesh-connected architectures. Technical report, Austin, TX, USA, 1991.
- M. Barnett, S. Gupta, D. G. Payne, L. Shuler, R. van de Geijn, and J. Watts. Interprocessor collective communication library (InterCom). In Proc. of the Scalable High Performance Computing Conf., pages 357--364. IEEE Computer Society Press, 1994.
- K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, J. Hiller, S. Karp, S. Keckler, D. Klein, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams, and K. Yelick. Exascale computing study: Technology challenges in achieving exascale systems, 2008.
- J. Bruck, C.-T. Ho, S. Kipnis, E. Upfal, and D. Weathersby. Efficient algorithms for all-to-all communications in multi-port message-passing systems. IEEE Transactions on Parallel and Distributed Systems, pages 298--309, 1997.
- E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham, and T. S. Woodall. Open MPI: Goals, concept, and design of a next generation MPI implementation. In Proc. 11th European PVM/MPI Users' Group Meeting, pages 97--104, Budapest, Hungary, September 2004.
- W. Gropp, E. Lusk, N. Doss, and A. Skjellum. A high-performance, portable implementation of the MPI message passing interface standard. Parallel Computing, 22 (6): 789--828, Sept. 1996.
- W. D. Gropp and E. Lusk. User's Guide for mpich, a Portable Implementation of MPI. Mathematics and Computer Science Division, Argonne National Laboratory, 1996. ANL-96/6.
- T. Hoefler, T. Schneider, and A. Lumsdaine. Multistage switches are not crossbars: Effects of static routing in high-performance networks. In Proc. of the 2008 IEEE Int. Conf. on Cluster Computing. IEEE Computer Society, Oct. 2008.
- IBM Blue Gene Team. Overview of the IBM Blue Gene/P project. IBM Journal of Research and Development, 52 (1/2): 199--220, 2008.
- N. Jain and Y. Sabharwal. Optimal bucket algorithms for large MPI collectives on torus interconnects. In Proc. of the 24th ACM Int. Conf. on Supercomputing, ICS '10, pages 27--36, 2010.
- H. Meuer, E. Strohmaier, J. Dongarra, and H. Simon. TOP500 Supercomputing Sites, 2011 (accessed May 13, 2011). http://top500.org.
- B. L. Payne, M. Barnett, R. Littlefield, D. G. Payne, and R. A. van de Geijn. Global combine on mesh architectures with wormhole routing. In Proc. of the 7th Int. Parallel Processing Symp., 1993.
- P. Sack. Scalable Collective Message-passing Algorithms. PhD thesis, University of Illinois at Urbana-Champaign, 2011.
- H. Tang and T. Yang. Optimizing threaded MPI execution on SMP clusters. In Proc. of the 15th ACM Int. Conf. on Supercomputing, pages 381--392. ACM Press, 2001.
- R. Thakur and W. Gropp. Improving the performance of collective operations in MPICH. In Recent Advances in Parallel Virtual Machine and Message Passing Interface (10th European PVM/MPI Users' Group Meeting), number 2840 in LNCS, pages 257--267. Springer Verlag, 2003.
- R. Thakur and R. Rabenseifner. Optimization of collective communication operations in MPICH. Int. Journal of High Performance Computing Applications, 19: 49--66, 2005.
- J. L. Träff, A. Ripke, C. Siebert, P. Balaji, R. Thakur, and W. Gropp. A pipelined algorithm for large, irregular all-gather problems. Int. Journal of High Performance Computing Applications, 24: 58--68, February 2010.
- E. Zahavi. Fat-trees routing and node ordering providing contention free traffic for MPI global collectives. In Parallel and Distributed Processing Workshops and PhD Forum, pages 761--770, 2011.
Faster topology-aware collective algorithms through non-minimal communication
Published in PPoPP '12: Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.