ABSTRACT
This paper presents and validates performance models for a variety of high-performance collective communication algorithms for systems with Cell processors. The systems modeled include a single Cell processor, two Cell chips on a Cell Blade, and a cluster of Cell Blades. The models extend PLogP, the well-known point-to-point performance model, by accounting for the unique hardware characteristics of the Cell (e.g., heterogeneous interconnects and DMA engines) and by applying the model to collective communication. This paper also presents a micro-benchmark suite that accurately measures the extended PLogP parameters on the Cell Blade, and then uses these parameters to model different algorithms for the barrier, broadcast, reduce, all-reduce, and all-gather collective operations. Of 425 total performance predictions, 398 err by less than 10% relative to the actual execution time, and all of them err by less than 15%.
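As a rough illustration of how a PLogP-style prediction and its error might be computed, the sketch below uses the plain (unextended) PLogP point-to-point cost, L + g(m), and a textbook binomial-tree broadcast estimate of ceil(log2 P) rounds. It is not the extended model described in the abstract: the function names, the gap function, and all parameter values are hypothetical and chosen only for illustration.

```python
def plogp_point_to_point(latency, gap_of, msg_bytes):
    """Plain PLogP point-to-point cost: L + g(m).

    `latency` is L and `gap_of` is the size-dependent gap g(m);
    both are hypothetical inputs, not measured Cell parameters.
    """
    return latency + gap_of(msg_bytes)


def binomial_broadcast_estimate(num_procs, latency, gap_of, msg_bytes):
    """Textbook binomial-tree broadcast estimate: ceil(log2 P) * (L + g(m))."""
    # (P - 1).bit_length() equals ceil(log2 P) for every integer P >= 1.
    rounds = (num_procs - 1).bit_length()
    return rounds * plogp_point_to_point(latency, gap_of, msg_bytes)


def relative_error(predicted, measured):
    """Prediction error as a fraction of the measured execution time."""
    return abs(predicted - measured) / measured


def fraction_within(errors, threshold):
    """Share of predictions whose relative error is below `threshold`."""
    return sum(1 for e in errors if e < threshold) / len(errors)


if __name__ == "__main__":
    # Hypothetical parameters in microseconds; illustrative only.
    gap = lambda m: 0.2 + 0.001 * m
    predicted = binomial_broadcast_estimate(8, 1.0, gap, 1024)
    measured = 7.0  # made-up measurement used only to exercise the code
    print(predicted, relative_error(predicted, measured))
```

With the figures reported in the abstract, 398 of 425 predictions falling under the 10% threshold corresponds to roughly 93.6% of all predictions.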
Index Terms
Modeling advanced collective communication algorithms on cell-based systems
PPoPP '10