ABSTRACT
This paper describes the design and implementation of a scalable run-time system and an optimizing compiler for Unified Parallel C (UPC). An experimental evaluation on BlueGene/L®, a distributed-memory machine, demonstrates that the combination of the compiler with the run-time system produces programs with performance comparable to that of efficient MPI programs and good performance scalability up to hundreds of thousands of processors. Our run-time system design solves the problem of maintaining shared object consistency efficiently on a distributed-memory machine. Our compiler infrastructure simplifies the code generated for parallel loops in UPC by eliminating affinity tests, removes several levels of indirection for accesses to segments of shared arrays that the compiler can prove to be local, and implements remote update operations through a lower-cost asynchronous message. The performance evaluation uses three well-known benchmarks (HPC RandomAccess, HPC STREAM, and NAS CG) to obtain scaling and absolute performance numbers on up to 131072 processors, the full BlueGene/L machine. These results were used to win the HPC Challenge Competition at SC05 in Seattle, WA, demonstrating that PGAS languages support both productivity and performance.
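The abstract does not show the generated code, so the following is only a minimal sketch of what eliminating an affinity test can look like, written in plain C rather than UPC. It assumes a loop with an integer affinity expression, where iteration `i` belongs to thread `i % THREADS`. The parameters `threads` and `mythread` are hypothetical stand-ins for UPC's `THREADS` and `MYTHREAD`, and both function names are invented for illustration; this is not the paper's compiler output.

```c
#include <stddef.h>

/* Naive lowering of a parallel loop: every thread walks all n iterations
   and runs the body only when the affinity test selects it.
   `threads` and `mythread` are hypothetical stand-ins for THREADS and
   MYTHREAD so that the sketch compiles as ordinary C. */
long sum_with_affinity_test(const long *a, size_t n,
                            size_t threads, size_t mythread)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i % threads == mythread) {   /* per-iteration affinity test */
            sum += a[i];
        }
    }
    return sum;
}

/* Lowering without the test: start at the thread's first iteration and
   step by the thread count, so only owned iterations are ever visited. */
long sum_strided(const long *a, size_t n,
                 size_t threads, size_t mythread)
{
    long sum = 0;
    for (size_t i = mythread; i < n; i += threads) {
        sum += a[i];
    }
    return sum;
}
```

Both functions visit the same set of iterations for a given thread, but the strided form executes roughly `n / threads` loop iterations instead of `n` and drops the modulo and branch from the loop body.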
Shared memory programming for large scale machines
Proceedings of the 2006 PLDI Conference