Abstract
We present the first fully automated compiler-runtime system that successfully translates and executes OpenMP shared-address-space programs on laboratory-size clusters, for the complete set of regular, repetitive applications in the NAS Parallel Benchmarks. We introduce a hybrid compiler-runtime translation scheme. Compared to previous work, this scheme features a new runtime data flow analysis and new compiler techniques for improving data affinity and reducing communication costs. We present and discuss the performance of our translated programs, and compare it with the performance of the MPI, HPF, and UPC versions of the benchmarks. The results show that our translated programs achieve, on average, 75% of the performance of the hand-coded MPI programs.
A hybrid approach of OpenMP for clusters
PPoPP '12: Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming