Abstract

X10 is a high-performance, high-productivity programming language aimed at large-scale distributed and shared-memory parallel applications. It is based on the Asynchronous Partitioned Global Address Space (APGAS) programming model, supporting the same fine-grained concurrency mechanisms within and across shared-memory nodes.
We demonstrate that X10 delivers solid performance at petascale by running eight application kernels in weak-scaling mode on an IBM Power 775 supercomputer, using up to 55,680 Power7 cores (1.7 Pflop/s of theoretical peak performance). We detail our advances in distributed termination detection, distributed load balancing, and the use of high-performance interconnects that enable X10 to scale out to tens of thousands of cores.
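As a rough check on the quoted peak figure, the arithmetic can be sketched as below. The abstract states only the core count and the 1.7 Pflop/s total; the 3.84 GHz clock rate and the 8 double-precision flops per cycle per core are assumptions about the Power7 cores in this system, and the function name is purely illustrative.

```python
def theoretical_peak_pflops(cores, clock_ghz=3.84, flops_per_cycle=8):
    """Return theoretical peak in Pflop/s for a given core count.

    clock_ghz and flops_per_cycle are ASSUMED values (not given in the
    abstract): a 3.84 GHz core issuing 8 double-precision flops per cycle.
    """
    gflops_per_core = clock_ghz * flops_per_cycle   # Gflop/s per core
    return cores * gflops_per_core / 1e6            # Gflop/s -> Pflop/s


if __name__ == "__main__":
    # Core count taken from the abstract.
    print(round(theoretical_peak_pflops(55_680), 1))
```

Under these assumptions the result rounds to the 1.7 Pflop/s stated in the abstract.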
For the four HPC Class 2 Challenge benchmarks, X10 achieves 41% to 87% of the system's potential at scale (as measured by IBM's HPCC Class 1 optimized runs). We also implement K-Means, Smith-Waterman, Betweenness Centrality, and Unbalanced Tree Search (UTS) for geometric trees. Our UTS implementation is the first to scale to petaflop systems.
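To make the termination-detection idea concrete, here is a minimal single-process sketch of finish/async-style task counting, applied to a small balanced binary tree as a stand-in for tree search. It is not the distributed algorithm described in the paper: the `Finish` class, `async_`, `count_nodes`, and `run_tree` are hypothetical names invented for this illustration, and Python threads replace X10 activities and places.

```python
import threading


class Finish:
    """Waits until every task spawned through it, directly or transitively, has ended."""

    def __init__(self):
        self._pending = 0
        self._cond = threading.Condition()

    def async_(self, fn, *args):
        # Count the task before it starts, so the counter cannot reach
        # zero while a parent task is still spawning children.
        with self._cond:
            self._pending += 1

        def run():
            try:
                fn(*args)
            finally:
                with self._cond:
                    self._pending -= 1
                    if self._pending == 0:
                        self._cond.notify_all()

        threading.Thread(target=run).start()

    def wait(self):
        # Block until all counted tasks have finished.
        with self._cond:
            self._cond.wait_for(lambda: self._pending == 0)


def count_nodes(finish, depth, counter, lock):
    """Visit one tree node, then spawn one task per child (two children per node)."""
    with lock:
        counter[0] += 1
    if depth > 0:
        for _ in range(2):
            finish.async_(count_nodes, finish, depth - 1, counter, lock)


def run_tree(depth):
    """Count the nodes of a complete binary tree of the given depth."""
    finish = Finish()
    counter = [0]
    lock = threading.Lock()
    finish.async_(count_nodes, finish, depth, counter, lock)
    finish.wait()
    return counter[0]
```

Because each child is registered while its parent is still counted, the pending count reaches zero only when the whole task tree has completed. A distributed implementation would additionally have to track tasks in flight between nodes, which this sketch does not attempt.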
X10 and APGAS at Petascale
PPoPP '14: Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming