Abstract
Partitioned Global Address Space (PGAS) environments simplify writing parallel code for clusters because they make data movement implicit: dereferencing a global pointer automatically moves the data. However, this does not free the programmer from reasoning about locality, because poor data placement can lead to excessive and even unnecessary communication. For this reason, modern PGAS languages such as X10, Chapel, and UPC allow programmers to express data-layout constraints and to move computation explicitly. This places an extra burden on the programmer and is less effective for applications with limited or data-dependent locality (e.g., graph analytics).
This paper proposes Alembic, a new static analysis that frees programmers from having to manually move computation to exploit locality in PGAS programs. It works by determining regions of code that access the same cluster node, then transforming the code to migrate parts of the execution to increase the proportion of accesses to local data. We implement the analysis and transformation for C++ in LLVM and show that in irregular application kernels, Alembic can achieve 82% of the performance of hand-tuned communication (for comparison, naïve compiler-generated communication achieves only 13%).
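The idea of migrating execution to the node that owns the data can be illustrated with a toy model. The sketch below is not Alembic's analysis or its LLVM transformation; `Cluster`, `GlobalPtr`, `get`, `migrate`, `sum_naive`, and `sum_migrated` are hypothetical names invented for illustration, and the "network" is only a counter of simulated round trips. It assumes two accesses through the same global pointer form a region that touches a single node.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy model: each node owns a slice of vertices.
struct Vertex { int a; int b; };

// Hypothetical global pointer: (owning node, index within that node).
struct GlobalPtr { std::size_t owner; std::size_t idx; };

struct Cluster {
    std::vector<std::vector<Vertex>> nodes; // nodes[n][i]: i-th vertex on node n
    std::size_t messages = 0;               // simulated network round trips

    // Read one field; costs a round trip unless we already run on the owner.
    int get(std::size_t here, GlobalPtr p, int Vertex::*field) {
        assert(p.owner < nodes.size() && p.idx < nodes[p.owner].size());
        if (here != p.owner) ++messages;
        return nodes[p.owner][p.idx].*field;
    }

    // Run a region of code on `target`; costs one round trip if it is remote.
    template <typename Region>
    int migrate(std::size_t here, std::size_t target, Region region) {
        if (here != target) ++messages;
        return region(target); // the region now executes "on" target
    }
};

// Naive code generation: every dereference is its own remote access.
int sum_naive(Cluster& c, std::size_t here, GlobalPtr p) {
    return c.get(here, p, &Vertex::a) + c.get(here, p, &Vertex::b);
}

// Migrated version: both accesses share an owner, so ship the region there.
int sum_migrated(Cluster& c, std::size_t here, GlobalPtr p) {
    return c.migrate(here, p.owner, [&c, p](std::size_t now) {
        return c.get(now, p, &Vertex::a) + c.get(now, p, &Vertex::b);
    });
}
```

In this model, reading two fields of a remote vertex costs two round trips in the naive version and one in the migrated version, and the gap grows with the number of accesses in the region. Deciding which regions share an owner is the part the paper assigns to static analysis; here it is simply assumed.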
Alembic: automatic locality extraction via migration
OOPSLA '14: Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications