research-article

Alembic: automatic locality extraction via migration

Published: 15 October 2014

Abstract

Partitioned Global Address Space (PGAS) environments simplify writing parallel code for clusters because they make data movement implicit: dereferencing a global pointer automatically moves the data. However, this does not free the programmer from needing to reason about locality, because poor placement of data can lead to excessive and even unnecessary communication. For this reason, modern PGAS languages such as X10, Chapel, and UPC allow programmers to express data-layout constraints and explicitly move computation. This places an extra burden on the programmer, and is less effective for applications with limited or data-dependent locality (e.g., graph analytics).

This paper proposes Alembic, a new static analysis that frees programmers from having to manually move computation to exploit locality in PGAS programs. It works by determining regions of code that access the same cluster node, then transforming the code to migrate parts of the execution to increase the proportion of accesses to local data. We implement the analysis and transformation for C++ in LLVM and show that in irregular application kernels, Alembic can achieve 82% of the performance of hand-tuned communication (for comparison, naïve compiler-generated communication achieves only 13%).
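
To make the trade-off concrete, the following is a minimal sketch of a toy cost model, not Alembic's analysis or its LLVM transformation. It assumes a task is described only by the sequence of nodes that own the data it touches, that a remote access costs two messages (request and reply), and that a migration costs one message. The names `NodeId`, `naive_messages`, and `migrating_messages` are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy model: an access is identified only by the node owning its data.
using NodeId = int;

// Naive strategy: the task stays on `home`; every access to data owned by
// another node is assumed to cost a request message plus a reply message.
std::size_t naive_messages(NodeId home, const std::vector<NodeId>& accesses) {
    assert(home >= 0 && "node ids are non-negative in this toy model");
    std::size_t messages = 0;
    for (NodeId owner : accesses) {
        if (owner != home) messages += 2;
    }
    return messages;
}

// Migration strategy: consecutive accesses to the same node form a region.
// The task moves once (one message, by assumption) at each region boundary,
// performs the accesses locally, and finally returns to `home`.
std::size_t migrating_messages(NodeId home, const std::vector<NodeId>& accesses) {
    assert(home >= 0 && "node ids are non-negative in this toy model");
    std::size_t messages = 0;
    NodeId current = home;
    for (NodeId owner : accesses) {
        if (owner != current) {
            ++messages;       // migrate to the node that owns the data
            current = owner;
        }
    }
    if (current != home) ++messages;  // return to the home node
    return messages;
}
```

Under these assumptions, a task on node 0 whose accesses are owned by nodes 1, 1, 1, 2, 2, 0 sends 10 messages with the naive strategy and 3 with migration. When accesses alternate between nodes, so that no two consecutive accesses share an owner, the advantage shrinks, which loosely mirrors the limited-locality case the abstract mentions.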

Published in

ACM SIGPLAN Notices, Volume 49, Issue 10 (OOPSLA '14), October 2014, 907 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2714064
Editor: Andy Gill

OOPSLA '14: Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications, October 2014, 946 pages
ISBN: 9781450325851
DOI: 10.1145/2660193

Copyright © 2014 ACM

Publisher: Association for Computing Machinery, New York, NY, United States
