skip to main content
research-article
Public Access
Best Paper

Tapir: Embedding Fork-Join Parallelism into LLVM's Intermediate Representation

Published:26 January 2017Publication History
Skip Abstract Section

Abstract

This paper explores how fork-join parallelism, as supported by concurrency platforms such as Cilk and OpenMP, can be embedded into a compiler's intermediate representation (IR). Mainstream compilers typically treat parallel linguistic constructs as syntactic sugar for function calls into a parallel runtime. These calls prevent the compiler from performing optimizations across parallel control constructs. Remedying this situation is generally thought to require an extensive reworking of compiler analyses and code transformations to handle parallel semantics.

Tapir is a compiler IR that represents logically parallel tasks asymmetrically in the program's control flow graph. Tapir allows the compiler to optimize across parallel control constructs with only minor changes to its existing analyses and code transformations. To prototype Tapir in the LLVM compiler, for example, we added or modified about 6000 lines of LLVM's 4-million-line codebase. Tapir enables LLVM's existing compiler optimizations for serial code -- including loop-invariant-code motion, common-subexpression elimination, and tail-recursion elimination -- to work with parallel control constructs such as spawning and parallel loops. Tapir also supports parallel optimizations such as loop scheduling.

References

  1. S. Agarwal, R. Barik, V. Sarkar, and R. K. Shyamasundar. May-happen-in-parallel analysis of X10 programs. In PPoPP, pages 183--193, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, second edition, 2006.Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. J. Aldrich, C. Chambers, E. Sirer, and S. Eggers. Static analyses for eliminating unnecessary synchronization from Java programs. In A. Cortesi and G. File, editors, Static Analysis, volume 1694 of Lecture Notes in Computer Science, pages 19--38. 1999.Google ScholarGoogle ScholarCross RefCross Ref
  4. N. S. Arora, R. D. Blumofe, and C. G. Plaxton. Thread scheduling for multiprogrammed multiprocessors. In SPAA, pages 119--129, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. E. Ayguade, N. Copty, A. Duran, J. Hoeflinger, Y. Lin, F. Massaioli, X. Teruel, P. Unnikrishnan, and G. Zhang. The design of OpenMP tasks. IEEE Transactions on Parallel and Distributed Systems, 20(3):404--418, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. R. Barik and V. Sarkar. Interprocedural load elimination for dynamic optimization of parallel programs. In PACT, pages 41--52, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. R. Barik, J. Zhao, and V. Sarkar. Interprocedural strength reduction of critical sections in explicitly-parallel programs. In PACT, pages 29--40, 2013. Google ScholarGoogle ScholarCross RefCross Ref
  8. R. D. Blumofe and C. E. Leiserson. Scheduling multithreaded computations by work stealing. Journal of the ACM, 46(5): 720--748, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. Journal of Parallel and Distributed Computing, 37(1):55--69, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. H.-J. Boehm and S. V. Adve. Foundations of the C++ concurrency memory model. In PLDI, pages 68--78, 2008.Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. V. Cave, J. Zhao, J. Shirako, and V. Sarkar. Habanero-Java: the new adventures of old X10. In PPPJ, pages 51--61, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. P. Chatarasi, J. Shirako, and V. Sarkar. Polyhedral optimizations of explicitly parallel programs. In PACT, pages 213--226, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. W. Du, R. Ferreira, and G. Agrawal. Compiler support for exploiting coarse-grained pipelined parallelism. In SC, pages 8--21, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. M. Feng and C. E. Leiserson. Efficient detection of determinacy races in Cilk programs. Theory of Computing Systems, 32(3):301--326, 1999. Google ScholarGoogle ScholarCross RefCross Ref
  15. J. T. Fineman and C. E. Leiserson. Race detectors for Cilk and Cilk++ programs. In D. Padua, editor, Encyclopedia of Parallel Computing, pages 1706--1719. 2011.Google ScholarGoogle Scholar
  16. M. Frigo, C. E. Leiserson, and K. H. Randall. The implementation of the Cilk-5 multithreaded language. In PLDI, pages 212--223, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. GCC Team. GCC 4.9 release series changes, new features, and fixes. Available at https://gcc.gnu.org/gcc-4.9/ changes.html, 2014.Google ScholarGoogle Scholar
  18. GCC Team. GOMP -- an OpenMP implementation for GCC. Available at https://gcc.gnu.org/projects/gomp/, 2015.Google ScholarGoogle Scholar
  19. D. Grunwald and H. Srinivasan. Data flow equations for explicitly parallel programs. In PPoPP, pages 159--168, 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. C. A. R. Hoare. Algorithm 64: Quicksort. CACM, 4(7):321,1961. ISSN 0001-0782.Google ScholarGoogle Scholar
  21. Intel Corporation. Intel Cilk Plus Language Specification, 2010. Document Number: 324396-001US. Available from http://software.intel.com/sites/products/cilk-plus/cilk_plus_language_specification.pdf.Google ScholarGoogle Scholar
  22. Intel Corporation. Intel Cilk Plus Application Binary Interface Specification, 2010. Document Number: 324512-001US. Available from https://software.intel.com/sites/products/cilk-plus/cilk_plus_abi.pdf.Google ScholarGoogle Scholar
  23. Intel Corporation. Cilk Plus/LLVM. Available from http://cilkplus.github.io/, 2013.Google ScholarGoogle Scholar
  24. Intel Corporation. Intel C++ Compiler 16.0 User and Reference Guide, 2015.Google ScholarGoogle Scholar
  25. Intel Corporation. Intel Cilk Plus samples. Available from https://software.intel.com/en-us/codesamples/intel-compiler/intel-compiler-features/ intelcilkplus, 2016.Google ScholarGoogle Scholar
  26. P. G. Joisha, R. S. Schreiber, P. Banerjee, H. J. Boehm, and D. R. Chakrabarti. A technique for the effective and automatic reuse of classical compiler optimizations on multithreaded code. In POPL, pages 623--636, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. H. Jordan, S. Pellegrini, P. Thoman, K. Kofler, and T. Fahringer. INSPIRE: The Insieme parallel intermediate representation. In PACT, pages 7--18, 2013.Google ScholarGoogle ScholarCross RefCross Ref
  28. B. W. Kernighan and D. M. Ritchie. The C Programming Language. Prentice Hall, Inc., second edition, 1988.Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. D. Khaldi, P. Jouvelot, C. Ancourt, and F. Irigoin. SPIRE, a sequential to parallel intermediate representation extension. Technical report, Technical Report CRI/A-487, MINES ParisTech, 2012.Google ScholarGoogle Scholar
  30. D. Khaldi, P. Jouvelot, F. Irigoin, C. Ancourt, and B. Chapman. LLVM parallel intermediate representation: Design and evaluation using OpenSHMEM communications. In LLVM, pages 2:1--2:8, 2015.Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. J. Knoop, B. Steffen, and J. Vollmer. Parallelism for free: Efficient and optimal bitvector analyses for parallel programs. ACM TOPLAS, pages 268--299, 1996.Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. C. Lattner and V. Adve. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In CGO, pages 75--87, 2004. Google ScholarGoogle ScholarCross RefCross Ref
  33. I.-T. A. Lee, C. E. Leiserson, T. B. Schardl, Z. Zhang, and J. Sukha. On-the-fly pipeline parallelism. ACM TOPC, 2(3): 17:1--17:42, 2015.Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. J. Lee, S. P. Midkiff, and D. A. Padua. Concurrent static single assignment form and constant propagation for explicitly parallel programs. In LCPC, pages 114--130, 1997.Google ScholarGoogle Scholar
  35. C. E. Leiserson. The Cilk++ concurrency platform. Journal of Supercomputing, 51(3):244--257, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. LLVM Developer List. LLVMdev discussions on Intel OpenMP proposal. Available from http: //lists.llvm.org/pipermail/llvm-dev/2012-September/053861.html, September 2012.Google ScholarGoogle Scholar
  37. LLVM Developer List. LLVMdev Parallelization metadata and intrinsics in LLVM (for OpenMP, etc.). Available from http://lists.llvm.org/pipermail/llvm-dev/2012September/053792.html, September 2012.Google ScholarGoogle Scholar
  38. LLVM Developer List. LLVMdev discussions on OpenCL SPIR proposal. Available from http://lists.llvm.org/pipermail/llvm-dev/2012-September/053293.html, September 2012.Google ScholarGoogle Scholar
  39. LLVM Developer List. LLVMdev discussions on parallel IR. Available from http://lists.llvm.org/pipermail/llvm-dev/2015-March/083314.html, March 2015.Google ScholarGoogle Scholar
  40. LLVM Project. OpenMP®: Support for the OpenMP language. Available at http://openmp.llvm.org/, 2015.Google ScholarGoogle Scholar
  41. LLVM Project. LLVM Language Reference Manual, 2015. Available from http://llvm.org/docs/LangRef.html.Google ScholarGoogle Scholar
  42. LLVM Project. LLVM's Analysis and Transform Passes, 2015. Available from http://llvm.org/docs/Passes.html.Google ScholarGoogle Scholar
  43. Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. GraphLab: A new framework for parallel machine learning. In UAI, pages 340--349, 2010.Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein. Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment, pages 716--727, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD, pages 135--146, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. M. McCool, A. D. Robison, and J. Reinders. Structured Parallel Programming: Patterns for Efficient Computation. Elsevier Science, 2012.Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. S. P. Midkiff and D. A. Padua. Issues in the optimization of parallel programs. In ICPP, pages 105--113, 1990.Google ScholarGoogle Scholar
  48. S. S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann, 1997.Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. A. Navarro, R. Asenjo, S. Tabik, and C. Cascaval. Analytical modeling of pipeline parallelism. In PACT, pages 281--290, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. R. H. B. Netzer and B. P. Miller. What are race conditions? ACM Letters on Programming Languages and Systems, 1(1): 74--88, 1992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. D. Nguyen, A. Lenharth, and K. Pingali. A lightweight infrastructure for graph analytics. In SOSP, pages 456--471, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. D. Nguyen, A. Lenharth, and K. Pingali. Deterministic Galois: On-demand, portable and parameterless. In ASPLOS, pages 499--512, 2014.Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. D. Novillo, R. Unrau, and J. Schaeffer. Concurrent SSA form in the presence of mutual exclusion. In ICPP, pages 356--364, 1998. Google ScholarGoogle ScholarCross RefCross Ref
  54. OpenMP Architecture Review Board. OpenMP Application Program Interface, Version 4.0, July 2013. Available from http://www.openmp.org/mp-documents/ OpenMP4.0.0.pdf.Google ScholarGoogle Scholar
  55. A. Pop and A. Cohen. Preserving high-level semantics of parallel programming annotations through the compilation flow of optimizing compilers. In CPC, 2010. URL https://hal.inria.fr/inria-00551518.Google ScholarGoogle Scholar
  56. W. Pugh. Fixing the Java memory model. In JAVA, pages 89--98, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  57. E. Ruf. Effective synchronization removal for Java. In PLDI, pages 208--218, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  58. R. Rugina and M. C. Rinard. Pointer analysis for structured parallel programs. TOPLAS, pages 70--116, Jan. 2003.Google ScholarGoogle ScholarDigital LibraryDigital Library
  59. V. Sarkar. Analysis and optimization of explicitly parallel programs using the parallel program graph representation. In LCPC, pages 94--113, 1998.Google ScholarGoogle Scholar
  60. V. Sarkar and B. Simons. Parallel program graphs and their classification. In LCPC, pages 633--655, 1994. Google ScholarGoogle ScholarCross RefCross Ref
  61. T. B. Schardl, B. C. Kuszmaul, I.-T. A. Lee, W. M. Leiserson, and C. E. Leiserson. The Cilkprof scalability profiler. In SPAA, pages 89--100, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  62. J. Shun and G. E. Blelloch. Ligra: A lightweight graph processing framework for shared memory. In PPoPP, pages 135--146, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  63. J. Shun, G. E. Blelloch, J. T. Fineman, P. B. Gibbons, A. Kyrola, H. V. Simhadri, and K. Tangwongsan. Brief announcement: the Problem Based Benchmark Suite. In SPAA, pages 68--70, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  64. J. Shun, L. Dhulipala, and G. E. Blelloch. Smaller and faster: Parallel processing of compressed graphs with Ligra+. In DCC, pages 403--412, 2015.Google ScholarGoogle ScholarDigital LibraryDigital Library
  65. H. Srinivasan and D. Grunwald. An efficient construction of parallel static single assignment form for structured parallel programs. Technical report, Technical Report CU-CS-564--91, University of Colorado at Boulder, 1991.Google ScholarGoogle Scholar
  66. H. Srinivasan and M. Wolfe. Analyzing programs with explicit parallelism. In LCPC, pages 405--419, 1991.Google ScholarGoogle Scholar
  67. H. Srinivasan, J. Hook, and M. Wolfe. Static single assignment for explicitly parallel programs. In POPL, pages 260-- 272, 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  68. R. M. Stallman and the GCC Developer Community. Using the GNU Compiler Collection (for GCC version 6.1.0). Free Software Foundation, 2016.Google ScholarGoogle Scholar
  69. B. Stroustrup. The C++ Programming Language. AddisonWesley, 4th edition, 2013.Google ScholarGoogle ScholarDigital LibraryDigital Library
  70. R. Utterback, K. Agrawal, J. T. Fineman, and I.-T. A. Lee. Provably good and practically efficient parallel race detection for fork-join programs. In SPAA, pages 83--94, 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  71. V. Vafeiadis, T. Balabonski, S. Chakraborty, R. Morisset, and F. Zappa Nardelli. Common compiler optimisations are invalid in the C11 memory model and what we can do about it. In POPL, pages 209--220, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  72. J. Zhao and V. Sarkar. Intermediate language extensions for parallelism. In SPLASH, pages 329--340, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Tapir: Embedding Fork-Join Parallelism into LLVM's Intermediate Representation

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader
      About Cookies On This Site

      We use cookies to ensure that we give you the best experience on our website.

      Learn more

      Got it!