Abstract
Optimizing programs to run efficiently on modern parallel hardware is hard but crucial for many applications. The predominantly used imperative languages, such as C or OpenCL, force the programmer to intertwine the code describing functionality with the code describing optimizations. This results in a portability nightmare that is particularly problematic given the accelerating trend towards specialized hardware devices to further increase efficiency.
Many emerging DSLs used in performance-demanding domains such as deep learning or high-performance image processing attempt to simplify or even fully automate the optimization process. Using a high-level, often functional, language, programmers focus on describing functionality in a declarative way. In some systems such as Halide or TVM, a separate schedule specifies how the program should be optimized. Unfortunately, these schedules are not written in well-defined programming languages. Instead, they are implemented as a set of ad-hoc predefined APIs that the compiler writers have exposed.
In this functional pearl, we show how to employ functional programming techniques to solve this challenge with elegance. We present two functional languages that work together, each addressing a separate concern. RISE is a functional language for expressing computations using well-known functional data-parallel patterns. ELEVATE is a functional language for describing optimization strategies. A high-level RISE program is transformed into a low-level form using optimization strategies written in ELEVATE. From the rewritten low-level program, high-performance parallel code is automatically generated. In contrast to existing high-performance domain-specific systems with scheduling APIs, in our approach programmers are not restricted to a set of built-in operations and optimizations but freely define their own computational patterns in RISE and optimization strategies in ELEVATE in a composable and reusable way. We show how our holistic functional approach achieves performance competitive with the state-of-the-art imperative systems Halide and TVM.
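The idea of composing optimization strategies from small rewrite steps can be illustrated with a minimal sketch. It is not the RISE or ELEVATE implementation: the toy expression type (`Var`, `Map`), the rule `fuse_maps`, and the combinators `seq`, `left_choice`, `try_`, and `repeat` are all hypothetical names chosen for illustration, and the sketch assumes that a strategy is simply a function that returns a rewritten expression, or `None` when it does not apply.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Union


# Toy expression language standing in for a data-parallel program
# (an assumption for illustration, far simpler than RISE).
@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Map:
    f: str        # name of the function applied to every element
    src: "Expr"   # the expression producing the input array


Expr = Union[Var, Map]

# A strategy rewrites an expression, or returns None if it does not apply.
Strategy = Callable[[Expr], Optional[Expr]]


def fuse_maps(e: Expr) -> Optional[Expr]:
    """Rewrite rule: map f (map g xs)  ->  map (f . g) xs."""
    if isinstance(e, Map) and isinstance(e.src, Map):
        return Map(f"{e.f} . {e.src.f}", e.src.src)
    return None


def identity(e: Expr) -> Optional[Expr]:
    """Strategy that always succeeds and changes nothing."""
    return e


def seq(first: Strategy, second: Strategy) -> Strategy:
    """Apply `first`, then `second`; fail if either one fails."""
    def run(e: Expr) -> Optional[Expr]:
        r = first(e)
        return None if r is None else second(r)
    return run


def left_choice(first: Strategy, second: Strategy) -> Strategy:
    """Apply `first`; only if it fails, apply `second` to the original input."""
    def run(e: Expr) -> Optional[Expr]:
        r = first(e)
        return r if r is not None else second(e)
    return run


def try_(s: Strategy) -> Strategy:
    """Apply `s` if possible, otherwise leave the expression unchanged."""
    return left_choice(s, identity)


def repeat(s: Strategy) -> Strategy:
    """Apply `s` until it no longer applies; always succeeds."""
    def run(e: Expr) -> Optional[Expr]:
        while True:
            r = s(e)
            if r is None:
                return e
            e = r
    return run


program = Map("f", Map("g", Map("h", Var("xs"))))
fused = repeat(fuse_maps)(program)
print(fused)  # Map(f='f . g . h', src=Var(name='xs'))
```

Because strategies here are ordinary functions, a new optimization is defined by composing existing ones rather than by extending a fixed scheduling API, which is the contrast the abstract draws with Halide and TVM schedules.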
Index Terms
Achieving high-performance the functional way: a functional pearl on expressing high-performance optimizations as rewrite strategies