Abstract
We present a lightweight Coq framework for optimizing tensor kernels written in a pure, functional array language. Optimizations are scheduled by the user as a series of verified, semantics-preserving rewrites. Unusually for compilation targeting imperative code with arrays and nested loops, all rewrites are source-to-source within a purely functional language. Our language comprises a set of core constructs for expressing high-level computation detail and a set of what we call reshape operators, which can be derived from core constructs but trigger low-level decisions about storage patterns and ordering. We demonstrate that this system can derive the optimizations of existing state-of-the-art languages such as Halide and generate comparably performant code, and that it can also schedule a family of useful program transformations beyond what is reachable in Halide.
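To make the idea of a source-to-source scheduling rewrite concrete, the following is a minimal Python sketch, not the framework's Coq syntax. The names `gen`, `split`, `flatten`, `scale_naive`, and `scale_tiled` are hypothetical stand-ins chosen for illustration: `gen` plays the role of a core construct, while `split` and `flatten` play the role of reshape operators that are definable from `gen` but expose a tiling decision. The equality is only checked on one input here, whereas the framework described above proves such rewrites once and for all.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def gen(n: int, f: Callable[[int], T]) -> List[T]:
    """Hypothetical core construct: build a length-n array from an index function."""
    return [f(i) for i in range(n)]


def split(k: int, xs: List[T]) -> List[List[T]]:
    """Hypothetical reshape operator: view xs as len(xs)//k rows of width k."""
    assert k > 0 and len(xs) % k == 0, "k must divide len(xs)"
    # Defined purely in terms of gen, so it adds no expressive power.
    return gen(len(xs) // k, lambda i: gen(k, lambda j: xs[i * k + j]))


def flatten(xss: List[List[T]]) -> List[T]:
    """Hypothetical reshape operator: concatenate the rows of a 2-D array."""
    return [x for row in xss for x in row]


def scale_naive(xs: List[float]) -> List[float]:
    """Unscheduled kernel: one flat loop over all elements."""
    return gen(len(xs), lambda i: 2.0 * xs[i])


def scale_tiled(k: int, xs: List[float]) -> List[float]:
    """Same kernel after an assumed tiling rewrite:
    gen n f  ==>  flatten (gen (n/k) (fun i => gen k (fun j => f (i*k + j))))
    """
    n = len(xs)
    assert k > 0 and n % k == 0, "k must divide len(xs)"
    return flatten(gen(n // k, lambda i: gen(k, lambda j: 2.0 * xs[i * k + j])))


xs = [float(i) for i in range(12)]
tiled_matches_naive = scale_tiled(4, xs) == scale_naive(xs)
roundtrip_is_identity = flatten(split(3, xs)) == xs
```

Both programs stay inside the same pure language. Only the nesting structure changes, which is the kind of decision a lowering to nested imperative loops could then follow.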
Index Terms
Verified tensor-program optimization via high-level scheduling rewrites