Verified tensor-program optimization via high-level scheduling rewrites

Published: 12 January 2022

Abstract

We present a lightweight Coq framework for optimizing tensor kernels written in a pure, functional array language. Optimizations rely on user scheduling via series of verified, semantics-preserving rewrites. Unusually for compilation targeting imperative code with arrays and nested loops, all rewrites are source-to-source within a purely functional language. Our language comprises a set of core constructs for expressing high-level computational detail and a set of what we call reshape operators, which can be derived from core constructs but trigger low-level decisions about storage patterns and ordering. We demonstrate that this system is not only capable of deriving the optimizations of existing state-of-the-art languages like Halide and generating comparably performant code, but also able to schedule a family of useful program transformations beyond what is reachable in Halide.
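To make the idea of a source-to-source scheduling rewrite concrete, the following is a minimal illustrative sketch, not the paper's Coq framework: a "split" (loop-tiling) rewrite expressed entirely on a pure functional array program. The helper names `gen`, `program`, and `program_split` are hypothetical, chosen for this example; the point is only that the scheduled version is a rewritten pure expression computing the same values.

```python
def gen(n, f):
    """Build an array of length n from an index function (a pure 'generate')."""
    return [f(i) for i in range(n)]

def program(x):
    # Original schedule: a flat elementwise computation.
    return gen(len(x), lambda i: 2 * x[i] + 1)

def program_split(x, tile=4):
    # After a semantics-preserving "split" rewrite: the same computation
    # expressed as tiles of size `tile`, then flattened. This mirrors how
    # loop splitting can be phrased source-to-source in a functional language,
    # even though it ultimately drives imperative loop-nest structure.
    n = len(x)
    assert n % tile == 0  # simplification for the sketch: tile divides n evenly
    tiles = gen(n // tile,
                lambda io: gen(tile, lambda ii: 2 * x[io * tile + ii] + 1))
    return [v for t in tiles for v in t]

xs = list(range(8))
assert program(xs) == program_split(xs)  # the rewrite preserves semantics
```

In the paper's setting, each such rewrite step would additionally carry a machine-checked proof of equivalence; here the equality check only spot-tests it on one input.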


Supplemental Material

Auxiliary Presentation Video

This is the presentation video for our paper, accepted into the POPL 2022 research track. In the paper we introduce a verified framework for optimizing tensor programs and present a small evaluation with preliminary results.

References

  1. Michael Bauer, Sean Treichler, Elliott Slaughter, and Alex Aiken. 2012. Legion: expressing locality and independence with logical regions. In SC Conference on High Performance Computing Networking, Storage and Analysis, SC '12. IEEE, Piscataway, NJ, USA, 66. https://doi.org/10.1109/SC.2012.71
  2. Gilbert Bernstein, Michael Mara, Tzu-Mao Li, Dougal Maclaurin, and Jonathan Ragan-Kelley. 2020. Differentiating a Tensor Language. arXiv:2008.11256.
  3. Manuel M. T. Chakravarty, Gabriele Keller, Sean Lee, Trevor L. McDonell, and Vinod Grover. 2011. Accelerating Haskell array codes with multicore GPUs. In Proceedings of the POPL 2011 Workshop on Declarative Aspects of Multicore Programming, Manuel Carro and John H. Reppy (Eds.). Association for Computing Machinery, New York, NY, USA, 3–14. https://doi.org/10.1145/1926354.1926358
  4. B. L. Chamberlain, D. Callahan, and H. P. Zima. 2007. Parallel Programmability and the Chapel Language. The International Journal of High Performance Computing Applications 21, 3 (2007), 291–312. https://doi.org/10.1177/1094342007078442
  5. Bradford L. Chamberlain. 2001. The design and implementation of a region-based parallel programming language. Ph.D. Dissertation. University of Washington.
  6. Chun Chen, Jacqueline Chame, and Mary Hall. 2008. CHiLL: A framework for composing high-level loop transformations. University of Southern California.
  7. Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: An Automated End-to-end Optimizing Compiler for Deep Learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI '18). USENIX Association, Berkeley, CA, USA, 579–594. ISBN 978-1-931971-47-8. http://dl.acm.org/citation.cfm?id=3291168.3291211
  8. Benjamin Delaware, Clément Pit-Claudel, Jason Gross, and Adam Chlipala. 2015. Fiat: Deductive Synthesis of Abstract Data Types in a Proof Assistant. In ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2015, Mumbai, India, January 15-17, 2015, 689–700. https://doi.org/10.1145/2676726.2677006
  9. Benjamin Delaware, Sorawit Suriyakarn, Clément Pit-Claudel, Qianchuan Ye, and Adam Chlipala. 2019. Narcissus: Correct-By-Construction Derivation of Decoders and Encoders from Binary Formats. In Proc. ICFP. https://doi.org/10.1145/3341686
  10. Sébastien Donadio, James C. Brodman, Thomas Roeder, Kamen Yotov, Denis Barthou, Albert Cohen, María Jesús Garzarán, David A. Padua, and Keshav Pingali. 2005. A Language for the Compact Representation of Multiple Program Versions. In Languages and Compilers for Parallel Computing, 18th International Workshop, LCPC 2005. Springer Berlin Heidelberg, Berlin, Heidelberg, 136–151. https://doi.org/10.1007/978-3-540-69330-7_10
  11. Kayvon Fatahalian, Daniel Reiter Horn, Timothy J. Knight, Larkhoon Leem, Mike Houston, Ji Young Park, Mattan Erez, Manman Ren, Alex Aiken, William J. Dally, and Pat Hanrahan. 2006. Sequoia: Programming the Memory Hierarchy. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (SC '06). Association for Computing Machinery, New York, NY, USA, 83–es. ISBN 0769527000. https://doi.org/10.1145/1188455.1188543
  12. Rongxiao Fu, Xueying Qin, Ornela Dardha, and Michel Steuwer. 2021. Row-Polymorphic Types for Strategic Rewriting. arXiv:2103.13390.
  13. Ronald L. Graham, Donald E. Knuth, and Oren Patashnik. 2011. Concrete Mathematics. Addison-Wesley, 36–37.
  14. Bastian Hagedorn, Archibald Samuel Elliott, Henrik Barthels, Rastislav Bodik, and Vinod Grover. 2020. Fireiron: A Scheduling Language for High-Performance Linear Algebra on GPUs. arXiv:2003.06324.
  15. Albert Hartono, Boyana Norris, and Ponnuswamy Sadayappan. 2009. Annotation-based empirical performance tuning using Orio. In 23rd IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2009, Rome, Italy, May 23-29, 2009. IEEE, Piscataway, NJ, USA, 1–11. https://doi.org/10.1109/IPDPS.2009.5161004
  16. Troels Henriksen, Niels G. W. Serup, Martin Elsman, Fritz Henglein, and Cosmin E. Oancea. 2017. Futhark: Purely Functional GPU-programming with Nested Parallelism and In-place Array Updates. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2017). ACM, New York, NY, USA, 556–571. ISBN 978-1-4503-4988-8. https://doi.org/10.1145/3062341.3062354
  17. Kesha Hietala, Robert Rand, Shih-Han Hung, Xiaodi Wu, and Michael Hicks. 2021. A verified optimizer for Quantum circuits. Proceedings of the ACM on Programming Languages 5, POPL (Jan. 2021), 1–29. ISSN 2475-1421. https://doi.org/10.1145/3434318
  18. Yuanming Hu, Tzu-Mao Li, Luke Anderson, Jonathan Ragan-Kelley, and Frédo Durand. 2019. Taichi: a language for high-performance computation on spatially sparse data structures. ACM Trans. Graph. 38, 6 (2019), 201:1–201:16. https://doi.org/10.1145/3355089.3356506
  19. Kenneth E. Iverson. 1962. A Programming Language. John Wiley & Sons, Inc., New York, NY, USA. ISBN 0-471430-14-5.
  20. Fredrik Kjolstad, Shoaib Kamil, Stephen Chou, David Lugato, and Saman Amarasinghe. 2017. The tensor algebra compiler. Proceedings of the ACM on Programming Languages 1, OOPSLA (Oct. 2017), 1–29. https://doi.org/10.1145/3133901
  21. Steve Kommrusch, Théo Barollet, and Louis-Noël Pouchet. 2021. Proving Equivalence Between Complex Expressions Using Graph-to-Sequence Neural Models. arXiv:2106.02452.
  22. Tzu-Mao Li, Michaël Gharbi, Andrew Adams, Frédo Durand, and Jonathan Ragan-Kelley. 2018. Differentiable programming for image processing and deep learning in Halide. ACM Trans. Graph. (Proc. SIGGRAPH) 37, 4 (2018), 139:1–139:13. https://doi.org/10.1145/3197517.3201383
  23. Adam Paszke, Daniel D. Johnson, David Duvenaud, Dimitrios Vytiniotis, Alexey Radul, Matthew J. Johnson, Jonathan Ragan-Kelley, and Dougal Maclaurin. 2021. Getting to the Point: Index Sets and Parallelism-Preserving Autodiff for Pointful Array Programming. In The 25th ACM SIGPLAN International Conference on Functional Programming (ICFP). ACM. https://doi.org/10.1145/3473593
  24. Clément Pit-Claudel, Peng Wang, Benjamin Delaware, Jason Gross, and Adam Chlipala. 2020. Extensible Extraction of Efficient Imperative Programs with Foreign Functions, Manually Managed Memory, and Proofs. In IJCAR '20: Proceedings of the 9th International Joint Conference on Automated Reasoning. https://doi.org/10.1007/978-3-030-51054-1_7
  25. Jonathan Ragan-Kelley, Andrew Adams, Sylvain Paris, Marc Levoy, Saman P. Amarasinghe, and Frédo Durand. 2012. Decoupling algorithms from schedules for easy optimization of image processing pipelines. ACM Trans. Graph. 31, 4 (2012), 32:1–32:12. https://doi.org/10.1145/2185520.2185528
  26. Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines. In Proc. PLDI. ACM, Seattle. https://doi.org/10.1145/2491956.2462176
  27. Justin Slepak, Olin Shivers, and Panagiotis Manolios. 2014. An Array-Oriented Language with Static Rank Polymorphism. In Programming Languages and Systems, Zhong Shao (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 27–46. ISBN 978-3-642-54833-8. https://doi.org/10.1007/978-3-642-54833-8_3
  28. Gus Henry Smith, Andrew Liu, Steven Lyubomirsky, Scott Davidson, Joseph McMahan, Michael Taylor, Luis Ceze, and Zachary Tatlock. 2021. Pure Tensor Program Rewriting via Access Patterns (Representation Pearl). In Proceedings of the 5th ACM SIGPLAN International Symposium on Machine Programming (MAPS 2021). Association for Computing Machinery, New York, NY, USA, 21–31. ISBN 9781450384674. https://doi.org/10.1145/3460945.3464953
  29. Michel Steuwer, Chris Fensch, Sam Lindley, and Christophe Dubach. 2015. Generating Performance Portable Code using Rewrite Rules: From High-Level Functional Expressions to High-Performance OpenCL Code. In Proceedings of the 20th ACM SIGPLAN International Conference on Functional Programming. Association for Computing Machinery. https://doi.org/10.1145/2784731.2784754
  30. Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S. Moses, Sven Verdoolaege, Andrew Adams, and Albert Cohen. 2018. Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions. arXiv:1802.04730.
  31. Anand Venkat, Tharindu Rusira, Raj Barik, Mary Hall, and Leonard Truong. 2019. SWIRL: High-performance many-core CPU code generation for deep neural networks. The International Journal of High Performance Computing Applications 33, 6 (2019), 1275–1289. https://doi.org/10.1177/1094342019866247
  32. Qing Yi, Keith Seymour, Haihang You, Richard W. Vuduc, and Daniel J. Quinlan. 2007. POET: Parameterized Optimizations for Empirical Tuning. In 21st International Parallel and Distributed Processing Symposium (IPDPS 2007). IEEE, Piscataway, NJ, USA, 1–8. https://doi.org/10.1109/IPDPS.2007.370637
  33. Yunming Zhang, Mengjiao Yang, Riyadh Baghdadi, Shoaib Kamil, Julian Shun, and Saman P. Amarasinghe. 2018. GraphIt: a high-performance graph DSL. PACMPL 2, OOPSLA (2018), 121:1–121:30. https://doi.org/10.1145/3276491
