Research Article · Open Access

Efficient automatic scheduling of imaging and vision pipelines for the GPU

Published: 15 October 2021

Abstract

We present a new algorithm to quickly generate high-performance GPU implementations of complex imaging and vision pipelines, directly from high-level Halide algorithm code. It is fully automatic, requiring no schedule templates or hand-optimized kernels. We address the scalability challenge of extending search-based automatic scheduling to map large real-world programs to the deep hierarchies of memory and parallelism on GPU architectures in reasonable compile time. We achieve this using (1) a two-phase search algorithm that first ‘freezes’ decisions for the lowest cost sections of a program, allowing relatively more time to be spent on the important stages, (2) a hierarchical sampling strategy that groups schedules based on their structural similarity, then samples representatives to be evaluated, allowing us to explore a large space with few samples, and (3) memoization of repeated partial schedules, amortizing their cost over all their occurrences. We guide the process with an efficient cost model combining machine learning, program analysis, and GPU architecture knowledge. We evaluate our method’s performance on a diverse suite of real-world imaging and vision pipelines. Our scalability optimizations lead to average compile time speedups of 49x (up to 530x). We find schedules that are on average 1.7x faster than existing automatic solutions (up to 5x), and competitive with what the best human experts were able to achieve in an active effort to beat our automatic results.
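The memoization optimization (point 3 above) amortizes the cost of evaluating repeated partial schedules across all their occurrences. A minimal illustrative sketch of this idea follows; it is not the paper's implementation, and the names (`structural_key`, `MemoizedCostModel`) and the dict-based representation of a partial schedule are hypothetical. The point is that structurally identical partial schedules are canonicalized to one key, so the expensive cost model runs once per distinct structure no matter how often that structure recurs during search.

```python
def structural_key(partial_schedule):
    """Hypothetical canonicalization: sort (stage, decision) pairs so
    structurally identical partial schedules map to the same key."""
    return tuple(sorted(partial_schedule.items()))


class MemoizedCostModel:
    """Caches cost-model evaluations of partial schedules by structure."""

    def __init__(self, evaluate):
        self.evaluate = evaluate  # the underlying (expensive) cost model
        self.cache = {}
        self.hits = 0

    def cost(self, partial_schedule):
        key = structural_key(partial_schedule)
        if key in self.cache:
            self.hits += 1           # repeated structure: reuse cached cost
        else:
            self.cache[key] = self.evaluate(partial_schedule)
        return self.cache[key]
```

In a tree search over schedules, the same partial schedule for a low-level stage can appear under many different ancestors; with a cache like this, its cost is paid once and reused everywhere else.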


Supplemental Material

Auxiliary Presentation Video

Presentation video for the paper "Efficient Automatic Scheduling of Imaging and Vision Pipelines for the GPU" at OOPSLA 2021.


Published in

Proceedings of the ACM on Programming Languages, Volume 5, Issue OOPSLA (October 2021), 2001 pages.
EISSN: 2475-1421
DOI: 10.1145/3492349

Copyright © 2021 Owner/Author

Publisher

Association for Computing Machinery, New York, NY, United States

