
An effective fusion and tile size model for optimizing image processing pipelines

Published: 10 February 2018

Abstract

Effective models for fusion of loop nests remain a challenge in both general-purpose and domain-specific language (DSL) compilers. The difficulty often arises from the combinatorial explosion of grouping choices and their interaction with parallelism and locality. This paper presents a new fusion algorithm for high-performance domain-specific compilers for image processing pipelines. The algorithm uses dynamic programming to explore spaces of fusion possibilities not covered by previous approaches, and it is guided by a cost function that captures optimization criteria more concretely and precisely than prior approaches. The fusion model is particularly tailored to the transformation and optimization sequence applied by PolyMage and Halide, two recent DSLs for image processing pipelines. When implemented in PolyMage, our model-driven technique provides significant improvements over PolyMage's approach, which uses auto-tuning to aid its model (by up to 4.32X), and over Halide's automatic approach (by up to 2.46X) on two state-of-the-art shared-memory multicore architectures.
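To give a feel for what a dynamic-programming search over fusion groupings can look like, here is a minimal sketch in Python. It is not the paper's algorithm. It assumes a purely linear pipeline (real pipelines are DAGs), only considers contiguous groups, and uses a made-up cost function. The names `best_grouping` and `toy_cost` and all constants are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Tuple


def best_grouping(
    num_stages: int,
    group_cost: Callable[[int, int], float],
) -> Tuple[float, List[Tuple[int, int]]]:
    """Split stages 0..num_stages-1 into contiguous fused groups.

    group_cost(i, j) is an assumed cost of fusing stages i..j-1 into one
    group. Returns the minimum total cost and the chosen groups as
    half-open (start, end) index pairs.
    """
    # best[k] = (minimum cost of grouping the first k stages,
    #            start index of the last group in that solution)
    best = [(0.0, 0)] + [(float("inf"), 0)] * num_stages
    for end in range(1, num_stages + 1):
        for start in range(end):
            cost = best[start][0] + group_cost(start, end)
            if cost < best[end][0]:
                best[end] = (cost, start)

    # Walk the back-pointers to recover the groups.
    groups: List[Tuple[int, int]] = []
    end = num_stages
    while end > 0:
        start = best[end][1]
        groups.append((start, end))
        end = start
    groups.reverse()
    return best[num_stages][0], groups


def toy_cost(start: int, end: int) -> float:
    """Hypothetical cost: a fixed charge per group (standing in for writing
    the group's output to memory) plus a penalty that grows with group size
    (standing in for redundant computation in overlapped tiles)."""
    size = end - start
    return 10 + 2 * (size - 1) ** 2


total, groups = best_grouping(6, toy_cost)
print(total, groups)
```

With this toy cost, fusing everything is expensive because the size penalty grows quadratically, while fusing nothing pays the fixed charge six times, so the search settles on medium-sized groups. A realistic model would replace `toy_cost` with estimates of locality, parallelism, and recomputation, and would have to handle non-linear stage graphs.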
