Abstract
Effective models for fusing loop nests remain a challenge in both general-purpose and domain-specific language (DSL) compilers. The difficulty often arises from the combinatorial explosion of grouping choices and their interaction with parallelism and locality. This paper presents a new fusion algorithm for high-performance domain-specific compilers for image processing pipelines. The algorithm uses dynamic programming to explore spaces of fusion possibilities not covered by previous approaches, and it is guided by a cost function that captures the optimization criteria more concretely and precisely than prior approaches. The fusion model is tailored to the transformation and optimization sequence applied by PolyMage and Halide, two recent DSLs for image processing pipelines. When implemented in PolyMage, our model-driven technique provides significant improvements over PolyMage's approach, which uses auto-tuning to aid its model (up to 4.32×), and over Halide's automatic approach (up to 2.46×) on two state-of-the-art shared-memory multicore architectures.
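To make the idea of cost-driven grouping by dynamic programming concrete, the following is a minimal sketch under strong simplifying assumptions. It is not the paper's algorithm or cost function: it only handles a linear chain of stages, it only considers contiguous groups, and the names `best_grouping`, `toy_cost`, and `cache_capacity`, as well as the per-stage footprint numbers, are hypothetical and chosen for illustration.

```python
from functools import lru_cache


def best_grouping(stages, group_cost):
    """Return (total_cost, groups) minimising the summed cost of contiguous
    groups of a linear pipeline. Illustrative only; a real pipeline is a DAG."""
    n = len(stages)

    @lru_cache(maxsize=None)
    def solve(i):
        # Best grouping of the suffix stages[i:]; memoised so each suffix
        # is solved once.
        if i == n:
            return 0.0, ()
        best = None
        for j in range(i + 1, n + 1):
            group = tuple(stages[i:j])            # candidate fused group
            rest_cost, rest_groups = solve(j)     # optimal remainder
            cand = (group_cost(group) + rest_cost, (group,) + rest_groups)
            if best is None or cand[0] < best[0]:
                best = cand
        return best

    return solve(0)


def toy_cost(group, cache_capacity=3):
    """Hypothetical cost: one unit per group (a round trip to memory) plus a
    penalty when the group's combined footprint exceeds the cache capacity."""
    footprint = sum(size for _, size in group)
    overflow = max(0, footprint - cache_capacity)
    return 1 + 10 * overflow


# Hypothetical stages as (name, footprint) pairs.
stages = [("blur_x", 1), ("blur_y", 1), ("sharpen", 2), ("mask", 1)]
cost, groups = best_grouping(stages, toy_cost)
print(cost, [[name for name, _ in g] for g in groups])
```

Even in this toy setting the trade-off is visible: fusing more stages saves a memory round trip per group, but a group whose footprint overflows the assumed cache is penalised, so the minimum lies between fusing everything and fusing nothing.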
Supplemental Material
PolyMage PPoPP 2018 artifact
Index Terms
An effective fusion and tile size model for optimizing image processing pipelines