Abstract
Loop fusion is an important compiler optimization for improving memory hierarchy performance by enabling data reuse. Traditional compilers have approached loop fusion in a manner decoupled from other high-level loop optimizations, missing several interesting solutions. Recently, the polyhedral compiler framework, with its ability to compose complex transformations, has proved promising for loop optimization of small programs. However, our experiments with large programs using state-of-the-art polyhedral compiler frameworks reveal suboptimal fusion partitions in the transformed code. We trace the reason to the lack of an effective cost model for choosing a good fusion partitioning among the possible choices, whose number increases exponentially with the number of program statements. In this paper, we propose a fusion algorithm that chooses good fusion partitions with two objective functions: achieving good data reuse and preserving the parallelism inherent in the source code. These objectives, although targeted by previous work in traditional compilers, pose new challenges within the polyhedral compiler framework and have thus not been addressed there. Our algorithm incorporates several heuristics that work effectively within the polyhedral compiler framework and allow us to achieve the proposed objectives. Experimental results show that our fusion algorithm achieves performance comparable to existing polyhedral compilers on small kernel programs, and significantly outperforms them on large benchmark programs such as those in the SPEC benchmark suite.
Revisiting loop fusion in the polyhedral framework
PPoPP '14: Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming