Abstract
Affine transformations have proven to be very powerful for loop restructuring because they can model a very wide range of transformations. A single multi-dimensional affine function can represent a long and complex sequence of simpler transformations. Existing affine transformation frameworks such as the Pluto algorithm, which include a cost function for modern multicore architectures where coarse-grained parallelism and locality are crucial, consider only a sub-space of transformations to avoid a combinatorial explosion in the search for transformations. The ensuing practical trade-offs exclude certain useful transformations, in particular transformation compositions involving loop reversals and loop skewing by negative factors. In this paper, we propose an approach that addresses this limitation by modeling a much larger space of affine transformations in conjunction with the Pluto algorithm's cost function. We experimentally evaluate both the effect on compilation time and the performance of the generated code. The evaluation shows that our new framework, Pluto+, causes no degradation in performance on any of the Polybench benchmarks. For Lattice Boltzmann Method (LBM) codes with periodic boundary conditions, it provides a mean speedup of 1.33x over Pluto. We also show that Pluto+ does not increase compile times significantly. Experimental results on Polybench show that Pluto+ increases overall polyhedral source-to-source optimization time by only 15%. In cases where it improves execution time significantly, it increases polyhedral optimization time by only 2.04x.
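As an illustration of the claim that a single affine function can stand for a sequence of simpler transformations, the following minimal sketch composes a loop reversal, a loop skew, and a loop interchange on a two-dimensional iteration vector (t, i) into one integer matrix. The helper names (`matmul`, `apply`) and the particular matrices are assumptions chosen for illustration; they are not taken from the paper or from the Pluto or Pluto+ implementations. The composed matrix contains a negative coefficient, which corresponds to the reversal and negative-factor skewing that the abstract identifies as the excluded cases.

```python
def matmul(a, b):
    """Multiply two small integer matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def apply(t, point):
    """Apply the linear part of an affine transformation to an iteration point."""
    return tuple(sum(c * x for c, x in zip(row, point)) for row in t)


# Elementary transformations on the iteration vector (t, i).
# These matrices are illustrative assumptions, not taken from the paper.
REVERSE_I = [[1, 0], [0, -1]]    # (t, i) -> (t, -i)
SKEW_I_BY_T = [[1, 0], [1, 1]]   # (t, i) -> (t, t + i)
INTERCHANGE = [[0, 1], [1, 0]]   # (a, b) -> (b, a)

# Reversal first, then skew, then interchange: matrices compose right to left.
COMPOSED = matmul(INTERCHANGE, matmul(SKEW_I_BY_T, REVERSE_I))


def stepwise(point):
    """Apply the three elementary transformations one after another."""
    for step in (REVERSE_I, SKEW_I_BY_T, INTERCHANGE):
        point = apply(step, point)
    return point


if __name__ == "__main__":
    print("composed matrix:", COMPOSED)
    for p in [(0, 0), (1, 3), (2, 5)]:
        print(p, "->", apply(COMPOSED, p), "stepwise:", stepwise(p))
```

Applying `COMPOSED` to an iteration point gives the same result as applying the three elementary steps in order, so the single matrix carries the whole sequence.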
PLUTO+: near-complete modeling of affine transformations for parallelism and locality