skip to main content
research-article
Free Access

The Pluto+ Algorithm: A Practical Approach for Parallelization and Locality Optimization of Affine Loop Nests

Published:08 April 2016Publication History
Skip Abstract Section

Abstract

Affine transformations have proven to be powerful for loop restructuring due to their ability to model a very wide range of transformations. A single multidimensional affine function can represent a long and complex sequence of simpler transformations. Existing affine transformation frameworks such as the Pluto algorithm, which include a cost function for modern multicore architectures for which coarse-grained parallelism and locality are crucial, consider only a subspace of transformations to avoid a combinatorial explosion in finding transformations. The ensuing practical trade-offs lead to the exclusion of certain useful transformations: in particular, transformation compositions involving loop reversals and loop skewing by negative factors. In addition, there is currently no proof that the algorithm successfully finds a tree of permutable loop bands for all affine loop nests. In this article, we propose an approach to address these two issues (1) by modeling a much larger space of practically useful affine transformations in conjunction with the existing cost function of Pluto, and (2) by extending the Pluto algorithm in a way that allows a proof for its soundness and completeness for all affine loop nests. We perform an experimental evaluation of both, the effect on compilation time, and performance of generated codes. The evaluation shows that our new framework, Pluto+, provides no degradation in performance for any benchmark from Polybench. For the Lattice Boltzmann Method (LBM) simulations with periodic boundary conditions, it provides a mean speedup of 1.33 × over Pluto. We also show that Pluto+ does not increase compilation time significantly. Experimental results on Polybench show that Pluto+ increases overall polyhedral source-to-source optimization time by only 15%. In cases in which it improves execution time significantly, it increased polyhedral optimization time by only 2.04 × .

References

  1. Aravind Acharya and Uday Bondhugula. 2015. PLUTO+: Near-complete modeling of affine transformations for parallelism and locality. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’15). ACM, New York, NY. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. N. Ahmed, Nikolay Mateev, and Keshav Pingali. 2001. Synthesizing transformations for locality enhancement of imperfectly-nested loops. International Journal of Parallel Programming 29, 5. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman, and Monica S. Lam. 2006. Compilers: Principles, Techniques, and Tools (2nd Edition). Prentice Hall, Upper Saddle River, NJ. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Corinne Ancourt and Francois Irigoin. 1991. Scanning polyhedra with DO loops. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, New York, NY, 39--50. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Vinayaka Bandishti, Irshad Pananilath, and Uday Bondhugula. 2012. Tiling stencil computations to maximize parallelism. In Supercomputing. Article 40, 11 pages. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Cédric Bastoul. 2004. Code generation in the polyhedral model is easier than you think. In International Conference on Parallel Architectures and Compilation Techniques (PACT). 7--16. DOI:http://dx.doi.org/10.1109/PACT.2004.11 Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Uday Bondhugula. 2008. Effective Automatic Parallelization and Locality Optimization Using the Polyhedral Model. Ph.D. Dissertation. The Ohio State University, Columbus, OH. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Uday Bondhugula, Vinayaka Bandishti, Albert Cohen, Guillain Potron, and Nicolas Vasilache. 2014. Tiling and optimizing time-iterated computations on periodic domains. In International Conference on Parallel Architectures and Compilation Techniques (PACT’14). 39--50. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Uday Bondhugula, M. Baskaran, Sriram Krishnamoorthy, J. Ramanujam, A. Rountev, and P. Sadayappan. 2008a. Automatic transformations for communication-minimized parallelization and locality optimization in the polyhedral model. In International Conference on Compiler Construction (ETAPS CC’08). Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Uday Bondhugula, Oktay Gunluk, Sanjeeb Dash, and Lakshminarayanan Renganarayanan. 2010. A model for fusion and code motion in an automatic parallelizing compiler. In International Conference on Parallel Architectures and Compilation Techniques. 343--352. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. 2008b. A practical automatic polyhedral parallelizer and locality optimizer. In ACM SIGPLAN Symposium on Programming Languages Design and Implementation (PLDI’08). ACM, New York, NY, 101--113. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Chun Chen. 2012. Polyhedra scanning revisited. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’12). ACM, New York, NY, 499--508. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Shiyi Chen and Gary D. Doolen. 1998. Lattice Boltzmann method for fluid flows. Annual Review of Fluid Mechanics 30, 1, 329--364.Google ScholarGoogle ScholarCross RefCross Ref
  14. C. Choffrut and K. Culik. 1983. Folding of the plane and the design of systolic arrays. Information Processing Letters 17, 3, 149--153.Google ScholarGoogle ScholarCross RefCross Ref
  15. Cloog 2004. The CLooG Code Generator in the Polyhedral Model. http://www.cloog.org.Google ScholarGoogle Scholar
  16. COIN-OR. 2001. Computational Infrastructure for Operations Research. Retrieved March 2, 2016 from http://www.coin-or.org.Google ScholarGoogle Scholar
  17. Dominique d’Humières. 2002. Multiple--relaxation--time Lattice Boltzmann models in three dimensions. Philosophical Transactions of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences 360, 1792, 437--451.Google ScholarGoogle Scholar
  18. P. Feautrier. 1988. Parametric integer programming. RAIRO Recherche Opérationnelle 22, 3, 243--268.Google ScholarGoogle ScholarCross RefCross Ref
  19. P. Feautrier. 1991. Dataflow analysis of scalar and array references. International Journal of Parallel Programming 20, 1, 23--53.Google ScholarGoogle ScholarCross RefCross Ref
  20. P. Feautrier. 1992a. Some efficient solutions to the affine scheduling problem: Part I, one-dimensional time. International Journal of Parallel Programming 21, 5, 313--348. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. P. Feautrier. 1992b. Some efficient solutions to the affine scheduling problem: Part II, multidimensional time. International Journal of Parallel Programming 21, 6, 389--420. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. GNU. 2012. GLPK (GNU Linear Programming Kit). Retrieved March 2, 2016 from https://www.gnu.org/software/glpk/.Google ScholarGoogle Scholar
  23. Martin Griebl. 2004. Automatic Parallelization of Loop Programs for Distributed Memory Architectures. Habilitation thesis. University of Passau, Passau, Germany.Google ScholarGoogle Scholar
  24. Tobias Grosser, Hongbin Zheng, Ragesh Aloor, Andreas Simbrger, Armin Grolinger, and Louis-Noël Pouchet. 2011. Polly polyhedral optimization in LLVM. In Proceedings of the 1st International Workshop on Polyhedral Compilation Techniques (IMPACT’11).Google ScholarGoogle Scholar
  25. Albert Hartono, Muthu Baskaran, Cédric Bastoul, Albert Cohen, Sriram Krishnamoorthy, Boyana Norris, and J. Ramanujam. 2009. A parametric multi-level tiler for imperfect loop nests. In International Conference on Supercomputing (ICS’09). Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Tom Henretty, Kevin Stock, Louis-Noël Pouchet, Franz Franchetti, J. Ramanujam, and P. Sadayappan. 2011. Data layout transformation for stencil computations on short SIMD architectures. In ETAPS International Conference on Compiler Construction (CC’11). 225--245. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Tom Henretty, Richard Veras, Franz Franchetti, Louis-Noël Pouchet, J. Ramanujam, and P. Sadayappan. 2013. A stencil compiler for short-vector SIMD architectures. In ACM International Conference on Supercomputing. ACM, New York, NY. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. F. Irigoin and R. Triolet. 1988. Supernode partitioning. In ACM SIGPLAN Principles of Programming Languages. ACM, New York, NY, 319--329. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. W. Kelly and W. Pugh. 1995. A Unifying Framework for Iteration Reordering Transformations. Technical Report CS-TR-3430. Department of Computer Science, University of Maryland, College Park, MD. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. W. Kelly, W. Pugh, and E. Rosser. 1995. Code generation for multiple mappings. In International Symposium on the Frontiers of Massively Parallel Computation. 332--341. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Martin Kong, Richard Veras, Kevin Stock, Franz Franchetti, Louis-Noël Pouchet, and P. Sadayappan. 2013. When polyhedral transformations meet SIMD code generation. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’13). ACM, New York, NY. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. A. Leung, N. T. Vasilache, B. Meister, and R. A. Lethin. 2010. Methods and apparatus for joint parallelism and locality optimization in source code compilation. (June 2010). US Patent 2010/0070956 A1. Filed September 16, 2009, Issued March 18, 2010.Google ScholarGoogle Scholar
  33. Wei Li and Keshav Pingali. 1994. A singular loop transformation framework based on non-singular matrices. International Journal of Parallel Programming 22, 2, 183--205. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. A. Lim, Gerald I. Cheong, and Monica S. Lam. 1999. An affine partitioning algorithm to maximize parallelism and minimize communication. In ACM International Conference on Supercomputing (ICS’99). 228--237. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. A. Lim and Monica S. Lam. 1997. Maximizing parallelism and minimizing synchronization with affine transforms. In ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. 201--214. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. A. Lim and Monica S. Lam. 1998. Maximizing parallelism and minimizing synchronization with affine partitions. Parallel Computing 24, 3--4, 445--475. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Robin Lougee-Heimer. 2003. The common optimization interface for operations research. IBM Journal of Research and Development 47, 1, 57--66. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Benoit Meister, Allen Leung, Nicolas Vasilache, David Wohlford, Cédric Bastoul, and Richard Lethin. 2009. Productivity via automatic code generation for PGAS platforms with the R-stream compiler. In Workshop on Asynchrony in the PGAS Programming Model.Google ScholarGoogle Scholar
  39. Benoît Meister, Nicolas Vasilache, David Wohlford, Muthu Baskaran, Allen Leung, and Richard Lethin. 2011. R-stream compiler. In Encyclopedia of Parallel Computing. 1756--1765.Google ScholarGoogle Scholar
  40. Ravi Teja Mullapudi, Vinay Vasista, and Uday Bondhugula. 2015. PolyMage: Automatic optimization for image processing pipelines. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’15). Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Nissa Osheim, Michelle Mills Strout, Dave Rostron, and Sanjay Rajopadhye. 2008. Smashing: Folding space to tile through time. In Workshop on Languages and Compilers for Parallel Computing (LCPC’08). Springer, Berlin, 80--93. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Palabos. 2012. Palabos. Retrieved March 2, 2016 from http://www.palabos.org/.Google ScholarGoogle Scholar
  43. Irshad Pananilath, Aravind Acharya, Vinay Vasista, and Uday Bondhugula. 2015. An optimizing code generator for a class of Lattice-Boltzmann computations. ACM Transactions on Architecture and Code Optimization. 12, 2, 14:1--14:23. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Pluto 2008. PLUTO: An automatic parallelizer and locality optimizer for affine loop nests. http://pluto-compiler.sourceforge.net.Google ScholarGoogle Scholar
  45. Polybench 2010. Polybench suite. Retrieved March 2, 2016 from http://polybench.sourceforge.net.Google ScholarGoogle Scholar
  46. L.-N. Pouchet, Cédric Bastoul, John Cavazos, and Albert Cohen. 2008. Iterative optimization in the polyhedral model: Part II, multidimensional time. In ACM SIGPLAN Symposium on Programming Languages Design and Implementation (PLDI). Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. Louis-Noël Pouchet, Uday Bondhugula, Cédric Bastoul, Albert Cohen, J. Ramanujam, P. Sadayappan, and Nicolas Vasilache. 2011. Loop transformations: Convexity, pruning and optimization. In ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages (POPL’11). Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. Fabien Quilleré, Sanjay V. Rajopadhye, and Doran Wilde. 2000. Generation of efficient nested loops from polyhedra. International Journal of Parallel Programming 28, 5, 469--498. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. Gabe Rudy, Malik Murtaza Khan, Mary Hall, Chun Chen, and Jacqueline Chame. 2011. A programming language interface to describe transformations and code generation. In Proceedings of the 23rd International Conference on Languages and Compilers for Parallel Computing (LCPC’10). Springer, Berlin, 136--150. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. Robert Sadourny. 1975. The dynamics of finite-difference models of the shallow-water equations. Journal of the Atmospheric Sciences 32, 4.Google ScholarGoogle ScholarCross RefCross Ref
  51. Alexander Schrijver. 1986. Theory of Linear and Integer Programming. John Wiley & Sons, Hoboken, NJ. Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. Robert Strzodka, Mohammed Shaheen, Dawid Pajak, and Hans-Peter Seidel. 2011. Cache accurate time skewing in iterative stencil computations. In International Conference on Parallel Processing (ICPP’11). 571--581. Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. Paul N. Swarztrauber. 2000. 171.swim SPEC CPU2000 Benchmark Description File. Standard Performance Evaluation Corporation. Retrieved March 2, 2016 from http://www.spec.org/cpu2000/CFP2000/171.swim/docs/171.swim.html.Google ScholarGoogle Scholar
  54. Yuan Tang, Rezaul Alam Chowdhury, Bradley C. Kuszmaul, Chi-Keung Luk, and Charles E. Leiserson. 2011. The Pochoir stencil compiler. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA’11). New York, NY, 117--128. Google ScholarGoogle ScholarDigital LibraryDigital Library
  55. Nicolas Vasilache. 2007. Scalable Program Optimization Techniques in the Polyhedral Model. Ph.D. Dissertation. Université de Paris-Sud, INRIA Futurs, Orsay, France.Google ScholarGoogle Scholar
  56. Nicolas Vasilache, Benoit Meister, Muthu Baskaran, and Richard Lethin. 2012. Joint scheduling and layout optimization to enable multi-level vectorization. In International Workshop on Polyhedral Compilation Techniques (IMPACT’12).Google ScholarGoogle Scholar
  57. Sven Verdoolaege. 2010. ISL: An integer set library for the polyhedral model. In Mathematical Software—ICMS 2010, Komei Fukuda, Joris Hoeven, Michael Joswig, and Nobuki Takayama (Eds.). Lecture Series in Computer Science, Vol. 6327. Springer, Berlin, 299--302. Google ScholarGoogle ScholarDigital LibraryDigital Library
  58. Sven Verdoolaege, Juan Carlos Juega, Albert Cohen, José Ignacio Gómez, Christian Tenllado, and Francky Catthoor. 2013. Polyhedral parallel code generation for CUDA. ACM Transactions on Architecture and Code Optimization (TACO) 9, 4, 54:1--54:23. Google ScholarGoogle ScholarDigital LibraryDigital Library
  59. M. Wolf and Monica S. Lam. 1991. A data locality optimizing algorithm. In ACM SIGPLAN Symposium on Programming Languages Design and Implementation. ACM, New York, NY, 30--44. Google ScholarGoogle ScholarDigital LibraryDigital Library
  60. D. Wonnacott. 2000. Using time skewing to eliminate idle time due to memory bandwidth and network limitations. In IPDPS. 171--180. Google ScholarGoogle ScholarDigital LibraryDigital Library
  61. Yoav Yaacoby and Peter R. Cappello. 1995. Converting affine recurrence equations to quasi-uniform recurrence equations. VLSI Signal Processing 11, 1--2, 113--131. Google ScholarGoogle ScholarDigital LibraryDigital Library
  62. Tomofumi Yuki. 2014. Understanding Polybench/C 3.2 kernels. In International Workshop on Polyhedral Compilation Techniques (IMPACT’14).Google ScholarGoogle Scholar
  63. Tomofumi Yuki, Gautam Gupta, DaeGon Kim, Tanveer Pathan, and Sanjay Rajopadhye. 2013. AlphaZ: A system for design space exploration in the polyhedral model. In Languages and Compilers for Parallel Computing. Lecture Notes in Computer Science, Vol. 7760. Springer, Berlin, 17--31.Google ScholarGoogle Scholar
  64. Yutao Zhong, Xipeng Shen, and Chen Ding. 2009. Program locality analysis using reuse distance. ACM Transactions on Programming Languages and Systems 31, 6, Article 20, 39 pages. Google ScholarGoogle ScholarDigital LibraryDigital Library
  65. Qisu Zou and Xiaoyi He. 1997. On pressure and velocity boundary conditions for the Lattice Boltzmann BGK model. Physics of Fluids (1994-present) 9, 6, 1591--1598.Google ScholarGoogle Scholar

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in

Full Access

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader
About Cookies On This Site

We use cookies to ensure that we give you the best experience on our website.

Learn more

Got it!