Loop transformations: convexity, pruning and optimization

Abstract
High-level loop transformations are a key instrument in mapping computational kernels to effectively exploit the resources of modern processor architectures. Nevertheless, selecting the composition of loop transformations needed to achieve this remains a significant challenge: current compilers can be off by orders of magnitude in performance compared to hand-optimized programs. To address this fundamental challenge, we first present a convex characterization of all distinct, semantics-preserving, multidimensional affine transformations. We then bring together algebraic, algorithmic, and performance analysis results to design a tractable optimization algorithm over this highly expressive space. Our framework has been implemented and validated experimentally on a representative set of benchmarks running on state-of-the-art multi-core platforms.
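The "semantics-preserving" requirement can be illustrated with a toy one-dimensional legality check. This is a minimal sketch, not the paper's convex characterization: the loop, the sample domain size, and the `is_legal` helper are hypothetical, and legality is tested by brute force rather than with Farkas-style affine reasoning.

```python
# Toy example: for the loop
#     for i in 1..N-1: A[i] = A[i-1] + 1
# iteration i depends on iteration i-1 (a flow dependence).
# A 1-D affine schedule theta(i) = a*i + b preserves semantics only if
# the source of the dependence executes strictly before its target:
#     theta(i) - theta(i-1) >= 1   for every i in the domain.

def is_legal(a, b, n=10):
    """Brute-force legality check of theta(i) = a*i + b on a sample domain."""
    theta = lambda i: a * i + b
    return all(theta(i) - theta(i - 1) >= 1 for i in range(1, n))

print(is_legal(1, 0))   # original loop order: legal (prints True)
print(is_legal(-1, 0))  # loop reversal: violates the dependence (prints False)
```

The paper's contribution is, in effect, a closed-form convex description of all legal choices of such schedule coefficients across multiple statements and dimensions, rather than testing candidates one by one.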
Published as: Loop transformations: convexity, pruning and optimization. In POPL '11: Proceedings of the 38th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages.