Abstract
Data locality and parallelism are critical optimization objectives for performance on modern multi-core machines. Both coarse-grain parallelism (e.g., multi-core) and fine-grain parallelism (e.g., vector SIMD) must be effectively exploited, but despite decades of progress at both ends, current compiler optimization schemes that attempt to address data locality and both kinds of parallelism often fail at one of the three objectives.
We address this problem with a three-step framework that aims to integrate data locality, multi-core parallelism, and SIMD execution of programs. We define the concept of vectorizable codelets, whose properties are tailored to effective SIMD code generation. We leverage a modern high-level transformation framework to restructure a program so that it exposes good ISA-independent vectorizable codelets, exploiting multi-dimensional data reuse. Then, we generate customized ISA-specific code for the codelets, using a collection of lower-level SIMD-focused optimizations.
We demonstrate our approach on a collection of numerical kernels that we automatically tile, parallelize and vectorize, exhibiting significant performance improvements over existing compilers.
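To make the idea of tiling around a vectorizable inner kernel more concrete, the following is a minimal sketch, not the paper's implementation. It assumes a dense matrix-vector product as the kernel, and the names `codelet`, `matvec_tiled`, `VLEN`, and `TILE` are hypothetical and chosen for illustration only. Python is used for readability; an actual code generator would presumably emit C with ISA-specific SIMD instructions for the inner kernel.

```python
# Illustrative sketch only. All names and sizes are assumptions made for
# this example; they are not taken from the paper or its tools.

VLEN = 4  # assumed SIMD width in elements (hypothetical)
TILE = 8  # assumed tile size along the column dimension, a multiple of VLEN


def codelet(a_row, x, j0):
    """Inner kernel with a fixed trip count and unit-stride accesses.

    The loop over `lane` stands in for one vector multiply-add; the final
    sum() stands in for a horizontal reduction of the vector register.
    """
    lanes = [0.0] * VLEN
    for j in range(j0, j0 + TILE, VLEN):
        for lane in range(VLEN):
            lanes[lane] += a_row[j + lane] * x[j + lane]
    return sum(lanes)


def matvec_tiled(a, x):
    """Compute y = a * x with the column loop tiled by TILE."""
    if len(x) % TILE != 0:
        raise ValueError("this sketch assumes len(x) is a multiple of TILE")
    y = [0.0] * len(a)
    # Tile loop: each pass reuses the slice x[jj:jj+TILE] for every row,
    # which is the data-locality part of the restructuring.
    for jj in range(0, len(x), TILE):
        # Rows are independent of each other, so this loop is the natural
        # candidate for coarse-grain (multi-core) parallelism.
        for i in range(len(a)):
            y[i] += codelet(a[i], x, jj)
    return y


def matvec_reference(a, x):
    """Untiled reference used only to check the tiled version."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]
```

The sketch separates the two concerns described above: the outer loops decide tiling and parallelism independently of any instruction set, while the inner kernel has the fixed size and contiguous accesses that make SIMD code generation straightforward.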
When polyhedral transformations meet SIMD code generation
PLDI '13: Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation