Abstract
The freedom to reorder computations involving associative operators has been widely recognized and exploited in designing parallel algorithms and, to a more limited extent, in optimizing compilers.
In this paper, we develop a novel framework that uses the associativity and commutativity of operations in regular loop computations to enhance register reuse. Stencils are an important class of computations to which the framework can be applied to improve performance. We show how stencil operations can be implemented to better exploit register reuse and reduce loads and stores. We develop a multi-dimensional retiming formalism to characterize the space of valid implementations in conjunction with other program transformations. Experimental results demonstrate the effectiveness of the framework on a collection of high-order stencils.
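As a rough illustration of the kind of reordering the abstract describes, the sketch below computes a 1D stencil in two ways. It is not the paper's framework or its retiming formalism. The function names `stencil_gather` and `stencil_scatter`, the weight list `w`, and the small list of partial sums standing in for registers are all assumptions made for this example. The first form reads every neighbour again for each output point. The second relies on the fact that addition is associative and commutative, so the contributions to different outputs can be interleaved freely: each input is read once and added into every in-flight partial sum that needs it.

```python
def stencil_gather(a, w):
    """Reference form: out[i] = sum_j w[j] * a[i + j]."""
    k = len(w)
    out = []
    for i in range(len(a) - k + 1):
        acc = 0
        for j in range(k):
            acc += w[j] * a[i + j]  # k reads of `a` per output point
        out.append(acc)
    return out


def stencil_scatter(a, w):
    """Reordered form: each input is read once and scattered to k partial sums."""
    k = len(w)
    acc = [0] * k  # after step p, acc[j] is the partial sum for output p - j
    out = []
    for p in range(len(a)):
        x = a[p]  # single read of this input element
        # Slide the window: open a fresh partial sum, retire the oldest slot.
        acc = [0] + acc[:-1]
        for j in range(k):
            acc[j] += w[j] * x
        if p >= k - 1:
            # Output p - k + 1 has now received all k of its contributions.
            out.append(acc[k - 1])
    return out


if __name__ == "__main__":
    a = [1, 2, 3, 4, 5, 6]
    w = [1, 10, 100]
    print(stencil_gather(a, w))
    print(stencil_scatter(a, w))
```

In this sketch the list `acc` plays the role of a small set of registers that stays live across iterations, so the number of reads of `a` drops from `k` per output to one per input. With floating-point data, reordering additions can in general change rounding, so a real implementation would need to decide whether that is acceptable.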
A framework for enhancing data reuse via associative reordering
PLDI '14: Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation