Abstract
Performance optimization of stencil computations has been widely studied, because such computations occur in many computationally intensive scientific and engineering applications. Compiler frameworks have also been developed that transform sequential stencil codes to optimize data locality and parallelism. However, tiling stencil codes along the time dimension typically requires loop skewing, which causes load imbalance when the tiles are executed in a pipelined parallel fashion. In this paper, we develop an approach for automatic parallelization of stencil codes that explicitly addresses load-balanced execution of tiles. Experimental results demonstrate the effectiveness of the approach.
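The load-imbalance problem named in the abstract can be illustrated with a minimal sketch. It is not the approach developed in the paper. It assumes a 1D three-point stencil with dependence vectors (1, -1), (1, 0), (1, 1) in (time, space), a skew of the form i' = i + t, and a simplified pipeline model in which all tiles on the anti-diagonal i + j of the tile grid can run concurrently. All function names are hypothetical.

```python
from math import ceil


def skew_dependences(deps, factor=1):
    """Apply the skew i' = i + factor * t to (dt, di) dependence vectors."""
    return [(dt, di + factor * dt) for dt, di in deps]


def tiling_is_legal(deps):
    """Rectangular tiling is legal when no dependence component is negative."""
    return all(dt >= 0 and di >= 0 for dt, di in deps)


def wavefront_widths(time_tiles, space_tiles):
    """Count the tiles on each anti-diagonal i + j of the tile grid.

    Assumption: tiles on the same anti-diagonal are mutually independent,
    so the count is the parallelism available at that pipeline step.
    """
    widths = [0] * (time_tiles + space_tiles - 1)
    for i in range(time_tiles):
        for j in range(space_tiles):
            widths[i + j] += 1
    return widths


def pipeline_utilization(time_tiles, space_tiles, processors):
    """Fraction of processor time spent on tiles, assuming unit-cost tiles."""
    widths = wavefront_widths(time_tiles, space_tiles)
    steps = sum(ceil(w / processors) for w in widths)
    return (time_tiles * space_tiles) / (processors * steps)


# Three-point stencil: the (1, -1) dependence blocks rectangular tiling.
stencil_deps = [(1, -1), (1, 0), (1, 1)]
print(tiling_is_legal(stencil_deps))                    # False
print(tiling_is_legal(skew_dependences(stencil_deps)))  # True

# Parallelism ramps up and back down across the pipeline.
print(wavefront_widths(3, 5))                           # [1, 2, 3, 3, 3, 2, 1]
print(pipeline_utilization(4, 4, 4))
```

Under these assumptions, skewing makes the tiling legal, but the first and last wavefronts contain a single tile each, so some processors sit idle during pipeline start-up and drain.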
Effective automatic parallelization of stencil computations
PLDI '07: Proceedings of the 28th ACM SIGPLAN Conference on Programming Language Design and Implementation