ABSTRACT
Compiler transformations can significantly improve data locality for many scientific programs. In this paper, we show that iterative solvers for partial differential equations (PDEs) in three dimensions require new compiler optimizations not needed for 2D codes, since reuse along the third dimension cannot fit in cache for larger problem sizes. Tiling is a program transformation compilers can apply to capture this reuse, but successful application of tiling requires selection of non-conflicting tiles and/or padding array dimensions to eliminate conflicts. We present new algorithms and cost models for selecting tiling shapes and array pads. We explain why tiling is rarely needed for 2D PDE solvers, but can be helpful for 3D stencil codes. Experimental results show tiling 3D codes can reduce miss rates and achieve performance improvements of 17-121 percent for key scientific kernels, including a 27 percent average improvement for the key computational loop nest in the SPEC/NAS benchmark mgrid.
Index Terms
Tiling optimizations for 3D scientific computations