Abstract
Stencil computations are an important class of compute- and data-intensive programs that occur widely in scientific and engineering applications. A number of tools use sophisticated tiling, parallelization, and memory mapping strategies, and generate code that relies on vendor-supplied compilers. This code has a number of parameters, such as tile sizes, that are then tuned via empirical exploration.
We develop a model that guides such a choice. Our model is a simple set of analytical functions that predict the execution time of the generated code. It is deliberately optimistic, since we are targeting modeling and parameter selections that yield highly tuned codes.
We experimentally validate the model on a number of 2D and 3D stencil codes, and show that the root mean square error in the execution time is less than 10% for the subset of the codes that achieve performance within 20% of the best. Furthermore, using our model, we are able to predict tile sizes that achieve a further improvement of 9% on average.
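The validation metric described above can be illustrated with a minimal sketch. It is not the authors' evaluation script: the function name `rms_relative_error` and its parameters are hypothetical, the error is assumed to be relative to the measured time, and "within 20% of the best" is interpreted here as a measured time of at most 1.2 times the fastest measured time.

```python
import math


def rms_relative_error(predicted, measured, within=0.20):
    """RMS of the relative prediction error over near-best configurations.

    Assumption (not from the paper): a configuration is "near-best" when its
    measured time is at most (1 + within) times the fastest measured time.
    """
    if len(predicted) != len(measured) or not measured:
        raise ValueError("need two non-empty sequences of equal length")

    best = min(measured)
    # Keep only the (predicted, measured) pairs of near-best configurations.
    kept = [(p, m) for p, m in zip(predicted, measured)
            if m <= best * (1.0 + within)]

    # Relative error of each prediction with respect to the measured time.
    squared = [((p - m) / m) ** 2 for p, m in kept]
    return math.sqrt(sum(squared) / len(squared))
```

Calling `rms_relative_error(predicted_times, measured_times)` with one entry per tile-size configuration returns a fraction; a value below 0.10 would correspond to the "less than 10%" figure under the assumptions stated above.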
Simple, Accurate, Analytical Time Modeling and Optimal Tile Size Selection for GPGPU Stencils
PPoPP '17: Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming