
Simple, Accurate, Analytical Time Modeling and Optimal Tile Size Selection for GPGPU Stencils

Published: 26 January 2017

Abstract

Stencil computations are an important class of compute- and data-intensive programs that occur widely in scientific and engineering applications. A number of tools use sophisticated tiling, parallelization, and memory mapping strategies, and generate code that relies on vendor-supplied compilers. This code has a number of parameters, such as tile sizes, that are then tuned via empirical exploration.
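To make the tile-size parameter concrete, here is a minimal pure-Python sketch of a 2D 5-point Jacobi stencil swept tile by tile. The function and its tile-size arguments `tx`, `ty` are illustrative only; they stand in for the parameters of the generated GPU code described above, not for that code itself.

```python
def jacobi_step_tiled(grid, tx, ty):
    """One Jacobi sweep over the interior of a square grid, tile by tile.

    tx, ty are the tile sizes -- exactly the kind of parameter an
    auto-tuner (or the paper's model) must choose.
    """
    n = len(grid)
    out = [row[:] for row in grid]
    # Visit the interior points in (tx x ty) tiles.
    for ti in range(1, n - 1, tx):
        for tj in range(1, n - 1, ty):
            for i in range(ti, min(ti + tx, n - 1)):
                for j in range(tj, min(tj + ty, n - 1)):
                    out[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                        + grid[i][j - 1] + grid[i][j + 1])
    return out
```

On a GPU, each tile would map to a thread block, and the tile sizes trade off parallelism, shared-memory footprint, and redundant halo loads; the loop order here only shows the access pattern.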

We develop a model that guides such a choice. Our model is a simple set of analytical functions that predict the execution time of the generated code. It is deliberately optimistic: the optimistic assumptions are intended to model the highly tuned codes that we target when selecting parameters such as tile sizes.
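The flavor of such an approach can be sketched as follows. The formula and constants below are invented for illustration and are not the paper's actual model: a per-tile cost charges optimistic compute time plus optimistic memory traffic (tile plus halo), and the best tile size is simply the one minimizing the predicted total.

```python
def predicted_time(n, tx, ty, t_flop=1e-9, t_byte=1e-10):
    """Hypothetical optimistic time model for an n x n 5-point stencil.

    All constants are illustrative assumptions: 5 flops per point,
    8-byte values, a one-point halo, and fixed per-flop/per-byte costs.
    """
    tiles = ((n + tx - 1) // tx) * ((n + ty - 1) // ty)
    compute = 5 * tx * ty * t_flop               # flops in one tile
    traffic = 8 * (tx + 2) * (ty + 2) * t_byte   # tile + halo loads, in bytes
    return tiles * (compute + traffic)

def best_tile(n, candidates):
    """Pick the candidate tile size with the smallest predicted time."""
    return min(candidates, key=lambda s: predicted_time(n, *s))
```

In this toy model the compute term is identical for every tile size, so the halo-traffic term alone favors larger tiles; a realistic model (like the paper's) must also account for hardware limits such as shared-memory capacity and occupancy, which cap how large a tile can profitably be.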

We experimentally validate the model on a number of 2D and 3D stencil codes, and show that the root mean square error in the predicted execution time is less than 10% for the subset of the codes that achieve performance within 20% of the best. Furthermore, using our model, we are able to predict tile sizes that achieve a further improvement of 9% on average.

