
An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness

Published: 20 June 2009

Abstract

GPU architectures are increasingly important in the multi-core era due to their high number of parallel processors. Programming thousands of massively parallel threads is a major challenge for software engineers, but understanding the performance bottlenecks of those parallel programs on GPU architectures well enough to improve application performance is even more difficult. Current approaches rely on programmers to tune their applications by exploring the design space exhaustively, without fully understanding the performance characteristics of their applications.

To provide insight into the performance bottlenecks of parallel applications on GPU architectures, we propose a simple analytical model that estimates the execution time of massively parallel programs. The key component of our model is estimating the number of parallel memory requests (we call this the memory warp parallelism) by considering the number of running threads and the memory bandwidth. Based on the degree of memory warp parallelism, the model estimates the cost of memory requests, and thereby the overall execution time of a program. Comparing the model's output against actual execution times on several GPUs, the geometric mean of the absolute error of our model is 5.4% on micro-benchmarks and 13.3% on GPU computing applications. All the applications are written in the CUDA programming language.
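The idea sketched in the abstract can be illustrated with a toy calculation. The sketch below is a simplified, hypothetical rendering of the core concept, not the paper's actual equations: memory warp parallelism (MWP) is the number of warps whose memory requests can be serviced concurrently, bounded by how many requests fit within one memory latency window, by the bandwidth-limited maximum, and by the number of running warps; the total memory cost is then amortized over that degree of parallelism. All parameter names and values here are invented for illustration.

```python
# Illustrative sketch only: a simplified MWP-style execution-time
# estimate in the spirit of the model described in the abstract.
# Parameter names and values are hypothetical, not from the paper.

def estimate_mwp(mem_latency, departure_delay, peak_bw_mwp, n_warps):
    """MWP is capped by (a) how many requests fit in one memory latency
    window, (b) the bandwidth-limited maximum, and (c) the warp count."""
    latency_bound = mem_latency / departure_delay
    return min(latency_bound, peak_bw_mwp, n_warps)

def estimate_exec_cycles(n_warps, mem_requests_per_warp,
                         mem_latency, departure_delay,
                         compute_cycles_per_warp, peak_bw_mwp):
    """Total cycles ~ compute cost + memory cost, where the memory cost
    of all warps is amortized over MWP concurrent requests."""
    mwp = estimate_mwp(mem_latency, departure_delay, peak_bw_mwp, n_warps)
    mem_cycles = mem_requests_per_warp * mem_latency * (n_warps / mwp)
    comp_cycles = compute_cycles_per_warp * n_warps
    return comp_cycles + mem_cycles

# Example: 24 warps, 4 memory requests per warp, 420-cycle memory
# latency; MWP = min(420/10, 12, 24) = 12, so memory time is halved
# relative to 24 fully serialized warps.
cycles = estimate_exec_cycles(
    n_warps=24, mem_requests_per_warp=4,
    mem_latency=420, departure_delay=10,
    compute_cycles_per_warp=100, peak_bw_mwp=12)
print(cycles)
```

The key qualitative behavior this toy model shares with the paper's approach: once MWP saturates (here at the bandwidth bound of 12), adding more warps no longer hides memory latency, and execution time grows linearly with warp count.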

