
Thread Batching for High-performance Energy-efficient GPU Memory Design

Published: 16 December 2019

Abstract

Massive multi-threading in GPUs imposes tremendous pressure on memory subsystems. Because thread-level parallelism in GPUs grows rapidly while peak memory bandwidth improves only slowly, memory has become a bottleneck for GPU performance and energy efficiency. In this article, we propose an integrated architectural scheme that optimizes memory accesses and thereby boosts the performance and energy efficiency of GPUs. First, we propose thread batch enabled memory partitioning (TEMP), which improves GPU memory access parallelism. In particular, TEMP groups multiple thread blocks that share the same set of pages into a thread batch and applies a page coloring mechanism to bind each streaming multiprocessor (SM) to dedicated memory banks. TEMP then dispatches each thread batch to an SM to ensure highly parallel memory-access streaming from the different thread blocks. Second, a thread batch-aware scheduling (TBAS) scheme is introduced to improve GPU memory access locality and to reduce contention on memory controllers and interconnection networks. Experimental results show that integrating TEMP and TBAS achieves up to 10.3% performance improvement and 11.3% DRAM energy reduction across diverse GPU applications. We also evaluate the performance interference of mixed CPU+GPU workloads running on a heterogeneous system that employs our proposed schemes. Our results show that a simple solution can effectively ensure the efficient execution of both GPU and CPU applications.
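The core TEMP idea described above (group thread blocks by shared page set, then use page-coloring bits to bind each batch to one SM and its dedicated banks) can be sketched as follows. This is a minimal illustrative model, not the paper's implementation: the function names, the page-to-color mapping, and the use of a representative page per batch are all assumptions made for the sketch.

```python
# Illustrative sketch of TEMP-style thread batching (hypothetical names,
# not from the paper). Thread blocks touching an identical page set form
# one batch; a simple page-coloring function maps that page set's color
# to one SM, so each SM's batches hit only its dedicated DRAM banks.

from collections import defaultdict


def form_thread_batches(block_pages):
    """Group thread-block IDs that share an identical page set into batches."""
    batches = defaultdict(list)
    for block_id, pages in block_pages.items():
        batches[frozenset(pages)].append(block_id)
    return list(batches.values())


def color_of(page, num_sms):
    """Page coloring (assumed scheme): low bits of the page number pick an SM."""
    return page % num_sms


def dispatch(batches, block_pages, num_sms):
    """Bind each batch to the SM whose color matches the batch's pages."""
    assignment = {}
    for batch in batches:
        pages = block_pages[batch[0]]        # every block in a batch shares this set
        sm = color_of(min(pages), num_sms)   # a representative page picks the SM
        for block_id in batch:
            assignment[block_id] = sm
    return assignment
```

For example, if blocks 0 and 1 both touch pages {4, 8} while block 2 touches page {5}, blocks 0 and 1 form one batch and, with four SMs under this coloring, land together on SM 0 while block 2 goes to SM 1, so the two page sets never contend for the same banks.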



      • Published in

        ACM Journal on Emerging Technologies in Computing Systems, Volume 15, Issue 4
        Special Issue on HALO for Energy-Constrained On-Chip Machine Learning, Part 2 and Regular Papers
        October 2019, 226 pages
        ISSN: 1550-4832
        EISSN: 1550-4840
        DOI: 10.1145/3365594
        • Editor: Ramesh Karri

        Copyright © 2019 ACM

        Publisher: Association for Computing Machinery, New York, NY, United States

        Publication History

        • Published: 16 December 2019
        • Accepted: 1 May 2019
        • Revised: 1 April 2019
        • Received: 1 January 2019


        Qualifiers

        • research-article
        • Research
        • Refereed
