Abstract
Massive multi-threading in GPUs imposes tremendous pressure on memory subsystems. Because thread-level parallelism in GPUs grows rapidly while peak memory bandwidth improves only slowly, memory has become a bottleneck for GPU performance and energy efficiency. In this article, we propose an integrated architectural scheme that optimizes memory accesses and thereby boosts the performance and energy efficiency of GPUs. First, we propose thread batch enabled memory partitioning (TEMP) to improve GPU memory access parallelism. Specifically, TEMP groups multiple thread blocks that share the same set of pages into a thread batch and applies a page coloring mechanism to bind each streaming multiprocessor (SM) to dedicated memory banks. TEMP then dispatches each thread batch to an SM to ensure highly parallel memory-access streaming from the different thread blocks. Second, a thread batch-aware scheduling (TBAS) scheme is introduced to improve GPU memory access locality and to reduce contention on memory controllers and interconnection networks. Experimental results show that the integration of TEMP and TBAS achieves up to 10.3% performance improvement and 11.3% DRAM energy reduction across diverse GPU applications. We also evaluate the performance interference of mixed CPU+GPU workloads running on a heterogeneous system that employs the proposed schemes. Our results show that a simple solution can effectively ensure the efficient execution of both GPU and CPU applications.
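The TEMP mechanism described above can be illustrated with a minimal sketch: thread blocks that touch an identical page set are grouped into a thread batch, each batch is dispatched to one SM, and that batch's pages are colored so they map only onto the DRAM banks bound to that SM. All names, partition sizes, and the page-to-bank function below are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of thread batching + page coloring (TEMP-style).
# NUM_SMS, BANKS_PER_SM, and the modulo bank-coloring rule are assumed
# for illustration only.
from collections import defaultdict

NUM_SMS = 4
BANKS_PER_SM = 2  # each SM owns a private partition of DRAM banks

def batch_thread_blocks(block_pages):
    """Group thread-block ids that share the same page set into batches."""
    batches = defaultdict(list)
    for block_id, pages in block_pages.items():
        batches[frozenset(pages)].append(block_id)
    return list(batches.items())  # [(page_set, [block_ids]), ...]

def color_pages(page_set, sm_id):
    """Map each page of a batch onto the banks owned by its SM."""
    base = sm_id * BANKS_PER_SM
    return {page: base + (page % BANKS_PER_SM) for page in page_set}

# Example: 4 thread blocks with two distinct page sets -> two thread batches.
block_pages = {0: {10, 11}, 1: {10, 11}, 2: {20, 21}, 3: {20, 21}}
batches = batch_thread_blocks(block_pages)
for sm_id, (pages, blocks) in enumerate(batches):
    mapping = color_pages(pages, sm_id)
    print(f"SM{sm_id}: blocks={sorted(blocks)} page->bank={mapping}")
```

In this toy example, blocks 0 and 1 form one batch bound to SM0's banks and blocks 2 and 3 form a second batch bound to SM1's banks, so each SM's memory traffic stays within its own bank partition.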
Thread Batching for High-performance Energy-efficient GPU Memory Design