Abstract
Memory-intensive workloads are becoming increasingly popular on general-purpose graphics processing units (GPGPUs) and impose great challenges on GPGPU memory subsystem design. Meanwhile, with the recent development of non-volatile memory (NVM) technologies, hybrid main memory combining DRAM and NVM simultaneously achieves high performance, low power, and high density, making it a promising main memory design for GPGPUs. In this article, we explore shared last-level cache management for GPGPUs with consideration of the underlying hybrid main memory. To improve overall memory subsystem performance, we exploit both the asymmetric read/write latency of the hybrid main memory architecture and the memory coalescing feature of GPGPUs. In particular, to reduce the average cost of L2 cache misses, we prioritize cache blocks from DRAM or NVM based on the observation that operations to the NVM part of main memory have a large impact on system performance. The cache management scheme further integrates GPU memory coalescing and cache bypassing techniques to improve overall system performance. To minimize the impact of memory divergence among simultaneously executed groups of threads, we also propose a hybrid-main-memory- and warp-aware memory scheduling mechanism for GPGPUs. Experimental results show that, in the context of a hybrid main memory system, our proposed L2 cache management policy and memory scheduling mechanism improve performance by 15.69% on average (up to 29%) for memory-intensive benchmarks and reduce memory subsystem energy by 21.27% on average.
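To make the miss-cost-aware idea concrete, the following is a minimal, hypothetical sketch of a victim-selection policy for a shared LLC over DRAM/NVM hybrid memory. It is not the paper's actual design: the cost weights, names, and the simple cost-ordered eviction rule are all assumptions, chosen only to illustrate that dirty NVM-backed blocks are the most expensive to evict under asymmetric read/write latencies.

```python
# Hypothetical hybrid-memory-aware LLC victim selection (illustrative sketch).
# Cost weights are assumed, not taken from the paper; they encode the intuition
# that NVM writes are far more expensive than DRAM accesses.

DRAM, NVM = "DRAM", "NVM"

# Assumed relative costs of evicting a block, keyed by (backing memory, dirty?).
EVICT_COST = {
    (DRAM, False): 1,  # clean block, cheap refetch from DRAM
    (NVM,  False): 2,  # clean block, slower refetch from NVM
    (DRAM, True):  3,  # dirty block, write back to DRAM then refetch
    (NVM,  True):  6,  # dirty block, costly NVM write-back plus refetch
}

class CacheBlock:
    def __init__(self, tag, backing, dirty=False):
        self.tag = tag
        self.backing = backing  # which part of hybrid main memory holds the block
        self.dirty = dirty

def pick_victim(cache_set):
    """Evict the block whose replacement is cheapest for the hybrid memory."""
    return min(cache_set, key=lambda b: EVICT_COST[(b.backing, b.dirty)])

# Example: a clean DRAM-backed block is preferred over any NVM-backed block.
blocks = [CacheBlock(0x1, NVM, dirty=True),
          CacheBlock(0x2, DRAM, dirty=False),
          CacheBlock(0x3, NVM, dirty=False)]
victim = pick_victim(blocks)
```

A real policy would combine such a cost estimate with recency or re-reference information rather than using cost alone; this sketch only shows how DRAM/NVM asymmetry can bias eviction.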
Shared Last-Level Cache Management and Memory Scheduling for GPGPUs with Hybrid Main Memory