
Shared Last-Level Cache Management and Memory Scheduling for GPGPUs with Hybrid Main Memory

Published: 31 July 2018

Abstract

Memory-intensive workloads are becoming increasingly popular on general-purpose graphics processing units (GPGPUs) and pose great challenges to GPGPU memory subsystem design. Meanwhile, with the recent development of non-volatile memory (NVM) technologies, hybrid main memory combining DRAM and NVM achieves high performance, low power, and high density simultaneously, making it a promising main memory design for GPGPUs. In this article, we explore shared last-level cache management for GPGPUs with consideration of the underlying hybrid main memory. To improve overall memory subsystem performance, we exploit both the asymmetric read/write latency of the hybrid main memory architecture and the memory coalescing feature of GPGPUs. In particular, to reduce the average cost of L2 cache misses, we prioritize cache blocks from DRAM or NVM based on the observation that operations to the NVM part of main memory have a large impact on system performance. The cache management scheme further integrates GPU memory coalescing and cache bypassing techniques to improve overall system performance. To minimize the impact of memory divergence among simultaneously executing groups of threads, we propose a hybrid-main-memory- and warp-aware memory scheduling mechanism for GPGPUs. Experimental results show that, in the context of a hybrid main memory system, our proposed L2 cache management policy and memory scheduling mechanism improve performance by 15.69% on average for memory-intensive benchmarks, with a maximum gain of up to 29%, and reduce memory subsystem energy by 21.27% on average.
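To make the core idea concrete, here is a minimal sketch (not the paper's actual policy) of how a last-level cache replacement decision might account for a DRAM/NVM hybrid main memory: because NVM writes are far slower and more energy-hungry than DRAM writes, a dirty block destined for the NVM region is the worst eviction candidate. The `CacheBlock` type and the 10:1 write-cost ratio below are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class CacheBlock:
    tag: int
    dirty: bool    # must be written back to main memory on eviction?
    in_nvm: bool   # does its home location lie in the NVM region?
    lru_age: int   # larger = less recently used


def eviction_cost(blk: CacheBlock) -> int:
    """Rough relative cost of evicting this block.

    Clean blocks need no write-back; a dirty DRAM-backed block costs
    one DRAM write; a dirty NVM-backed block costs an expensive NVM write.
    """
    if not blk.dirty:
        return 0
    return 10 if blk.in_nvm else 1  # hypothetical DRAM:NVM write-cost ratio


def select_victim(candidate_set: list[CacheBlock]) -> CacheBlock:
    # Among the blocks of one cache set, pick the cheapest block to
    # evict, breaking ties in favor of the least recently used one.
    return min(candidate_set, key=lambda b: (eviction_cost(b), -b.lru_age))


if __name__ == "__main__":
    blocks = [
        CacheBlock(tag=0x1, dirty=True, in_nvm=True, lru_age=9),
        CacheBlock(tag=0x2, dirty=True, in_nvm=False, lru_age=5),
        CacheBlock(tag=0x3, dirty=False, in_nvm=True, lru_age=2),
    ]
    print(hex(select_victim(blocks).tag))  # → 0x3 (the clean block wins)
```

A pure-LRU policy would evict the dirty NVM block (`lru_age=9`) and pay the slow NVM write; the cost-aware choice evicts the clean block instead, which is the kind of asymmetry-aware prioritization the abstract describes.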

