
Data Convection: A GPU-Driven Case Study for Thermal-Aware Data Placement in 3D DRAMs

Published: 28 February 2022

Abstract

Stacked DRAMs have been studied, evaluated in multiple scenarios, and even productized in the last decade. The large available bandwidth they offer makes them an attractive choice, particularly in high-performance computing (HPC) environments. Consequently, many prior research efforts have studied and evaluated 3D stacked DRAM-based designs. Despite offering high bandwidth, stacked DRAMs are severely constrained by the overall memory capacity offered. In this paper, we study and evaluate integrating stacked DRAM on top of a GPU in a 3D manner, which, in tandem with the 2.5D stacked DRAM, increases the capacity and the bandwidth without increasing the package size. This integration of 3D stacked DRAMs aids in satisfying the capacity requirements of emerging workloads like deep learning. Though this vertical 3D integration of stacked DRAMs also increases the total available bandwidth, we observe that the bandwidth offered by these 3D stacked DRAMs is severely limited by the heat generated on the GPU. Based on our experiments on a cycle-level simulator, we make a key observation that the sections of the 3D stacked DRAM that are closer to the GPU have lower retention times than the farther layers of stacked DRAM. These thermal-induced variable retention times cause certain sections of 3D stacked DRAM to be refreshed more frequently than others, thereby resulting in thermal-induced NUMA paradigms. To alleviate such thermal-induced NUMA behavior, we propose and experimentally evaluate three different incarnations of Data Convection, i.e., Intra-layer, Inter-layer, and Intra + Inter-layer, that aim at placing the most frequently accessed data in a thermal-induced retention-aware fashion, taking into account both bank-level and channel-level parallelism.
Our evaluations on a cycle-level GPU simulator indicate that, in a multi-application scenario, our Intra-layer, Inter-layer and Intra + Inter-layer algorithms improve the overall performance by 1.8%, 11.7%, and 14.4%, respectively, over a baseline that already encompasses 3D+2.5D stacked DRAMs.
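
The general idea of retention-aware placement can be illustrated with a small sketch. This is not the paper's Intra-layer or Inter-layer algorithm, whose details the abstract does not give; it is a hypothetical greedy policy written only to make the intuition concrete. The `Bank` type, the `refresh_overhead` model, the 350 ns refresh latency, the 8192 refresh commands per retention window, and the 32 ms and 64 ms retention times are all illustrative assumptions. The sketch ranks banks by the fraction of time they spend refreshing, then assigns the most frequently accessed pages to the least refresh-burdened banks, rotating across channels and banks within a tier so that hot pages do not pile up behind one channel.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Bank:
    layer: int           # assumption: 0 is the DRAM layer closest to the GPU
    channel: int
    bank: int
    retention_ms: float  # assumed retention time; shorter for hotter layers


def refresh_overhead(b, t_rfc_ns=350.0, refreshes_per_window=8192):
    """Fraction of time a bank is busy refreshing (simplified model).

    One refresh command is issued every retention/refreshes_per_window,
    and each command blocks the bank for t_rfc_ns. Both defaults are
    illustrative values, not figures from the paper.
    """
    t_refi_ns = b.retention_ms * 1e6 / refreshes_per_window
    return t_rfc_ns / t_refi_ns


def place_pages(access_counts, banks, pages_per_bank):
    """Hypothetical greedy placement: hottest pages to coolest banks."""
    # Pages ordered from most to least frequently accessed.
    hot_first = sorted(access_counts, key=access_counts.get, reverse=True)
    # Banks ordered by refresh overhead; within equal overhead, order by
    # bank then channel so consecutive slots alternate channels.
    ranked = sorted(banks, key=lambda b: (refresh_overhead(b), b.bank, b.channel))
    slots = []
    for _, tier in groupby(ranked, key=refresh_overhead):
        tier = list(tier)
        # Fill a tier round-robin before spilling into a hotter tier.
        for _ in range(pages_per_bank):
            slots.extend(tier)
    if len(hot_first) > len(slots):
        raise ValueError("not enough DRAM capacity for all pages")
    return dict(zip(hot_first, slots))


# Two layers, two channels each; the layer nearer the GPU is assumed
# to have half the retention time of the farther layer.
banks = [Bank(0, 0, 0, 32.0), Bank(0, 1, 0, 32.0),
         Bank(1, 0, 0, 64.0), Bank(1, 1, 0, 64.0)]
counts = {"A": 100, "B": 80, "C": 10, "D": 5}
placement = place_pages(counts, banks, pages_per_bank=1)
for page, b in placement.items():
    print(page, "-> layer", b.layer, "channel", b.channel)
```

Under this model, halving the retention time doubles the refresh overhead of a bank, which is why the two hottest pages land in the farther layer, on different channels. A real policy would also have to account for migration cost and for access patterns changing over time, which this sketch ignores.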

