Abstract
Stacked DRAMs have been studied, evaluated in multiple scenarios, and even productized in the last decade. The large available bandwidth they offer makes them an attractive choice, particularly in high-performance computing (HPC) environments. Consequently, many prior research efforts have studied and evaluated 3D stacked DRAM-based designs. Despite offering high bandwidth, stacked DRAMs are severely constrained by the overall memory capacity they offer. In this paper, we study and evaluate integrating stacked DRAM on top of a GPU in a 3D manner, which, in tandem with the 2.5D stacked DRAM, increases both capacity and bandwidth without increasing the package size. This integration of 3D stacked DRAMs helps satisfy the capacity requirements of emerging workloads such as deep learning. Although this vertical 3D integration of stacked DRAMs also increases the total available bandwidth, we observe that the bandwidth offered by these 3D stacked DRAMs is severely limited by the heat generated by the GPU. Based on our experiments on a cycle-level simulator, we make a key observation: the sections of the 3D stacked DRAM that are closer to the GPU have lower retention times than the layers of stacked DRAM farther from it. These thermal-induced variable retention times cause certain sections of the 3D stacked DRAM to be refreshed more frequently than others, resulting in thermal-induced NUMA behavior. To alleviate this behavior, we propose and experimentally evaluate three incarnations of Data Convection, i.e., Intra-layer, Inter-layer, and Intra + Inter-layer, which aim to place the most frequently accessed data in a thermal-induced, retention-aware fashion, taking into account both bank-level and channel-level parallelism.
Our evaluations on a cycle-level GPU simulator indicate that, in a multi-application scenario, our Intra-layer, Inter-layer, and Intra + Inter-layer algorithms improve overall performance by 1.8%, 11.7%, and 14.4%, respectively, over a baseline that already encompasses 3D+2.5D stacked DRAMs.
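The abstract does not spell out the Data Convection algorithms, so the following is only a minimal sketch of the general idea of retention-aware placement, not the authors' method. It assumes a hypothetical model in which each layer records its distance from the GPU, layers farther from the GPU are treated as cooler (and therefore refreshed less often), and hot pages are spread round-robin across the banks of a layer to preserve bank-level parallelism. The names `Layer` and `place_pages`, and all parameters, are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    """One DRAM layer in the 3D stack (hypothetical model, not from the paper)."""
    name: str
    distance_from_gpu: int  # 0 = layer directly above the GPU (assumed hottest)
    banks: int              # number of banks in this layer
    pages_per_bank: int     # capacity of each bank, in pages


def place_pages(access_counts, layers):
    """Greedy retention-aware placement (illustrative assumption).

    The most frequently accessed pages are assigned to the layers farthest
    from the GPU first. Within a layer, consecutive pages are assigned to
    consecutive banks (round-robin) so hot pages do not pile up in one bank.

    Returns a dict mapping page id -> (layer name, bank index).
    """
    # Farthest (assumed coolest) layers are filled first.
    ordered_layers = sorted(
        layers, key=lambda layer: layer.distance_from_gpu, reverse=True
    )
    # Hottest pages first; ties are broken by page id for determinism.
    hot_first = sorted(access_counts, key=lambda page: (-access_counts[page], page))

    total_capacity = sum(layer.banks * layer.pages_per_bank for layer in layers)
    if len(hot_first) > total_capacity:
        raise ValueError("not enough DRAM capacity for all pages")

    placement = {}
    next_page = 0
    for layer in ordered_layers:
        for slot in range(layer.banks * layer.pages_per_bank):
            if next_page == len(hot_first):
                return placement
            # slot % banks walks the banks of this layer in round-robin order.
            placement[hot_first[next_page]] = (layer.name, slot % layer.banks)
            next_page += 1
    return placement


if __name__ == "__main__":
    stack = [Layer("near", 0, banks=2, pages_per_bank=1),
             Layer("far", 1, banks=2, pages_per_bank=1)]
    counts = {"a": 100, "b": 90, "c": 5, "d": 1}
    for page, location in place_pages(counts, stack).items():
        print(page, location)
```

A real policy would also need to account for channel-level parallelism, migration cost, and access counts that change over time, none of which this sketch models.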
Data Convection: A GPU-Driven Case Study for Thermal-Aware Data Placement in 3D DRAMs
SIGMETRICS/PERFORMANCE '22: Abstract Proceedings of the 2022 ACM SIGMETRICS/IFIP PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems