Abstract
It has become increasingly difficult to understand the complex interactions between modern applications and main memory, composed of Dynamic Random Access Memory (DRAM) chips. Manufacturers are now selling and proposing many different types of DRAM, with each DRAM type catering to different needs (e.g., high throughput, low power, high memory density). At the same time, memory access patterns of prevalent and emerging applications are rapidly diverging, as these applications manipulate larger data sets in very different ways. As a result, the combined DRAM-workload behavior is often difficult to determine intuitively today, which can hinder memory optimizations in both hardware and software. In this work, we identify important families of workloads, as well as prevalent types of DRAM chips, and rigorously analyze the combined DRAM-workload behavior. To this end, we perform a comprehensive experimental study of the interaction between nine different DRAM types and 115 modern applications and multiprogrammed workloads. We draw 12 key observations from our characterization, enabled in part by our development of new metrics that take into account contention between memory requests due to hardware design. Notably, we find that (1) newer DRAM technologies such as DDR4 and HMC often do not outperform older technologies such as DDR3, due to higher access latencies and, in the case of HMC, also to poor exploitation of locality; (2) there is no single memory type that can effectively cater to all of the components of a heterogeneous system (e.g., GDDR5 significantly outperforms other memories for multimedia acceleration, while HMC significantly outperforms other memories for network acceleration); and (3) there is still a strong need to lower DRAM latency, but unfortunately the current design trend of commodity DRAM is toward higher latencies to obtain other benefits. We hope that the trends we identify can drive optimizations in both hardware and software design.
To aid further study, we open-source our extensively modified simulator, as well as a benchmark suite containing our applications.
- Advanced Micro Devices, Inc., “High Bandwidth Memory (HBM) DRAM,” 2013.Google Scholar
- K. K. Agaram, S. W. Keckler, C. Lin, and K. S. McKinley, “Decomposing Memory Performance: Data Structures and Phases,” in ISMM, 2006.Google Scholar
Digital Library
- J. Ahn, N. Jouppi, C. Kozyrakis, J. Leverich, and R. Schreiber, “Future Scaling of Processor-Memory Interfaces,” in SC, 2009.Google Scholar
- M. A. Z. Alves, C. Villavieja, M. Diener, F. B. Moreira, and P. O. A. Navaux, “SiNUCA: A Validated Micro-Architecture Simulator,” in HPCC/CSS/ICESS, 2015.Google Scholar
Digital Library
- Apache Foundation, “Apache Hadoop,” http://hadoop.apache.org/.Google Scholar
- Apache Foundation, “Apache HTTP Server Project,” http://www.apache.org/.Google Scholar
- R. Ausavarungnirun, K. K. Chang, L. Subramanian, G. H. Loh, and O. Mutlu, “Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems,” in ISCA, 2012.Google Scholar
Digital Library
- A. Bakhoda, G. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, “Analyzing CUDA Workloads Using a Detailed GPU Simulator,” in ISPASS, 2009.Google Scholar
Cross Ref
- P. Balaprakash, D. Buntinas, A. Chan, A. Guha, R. Gupta, S. H. K. Narayanan, A. A. Chien, P. Hovland, and B. Norris, “Exascale Workload Characterization and Architecture Implications,” in ISPASS, 2013.Google Scholar
Cross Ref
- C. Bienia, S. Kumar, J. P. Singh, and K. Li, “The PARSEC Benchmark Suite: Characterization and Architectural Implications,” Princeton Univ. Dept. of Computer Science, Tech. Rep. TR-811-08, 2008.Google Scholar
Digital Library
- N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, “gem5: A Multiple-ISA Full System Simulator with Detailed Memory Model,” CAN, 2011.Google Scholar
- M. Burtscher, R. Nasre, and K. Pingali, “A Quantitative Study of Irregular Programs on GPUs ,” in IISWC, 2012.Google Scholar
Digital Library
- Canonical Ltd., “Ubuntu 14.04 LTS (Trusty Tahr),” http://releases.ubuntu.com/14.04/, 2014.Google Scholar
- Canonical Ltd., “Ubuntu 16.04 LTS (Xenial Xerus),” http://releases.ubuntu.com/16.04/, 2016.Google Scholar
- K. Chandrasekar, S. Goossens, C. Weis, M. Koedam, B. Akesson, N. Wehn, and K. Goossens, “Exploiting Expendable Process-Margins in DRAMs for Run-Time Performance Optimization,” in DATE, 2014.Google Scholar
Digital Library
- K. Chandrasekar, C. Weis, Y. Li, S. Goossens, M. Jung, O. Naji, B. Akesson, N. Wehn, and K. Goossens, “DRAMPower: Open-Source DRAM Power & Energy Estimation Tool,” http://www.drampower.info.Google Scholar
- K. K. Chang, D. Lee, Z. Chishti, A. Alameldeen, C. Wilkerson, Y. Kim, and O. Mutlu, “Improving DRAM Performance by Parallelizing Refreshes With Accesses,” in HPCA, 2014.Google Scholar
Cross Ref
- K. K. Chang, P. J. Nair, S. Ghose, D. Lee, M. K. Qureshi, and O. Mutlu, “Low-Cost Inter-Linked Subarrays (LISA): Enabling Fast Inter-Subarray Data Movement in DRAM,” in HPCA, 2016.Google Scholar
Cross Ref
- K. K. Chang, A. G. Yauglikcci, S. Ghose, A. Agrawal, N. Chatterjee, A. Kashyap, D. Lee, M. O'Connor, H. Hassan, and O. Mutlu, “Understanding Reduced-Voltage Operation in Modern DRAM Devices: Experimental Characterization, Analysis, and Mechanisms,” in SIGMETRICS, 2017.Google Scholar
Digital Library
- K. K. Chang, “Understanding and Improving the Latency of DRAM-Based Memory Systems,” Ph.D. dissertation, Carnegie Mellon Univ., 2017.Google Scholar
- K. K. Chang, A. Kashyap, H. Hassan, S. Ghose, K. Hsieh, D. Lee, T. Li, G. Pekhimenko, S. Khan, and O. Mutlu, “Understanding Latency Variation in Modern DRAM Chips: Experimental Characterization, Analysis, and Optimization,” in SIGMETRICS, 2016.Google Scholar
Digital Library
- M. J. Charney and T. R. Puzak, “Prefetching and Memory System Behavior of the SPEC95 Benchmark Suite,” IBM JRD, 1997.Google Scholar
Digital Library
- N. Chatterjee, M. O'Connor, G. H. Loh, N. Jayasena, and R. Balasubramonian, “Managing DRAM Latency Divergence in Irregular GPGPU Applications,” in SC, 2014.Google Scholar
- S. Che, M. Boyer, J. Meng, D. Tarjan, J. Sheaffer, S.-H. Lee, and K. Skadron, “Rodinia: A Benchmark Suite for Heterogeneous Computing,” in IISWC, 2009.Google Scholar
Digital Library
- J. Choi, W. Shin, J. Jang, J. Suh, Y. Kwon, Y. Moon, and L.-S. Kim, “Multiple Clone Row DRAM: A Low Latency and Area Optimized DRAM,” in ISCA, 2015.Google Scholar
Digital Library
- Y. Chou, B. Fahs, and S. Abraham, “Microarchitecture Optimizations for Exploiting Memory-Level Parallelism,” in ISCA, 2004.Google Scholar
Digital Library
- B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears, “Benchmarking Cloud Serving Systems with YCSB,” in SoCC, 2010.Google Scholar
Digital Library
- V. Cuppu, B. Jacob, B. Davis, and T. Mudge, “A Performance Comparison of Contemporary DRAM Architectures,” in ISCA, 1999.Google Scholar
- V. Cuppu, B. Jacob, B. Davis, and T. Mudge, “High-Performance DRAMs in Workstation Environments,” in IEEE Transactions on Computers, 2001.Google Scholar
- V. Cuppu and B. Jacob, “Concurrency, Latency, or System Overhead: Which Has the Largest Impact on Uniprocessor DRAM-System Performance?” in ISCA, 2001.Google Scholar
- A. Das, H. Hassan, and O. Mutlu, “VRL-DRAM: Improving DRAM Performance via Variable Refresh Latency,” in DAC, 2018.Google Scholar
Digital Library
- R. Das, O. Mutlu, T. Moscibroda, and C. Das, “Application-Aware Prioritization Mechanisms for On-Chip Networks,” in MICRO, 2009.Google Scholar
Digital Library
- H. David, C. Fallin, E. Gorbatov, U. R. Hanebutte, and O. Mutlu, “Memory Power Management via Dynamic Voltage/Frequency Scaling,” in ICAC, 2011.Google Scholar
Digital Library
- J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” in OSDI, 2004.Google Scholar
Digital Library
- R. Desikan, D. Burger, and S. W. Keckler, “Measuring Experimental Error in Microprocessor Simulation,” in ISCA, 2001.Google Scholar
- D. E. Difallah, A. Pavlo, C. Curino, and P. Cudre-Mauroux, “OLTP-Bench: An Extensible Testbed for Benchmarking Relational Databases,” in VLDB, 2004.Google Scholar
- X. Dong, C. Xu, Y. Xie, and N. P. Jouppi, “NVSim: A Circuit-Level Performance, Energy, and Area Model for Emerging Nonvolatile Memory,” TCAD, 2012.Google Scholar
- Dormando, “Memcached: High-Performance Distributed Memory Object Caching System,” http://memcached.org/.Google Scholar
- E. Ebrahimi, C. J. Lee, O. Mutlu, and Y. N. Patt, “Prefetch-Aware Shared Resource Management for Multi-Core Systems,” in ISCA, 2011.Google Scholar
- E. Ebrahimi, R. Miftakhutdinov, C. Fallin, C. J. Lee, J. A. Joao, O. Mutlu, and Y. N. Patt, “Parallel Application Memory Scheduling,” in MICRO, 2011.Google Scholar
Digital Library
- F. A. Endo, D. Coroussé, and H.-P. Charles, “Micro-Architectural Simulation of In-Order and Out-of-Order ARM Microprocessors with gem5,” in SAMOS, 2014.Google Scholar
- S. Eyerman and L. Eeckhout, “System-Level Performance Metrics for Multiprogram Workloads ,” IEEE Micro, 2008.Google Scholar
Digital Library
- J. Fritts and B. Mangione-Smith, “MediaBench II - Technology, Status, and Cooperation,” in The Workshop on Media and Stream Processors, 2002.Google Scholar
- S. Ghose, H. Lee, and J. F. Mart'inez, “Improving Memory Scheduling via Processor-Side Load Criticality Information,” in ISCA, 2013.Google Scholar
- S. Ghose, A. G. Yauglikcci, R. Gupta, D. Lee, K. Kudrolli, W. X. Liu, H. Hassan, K. K. Chang, N. Chatterjee, A. Agrawal, M. O'Connor, and O. Mutlu, “What Your DRAM Power Models Are Not Telling You: Lessons from a Detailed Experimental Study,” in SIGMETRICS, 2018.Google Scholar
Digital Library
- B. Giridhar, M. Cieslak, D. Duggal, R. Dreslinski, H. Chen, R. Patti, B. Hold, C. Chakrabarti, T. Mudge, and D. Blaauw, “Exploring DRAM Organizations for Energy-Efficient and Resilient Exascale Memories,” in SC, 2013.Google Scholar
- A. Glew, “MLP Yes! ILP No! Memory Level Parallelism, or Why I No Longer Care About Instruction Level Parallelism,” in ASPLOS WACI, 1998.Google Scholar
- M. D. Gomony, C. Weis, B. Akesson, N. Wehn, and K. Goossens, “DRAM Selection and Configuration for Real-Time Mobile Systems,” in DATE, 2012.Google Scholar
- G. Hamerly, E. Perelman, J. Lau, and B. Calder, “Simpoint 3.0: Faster and More Flexible Program Phase Analysis,” JILP, 2005.Google Scholar
- H. Hassan, M. Patel, J. S. Kim, A. G. Yauglikcci, N. Vijaykumar, N. M. Ghiasi, S. Ghose, and O. Mutlu, “CROW: A Low-Cost Substrate for Improving DRAM Performance, Energy Efficiency, and Reliability,” in ISCA, 2019.Google Scholar
- H. Hassan, N. Vijaykumar, S. Khan, S. Ghose, K. Chang, G. Pekhimenko, D. Lee, O. Ergin, and O. Mutlu, “SoftMC: A Flexible and Practical Open-Source Infrastructure for Enabling Experimental DRAM Studies,” in HPCA, 2017.Google Scholar
Cross Ref
- H. Hassan, G. Pekhimenko, N. Vijaykumar, V. Seshadri, D. Lee, O. Ergin, and O. Mutlu, “ChargeCache: Reducing DRAM Latency by Exploiting Row Access Locality,” in HPCA, 2016.Google Scholar
Cross Ref
- B. He, W. Fang, Q. Luo, N. Govindaraju, and T. Wang, “Mars: A MapReduce Framework on Graphics Processors,” in PACT, 2008.Google Scholar
Digital Library
- J. L. Henning, “SPEC CPU2000: Measuring CPU Performance in the New Millennium,” IEEE Computer, 2000.Google Scholar
Digital Library
- Hewlett-Packard, “Netperf: A Network Performance Benchmark (Rev. 2.1),” 1996.Google Scholar
- U. Holzle and L. A. Barroso, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines .hskip 1em plus 0.5em minus 0.4emrelax Morgan & Claypool, 2009.Google Scholar
- I. Hur and C. Lin, “Adaptive History-Based Memory Schedulers,” in MICRO, 2004.Google Scholar
- A. Hwang, I. Stefanovici, and B. Schroeder, “Cosmic Rays Don't Strike Twice: Understanding the Nature of DRAM Errors and the Implications for System Design,” in ASPLOS, 2012.Google Scholar
Digital Library
- Hybrid Memory Cube Consortium, “Hybrid Memory Cube Specification 2.1,” 2015.Google Scholar
- IBM Corp., POWER9 Processor RegistersSpecification, Vol. 3, May 2017.Google Scholar
- Intel Corp., “Product Specification: Inteltextsuperscript® Core#8482; i7--2600K,” https://ark.intel.com/products/52214/.Google Scholar
- Intel Corp., “Product Specification: Inteltextsuperscript® Core#8482; i7--975 Processor Extreme Edition,” https://ark.intel.com/products/37153/.Google Scholar
- Intel Corp., “Product Specification: Inteltextsuperscript® Xeontextsuperscript® Processor E5--2630 v4,” https://ark.intel.com/products/92981/.Google Scholar
- Intel Corp., 7th Generation Inteltextsuperscript® Processor Families for S Platforms and Inteltextsuperscript® Core#8482; X-Series Processor Family Datasheet, Vol. 1, December 2018.Google Scholar
- Intel Corp., Inteltextsuperscript® Xeontextsuperscript® Processor E5--1600/2400/2600/4600 (E5-Product Family) Product Families Datasheet Vol. 2, May 2018.Google Scholar
- IOzone Lab, “IOzone Filesystem Benchmark,” http://www.iozone.org/, 2016.Google Scholar
- E. .Ipek, O. Mutlu, J. F. Mart'inez, and R. Caruana, “Self-Optimizing Memory Controllers: A Reinforcement Learning Approach,” in ISCA, 2008.Google Scholar
- C. Isen and L. John, “ESKIMO -- Energy Savings Using Semantic Knowledge of Inconsequential Memory Occupancy for DRAM Subsystem,” in MICRO, 2009.Google Scholar
- J. Jeddeloh and B. Keeth, “Hybrid Memory Cube New DRAM Architecture Increases Density and Performance,” in VLSIT, 2012.Google Scholar
Cross Ref
- JEDEC Solid State Technology Assn., JESD206: FBDIMM Architecture and Protocol, January 2007.Google Scholar
- JEDEC Solid State Technology Assn., JESD79--2F: DDR2 SDRAM Standard, November 2009.Google Scholar
- JEDEC Solid State Technology Assn., JESD229: Wide I/O Single Data Rate (Wide I/O SDR) Standard, December 2011.Google Scholar
- JEDEC Solid State Technology Assn., JESD79--3F: DDR3 SDRAM Standard, July 2012.Google Scholar
- JEDEC Solid State Technology Assn., JESD235: High Bandwidth Memory (HBM) DRAM, October 2013.Google Scholar
- JEDEC Solid State Technology Assn., JESD229--2: Wide I/O 2 (WideIO2) Standard, August 2014.Google Scholar
- JEDEC Solid State Technology Assn., JESD209--3C: Low Power Double Data Rate 3 (LPDDR3) Standard, August 2015.Google Scholar
- JEDEC Solid State Technology Assn., JESD212C: Graphics Double Data Rate (GDDR5) SGRAM Standard, February 2016.Google Scholar
- JEDEC Solid State Technology Assn., JESD209--4B: Low Power Double Data Rate 4 (LPDDR4) Standard, March 2017.Google Scholar
- JEDEC Solid State Technology Assn., JESD79--4B: DDR4 SDRAM Standard, June 2017.Google Scholar
- M. K. Jeong, D. H. Yoon, D. Sunwoo, M. Sullivan, I. Lee, and M. Erez, “Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems,” in HPCA, 2012.Google Scholar
Digital Library
- M. K. Jeong, M. Erez, C. Sudanthi, and N. Paver, “A QoS-Aware Memory Controller for Dynamically Balancing GPU and CPU Bandwidth Use in an MPSoC,” in DAC, 2012.Google Scholar
Digital Library
- A. Jog, O. Kayiran, A. Pattnaik, M. T. Kandemir, O. Mutlu, R. Iyer, and C. R. Das, “Exploiting Core Criticality for Enhanced GPU Performance,” in SIGMETRICS, 2016.Google Scholar
Digital Library
- U. Kang, H.-S. Yu, C. Park, H. Zheng, J. Halbert, K. Bains, S. Jang, and J. Choi, “Co-Architecting Controllers and DRAM to Enhance DRAM Process Scaling,” in The Memory Forum, 2014.Google Scholar
- D. Kaseridis, J. Stuecheli, and L. K. John, “Minimalist Open-Page: A DRAM Page-Mode Scheduling Policy for the Many-Core Era,” in MICRO , 2011.Google Scholar
Digital Library
- S. Khan, D. Lee, and O. Mutlu, “PARBOR: An Efficient System-Level Technique to Detect Data Dependent Failures in DRAM,” in DSN, 2016.Google Scholar
Cross Ref
- S. Khan, D. Lee, Y. Kim, A. R. Alameldeen, C. Wilkerson, and O. Mutlu, “The Efficacy of Error Mitigation Techniques for DRAM Retention Failures: A Comparative Experimental Study,” in SIGMETRICS, 2014.Google Scholar
Digital Library
- S. Khan, C. Wilkerson, D. Lee, A. R. Alameldeen, and O. Mutlu, “A Case for Memory Content-Based Detection and Mitigation of Data-Dependent Failures in DRAM ,” CAL, 2016.Google Scholar
- S. Khan, C. Wilkerson, Z. Wang, A. Alameldeen, D. Lee, and O. Mutlu, “Detecting and Mitigating Data-Dependent DRAM Failures by Exploiting Current Memory Content,” in MICRO, 2017.Google Scholar
Digital Library
- G. Kim, J. Kim, J. H. Ahn, and J. Kim, “Memory-Centric System Interconnect Design with Hybrid Memory Cubes,” in PACT, 2013.Google Scholar
Digital Library
- J. S. Kim, C. Oh, H. Lee, D. Lee, H. R. Hwang, S. Hwang, B. Na, J. Moon, J. G. Kim, H. Park, J. W. Ryu, K. Park, S. K. Kang, S. Y. Kim, H. Kim, J. M. Bang, H. Cho, M. Jang, C. Han, J. B. Lee, K. Kyung, J. S. Choi, and Y. H. Jun, “A 1.2V 12.8GB/s 2Gb Mobile Wide-I/O DRAM with 4x128 I/Os Using TSV-Based Stacking,” in ISSCC, 2011.Google Scholar
- J. S. Kim, M. Patel, H. Hassan, and O. Mutlu, “Solar-DRAM: Reducing DRAM Access Latency by Exploiting the Variation in Local Bitlines,” in ICCD, 2018.Google Scholar
Cross Ref
- J. S. Kim, M. Patel, H. Hassan, and O. Mutlu, “The DRAM Latency PUF: Quickly Evaluating Physical Unclonable Functions by Exploiting the Latency--Reliability Tradeoff in Modern DRAM Devices,” in HPCA, 2018.Google Scholar
Cross Ref
- J. S. Kim, M. Patel, H. Hassan, L. Orosa, and O. Mutlu, “D-RaNGe: Using Commodity DRAM Devices to Generate True Random Numbers with Low Latency and High Throughput,” in HPCA, 2019.Google Scholar
Cross Ref
- Y. Kim, D. Han, O. Mutlu, and M. Harchol-Balter, “ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers,” in HPCA, 2010.Google Scholar
- Y. Kim, M. Papamichael, O. Mutlu, and M. Harchol-Balter, “Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior,” in MICRO, 2010.Google Scholar
Digital Library
- Y. Kim, W. Yang, and O. Mutlu, “Ramulator: A Fast and Extensible DRAM Simulator,” CAL, 2015.Google Scholar
- Y. Kim, R. Daly, J. Kim, C. Fallin, J. H. Lee, D. Lee, C. Wilkerson, K. Lai, and O. Mutlu, “Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors,” in ISCA, 2014.Google Scholar
Digital Library
- Y. Kim, V. Seshadri, D. Lee, J. Liu, and O. Mutlu, “A Case for Exploiting Subarray-Level Parallelism (SALP) in DRAM,” in ISCA, 2012.Google Scholar
Digital Library
- N. Kirman, M. Kirman, M. Chaudhuri, and J. F. Martínez, “Checkpointed Early Load Retirement,” in HPCA, 2005.Google Scholar
Digital Library
- J. Kloosterman, J. Beaumont, M. Wollman, A. Sethia, R. Dreslinski, T. Mudge, and S. Mahlke, “WarpPool: Sharing Requests with Inter-Warp Coalescing for Throughput Processors,” in MICRO, 2015.Google Scholar
Digital Library
- K. Lawton, B. Denney, and C. Bothamy, “The Bochs IA-32 emulator project,” http://bochs.sourceforge.net, 2006.Google Scholar
- C. J. Lee, O. Mutlu, V. Narasiman, and Y. N. Patt, “Prefetch-Aware DRAM Controllers,” in MICRO, 2008.Google Scholar
Digital Library
- C. J. Lee, V. Narasiman, O. Mutlu, and Y. N. Patt, “Improving Memory Bank-Level Parallelism in the Presence of Prefetching,” in MICRO, 2009.Google Scholar
Digital Library
- C. J. Lee, E. Ebrahimi, V. Narasiman, O. Mutlu, and Y. N. Patt, “DRAM-Aware Last-Level Cache Writeback: Reducing Write-Caused Interference in Memory Systems,” Univ. of Texas at Austin, High Performance Systems Group, Tech. Rep. TR-HPS-2010-002, 2010.Google Scholar
- D. Lee, S. Khan, L. Subramanian, S. Ghose, R. Ausavarungnirun, G. Pekhimenko, V. Seshadri, and O. Mutlu, “Design-Induced Latency Variation in Modern DRAM Chips: Characterization, Analysis, and Latency Reduction Mechanisms,” in SIGMETRICS, 2017.Google Scholar
Digital Library
- D. Lee, L. Subramanian, R. Ausavarungnirun, J. Choi, and O. Mutlu, “Decoupled Direct Memory Access: Isolating CPU and IO Traffic by Leveraging a Dual-Data-Port DRAM,” in PACT, 2015.Google Scholar
Digital Library
- D. Lee, “Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity,” Ph.D. dissertation, Carnegie Mellon Univ., 2016.Google Scholar
- D. Lee, S. Ghose, G. Pekhimenko, S. Khan, and O. Mutlu, “Simultaneous Multi-Layer Access: Improving 3D-Stacked Memory Bandwidth at Low Cost,” TACO, 2016.Google Scholar
Digital Library
- D. Lee, Y. Kim, G. Pekhimenko, S. Khan, V. Seshadri, K. Chang, and O. Mutlu, “Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case,” in HPCA, 2015.Google Scholar
Cross Ref
- D. Lee, Y. Kim, V. Seshadri, J. Liu, L. Subramanian, and O. Mutlu, “Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture,” in HPCA, 2013.Google Scholar
- C. Lefurgy, K. Rajamani, F. Rawson, W. Felter, M. Kistler, and T. Keller, “Energy Management for Commercial Servers,” Computer, 2003.Google Scholar
- Lenovo Group Ltd., “Intel Xeon Scalable Family Balanced Memory Configurations,” https://lenovopress.com/lp0742.pdf, 2017.Google Scholar
- A. Li, W. Liu, M. R. B. Kistensen, B. Vinter, H. Wang, K. Hou, A. Marquez, and S. L. Song, “Exploring and Analyzing the Real Impact of Modern On-Package Memory on HPC Scientific Kernels,” in SC, 2017.Google Scholar
- S. Li, D. Reddy, and B. Jacob, “A Performance & Power Comparison of Modern High-Speed DRAM Architectures,” in MEMSYS, 2018.Google Scholar
Digital Library
- J. Liu, B. Jaiyen, R. Veras, and O. Mutlu, “RAIDR: Retention-Aware Intelligent DRAM Refresh,” in ISCA, 2012.Google Scholar
Digital Library
- J. Liu, B. Jaiyen, Y. Kim, C. Wilkerson, and O. Mutlu, “An Experimental Study of Data Retention Behavior in Modern DRAM Devices: Implications for Retention Time Profiling Mechanisms,” in ISCA, 2013.Google Scholar
- G. H. Loh, “3D-Stacked Memory Architectures for Multi-Core Processors,” in ISCA, 2008.Google Scholar
- C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood, “Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation,” in PLDI, 2005.Google Scholar
Digital Library
- K. Luo, J. Gummaraju, and M. Franklin, “Balancing Throughput and Fairness in SMT Processors,” in ISPASS, 2001.Google Scholar
- K. T. Malladi, F. A. Nothaft, K. Periyathambi, B. C. Lee, C. Kozyrakis, and M. Horowitz, “Towards Energy-Proportional Datacenter Memory with Mobile DRAM,” in ISCA, 2012.Google Scholar
Digital Library
- J. A. Mandelman, R. H. Dennard, G. B. Bronner, J. K. DeBrosse, R. Divakaruni, Y. Li, and C. J. Radens, “Challenges and Future Directions for the Scaling of Dynamic Random-Access Memory (DRAM),” IBM JRD, 2002.Google Scholar
Digital Library
- J. D. McCalpin, “Memory Bandwidth and Machine Balance in Current High Performance Computers,” TCCA Newsletter, 1995.Google Scholar
- J. Meza, Q. Wu, S. Kumar, and O. Mutlu, “Revisiting Memory Errors in Large-Scale Production Data Centers: Analysis and Modeling of New Trends from the Field,” in DSN, 2015.Google Scholar
Digital Library
- Micron Technology, Inc., Technical Note TN-46--12: Mobile DRAM Power-Saving Features and Calculations, May 2009, https://www.micron.com/ /media/documents/products/technical-note/dram/tn4612.pdf.Google Scholar
- Micron Technology, Inc., “DDR3 SDRAM Verilog Model, v. 1.74,” https://www.micron.com/-/media/client/global/documents/products/sim-model/dram/ddr3/ddr3-sdram-verilog-model.zip, 2015.Google Scholar
- Micron Technology, Inc., 178-Ball 2E0F Mobile LPDDR3 SDRAM Data Sheet, April 2016.Google Scholar
- Micron Technology, Inc., 2Gb: x4, x8, x16 DDR3 SDRAM Data Sheet , February 2016.Google Scholar
- Micron Technology, Inc., 200-Ball Z01M LPDDR4 SDRAM Automotive Data Sheet, May 2018.Google Scholar
- Micron Technology, Inc., 4Gb: x4, x8, x16 DDR4 SDRAM Data Sheet , June 2018.Google Scholar
- T. Moscibroda and O. Mutlu, “Distributed Order Scheduling and Its Application to Multi-Core DRAM Controllers,” in PODC, 2008.Google Scholar
Digital Library
- J. Mukundan and J. F. Mart'inez, “MORSE: Multi-objective Reconfigurable Self-Optimizing Memory Scheduler,” in HPCA, 2012.Google Scholar
Digital Library
- S. P. Muralidhara, L. Subramanian, O. Mutlu, M. Kandemir, and T. Moscibroda, “Reducing Memory Interference in Multicore Systems via Application-Aware Memory Channel Partitioning,” in MICRO, 2011.Google Scholar
Digital Library
- R. C. Murphy and P. M. Kogge, “On the Memory Access Patterns of Supercomputer Applications: Benchmark Selection and Its Implications,” TC, 2007.Google Scholar
- O. Mutlu, “The RowHammer Problem and Other Issues We May Face as Memory Becomes Denser,” in DATE, 2017.Google Scholar
Digital Library
- O. Mutlu and T. Moscibroda, “Parallelism-Aware Batch Scheduling: Enhancing Both Performance and Fairness of Shared DRAM Systems,” in ISCA, 2008.Google Scholar
Digital Library
- O. Mutlu, “Memory Scaling: A Systems Architecture Perspective,” in IMW, 2013.Google Scholar
Cross Ref
- O. Mutlu, H. Kim, and Y. N. Patt, “Techniques for Efficient Processing in Runahead Execution Engines,” in ISCA, 2005.Google Scholar
- O. Mutlu, H. Kim, and Y. N. Patt, “Efficient Runahead Execution: Power-Efficient Memory Latency Tolerance,” IEEE Micro, 2006.Google Scholar
Digital Library
- O. Mutlu and J. S. Kim, “RowHammer: A Retrospective,” TCAD, 2019.Google Scholar
Digital Library
- O. Mutlu and T. Moscibroda, “Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors,” in MICRO, 2007.Google Scholar
Digital Library
- O. Mutlu, J. Stark, C. Wilkerson, and Y. N. Patt, “Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors,” in HPCA, 2003.Google Scholar
Digital Library
- K. J. Nesbit, N. Aggarwal, J. Laudon, and J. E. Smith, “Fair Queuing Memory Systems,” in MICRO, 2006.Google Scholar
Digital Library
- NVIDIA Corp., “GeForce GTX 480: Specifications,” https://www.geforce.com/hardware/desktop-gpus/geforce-gtx-480/specifications.Google Scholar
- NXP Semiconductors, “QorIQ Processing Platforms: 64-Bit Multicore SoCs,” https://www.nxp.com/products/processors-and-microcontrollers/applications-processors/qoriq-platforms:QORIQ_HOME.Google Scholar
- M. Patel, J. S. Kim, H. Hassan, and O. Mutlu, “Understanding and Modeling On-Die Error Correction in Modern DRAM: An Experimental Study Using Real Devices,” in DSN, 2019.Google Scholar
Cross Ref
- M. Patel, J. Kim, and O. Mutlu, “The Reach Profiler (REAPER): Enabling the Mitigation of DRAM Retention Failures via Profiling at Aggressive Conditions,” in ISCA, 2017.Google Scholar
- I. Paul, W. Huang, M. Arora, and S. Yalamanchili, “Harmonia: Balancing Compute and Memory Power in High-Performance GPUs,” in ISCA, 2015.Google Scholar
Digital Library
- J. T. Pawlowski, “Hybrid Memory Cube (HMC),” in HC, 2011.Google Scholar
- S. Pelley, “atomic-memory-trace,” https://github.com/stevenpelley/atomic-memory-trace, 2013.Google Scholar
- S. Peter, J. Li, I. Zhang, D. R. Ports, D. Woos, A. Krishnamurthy, T. Anderson, and T. Roscoe, “Arrakis: The Operating System Is the Control Plane,” TOCS, 2016.Google Scholar
Digital Library
- M. K. Qureshi, D. H. Kim, S. Khan, P. J. Nair, and O. Mutlu, “AVATAR: A Variable-Retention-Time (VRT) Aware Refresh for DRAM Systems,” in DSN, 2015.Google Scholar
Digital Library
- M. K. Qureshi, D. N. Lynch, O. Mutlu, and Y. N. Patt, “A Case for MLP-Aware Cache Replacement,” in ISCA, 2006.Google Scholar
- M. Radulovic, D. Zivanovic, D. Ruiz, B. R. de Supinski, S. A. McKee, P. Radojković, and E. Ayaguadé, “Another Trip to the Wall: How Much Will Stacked DRAM Benefit HPC?” in MEMSYS, 2015.Google Scholar
Digital Library
- S. Rixner, “Memory Controller Optimizations for Web Servers,” in MICRO, 2004.Google Scholar
Digital Library
- S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens, “Memory Access Scheduling,” in ISCA, 2000.Google Scholar
- T. Rokicki, “Indexing Memory Banks to Maximize Page Mode Hit Percentage and Minimize Memory Latency,” HP Laboratories Palo Alto, Tech. Rep. HPL-96--95, 1996.Google Scholar
- P. Rosenfeld, E. Cooper-Balis, T. Farrell, D. Resnick, and B. Jacob, “Peering Over the Memory Wall: Design Space and Performance Analysis of the Hybrid Memory Cube,” Univ. of Maryland Systems and Computer Architecture Group, Tech. Rep. UMD-SCA-2012--10-01, 2012.Google Scholar
- P. Rosenfeld, E. Cooper-Balis, and B. Jacob, “DRAMSim2: A Cycle Accurate Memory System Simulator,” CAL, 2011.Google Scholar
Digital Library
- SAFARI Research Group, “GPGPUSimGoogle Scholar
- Ramulator -- GitHub Repository,” https://github.com/Carnegie Mellon University-SAFARI/GPGPUSim-Ramulator.Google Scholar
- SAFARI Research Group, “MemBen: A Memory Benchmark Suite for Ramulator -- GitHub Repository,” https://github.com/Carnegie Mellon University-SAFARI/MemBen.Google Scholar
- SAFARI Research Group, “Ramulator: A DRAM Simulator -- GitHub Repository,” https://github.com/Carnegie Mellon University-SAFARI/ramulator.Google Scholar
- B. Schroeder, E. Pinheiro, and W.-D. Weber, “DRAM Errors in the Wild: A Large-Scale Field Study,” in SIGMETRICS, 2009.Google Scholar
Digital Library
- V. Seshadri, A. Bhowmick, O. Mutlu, P. B. Gibbons, M. A. Kozuch, and T. C. Mowry, “The Dirty-Block Index,” in ISCA, 2014.Google Scholar
Digital Library
- V. Seshadri, Y. Kim, C. Fallin, D. Lee, R. Ausavarungnirun, G. Pekhimenko, Y. Luo, O. Mutlu, P. B. Gibbons, M. A. Kozuch, and T. C. Mowry, “RowClone: Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization,” in MICRO, 2013.Google Scholar
Digital Library
- V. Seshadri, D. Lee, T. Mullins, H. Hassan, A. Boroumand, J. Kim, M. A. Kozuch, O. Mutlu, P. B. Gibbons, and T. C. Mowry, “Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology,” in MICRO, 2017.Google Scholar
Digital Library
- V. Seshadri and O. Mutlu, “In-DRAM Bulk Bitwise Execution Engine,” in Advances in Computers, 2020, available as arXiv:1905.09822 [cs.AR].Google Scholar
- S. Singh and M. Awasthi, “Memory Centric Characterization and Analysis of SPEC CPU2017 Suite,” in ICPE, 2019.Google Scholar
Digital Library
- SK Hynix Inc., 2Gb (64Mx32) GDDR5 SGRAM Data Sheet, November 2011.Google Scholar
- A. Snavely and D. M. Tullsen, “Symbiotic Jobscheduling for a Simultaneous Multithreading Processor,” in ASPLOS, 2000.Google Scholar
- Y. H. Son, S. O, Y. Ro, J. W. Lee, and J. H. Ahn, “Reducing Memory Access Latency with Asymmetric DRAM Bank Organizations,” in ISCA, 2013.Google Scholar
- V. Sridharan, J. Stearley, N. DeBardeleben, S. Blanchard, and S. Gurumurthi, “Feng Shui of Supercomputer Memory: Positional Effects in DRAM and SRAM Faults,” in SC, 2013.Google Scholar
- V. Sridharan, N. DeBardeleben, S. Blanchard, K. B. Ferreira, J. Stearley, J. Shalf, and S. Gurumurthi, “Memory Errors in Modern Systems: The Good, The Bad, and the Ugly,” in ASPLOS, 2015.Google Scholar
Digital Library
- Standard Performance Evaluation Corp., “SPEC CPU2006 Benchmarks,” http://www.spec.org/cpu2006/.Google Scholar
- J. Stuecheli, D. Kaseridis, H. C. Hunter, and L. K. John, “Elastic Refresh: Techniques to Mitigate Refresh Penalties in High Density Memory,” in MICRO, 2010.Google Scholar
Digital Library
- J. Stuecheli, D. Kaseridis, D. Daly, H. C. Hunter, and L. K. John, “The Virtual Write Queue: Coordinating DRAM and Last-Level Cache Policies,” in ISCA, 2010.Google Scholar
- L. Subramanian, D. Lee, V. Seshadri, H. Rastogi, and O. Mutlu, “BLISS: Balancing Performance, Fairness and Complexity in Memory Access Scheduling,” TPDS, 2016.Google Scholar
Digital Library
- L. Subramanian, D. Lee, V. Seshadri, H. Rastogi, and O. Mutlu, “The Blacklisting Memory Scheduler: Achieving High Performance and Fairness at Low Cost,” in ICCD, 2014.Google Scholar
Cross Ref
- B. Sun, X. Li, Z. Zhu, and X. Zhou, “Behavior Gaps and Relations between Operating System and Applications on Accessing DRAM,” in ICECCS, 2014.
- A. Suresh, P. Cicotti, and L. Carrington, “Evaluation of Emerging Memory Technologies for HPC, Data Intensive Applications,” in CLUSTER, 2014.
- X. Tang, M. Kandemir, P. Yedlapalli, and J. Kotra, “Improving Bank-Level Parallelism for Irregular Applications,” in MICRO, 2016.
- J. Tuck, L. Ceze, and J. Torrellas, “Scalable Cache Miss Handling for High Memory-Level Parallelism,” in MICRO, 2006.
- R. Ubal, B. Jang, P. Mistry, D. Schaa, and D. Kaeli, “Multi2Sim: A Simulation Framework for CPU-GPU Computing,” in PACT, 2012.
- United States Department of Energy, “CORAL Benchmark Codes,” https://asc.llnl.gov/CORAL-benchmarks/, 2014.
- United States Department of Energy, “CORAL-2 Benchmarks,” https://asc.llnl.gov/coral-2-benchmarks/, 2017.
- H. Usui, L. Subramanian, K. K. Chang, and O. Mutlu, “DASH: Deadline-Aware High-Performance Memory Scheduler for Heterogeneous Systems with Hardware Accelerators,” TACO, 2016.
- R. K. Venkatesan, S. Herr, and E. Rotenberg, “Retention-Aware Placement in DRAM (RAPID): Software Methods for Quasi-Non-Volatile DRAM,” in HPCA, 2006.
- Y. Wang, A. Tavakkol, L. Orosa, S. Ghose, N. Mansouri Ghiasi, M. Patel, J. S. Kim, H. Hassan, M. Sadrosadati, and O. Mutlu, “Reducing DRAM Latency via Charge-Level-Aware Look-Ahead Partial Restoration,” in MICRO, 2018.
- M. Ware, K. Rajamani, M. Floyd, B. Brock, J. C. Rubio, F. Rawson, and J. B. Carter, “Architecting for Power Management: The IBM POWER7 Approach,” in HPCA, 2010.
- D. H. Yoon, J. Chang, N. Muralimanohar, and P. Ranganathan, “BOOM: Enabling Mobile Memory Based Low-Power Server DIMMs,” in ISCA, 2012.
- G. L. Yuan, A. Bakhoda, and T. M. Aamodt, “Complexity Effective Memory Access Scheduling for Many-Core Accelerator Architectures,” in MICRO, 2009.
- J. Zawodny, “Redis: Lightweight Key/Value Store That Goes the Extra Mile,” in Linux Magazine, 2009.
- X. Zhang, Y. Zhang, B. R. Childers, and J. Yang, “Restore Truncation for Performance Improvement in Future DRAM Systems,” in HPCA, 2016.
- Z. Zhang, Z. Zhu, and X. Zhang, “A Permutation-Based Page Interleaving Scheme to Reduce Row-Buffer Conflicts and Exploit Data Locality,” in MICRO, 2000.
- J. Zhao, O. Mutlu, and Y. Xie, “FIRM: Fair and High-Performance Memory Control for Persistent Memory Systems,” in MICRO, 2014.
- H. Zheng and Z. Zhu, “Power and Performance Trade-Offs in Contemporary DRAM System Designs for Multicore Processors,” TC, 2010.
- Z. Zhu and Z. Zhang, “A Performance Comparison of DRAM Memory System Optimizations for SMT Processors,” in HPCA, 2005.
- W. Zuravleff and T. Robinson, “Controller for a Synchronous DRAM That Maximizes Throughput by Allowing Memory Requests and Commands to Be Issued Out of Order,” U.S. Patent No. 5,630,096, 1997.