ABSTRACT
Power consumption and DRAM latencies are serious concerns in modern chip-multiprocessor (CMP or multi-core) based compute systems. The management of the DRAM row buffer can significantly impact both power consumption and latency. Modern DRAM systems read data from cell arrays and populate a row buffer as large as 8 KB on a memory request. But only a small fraction of these bits are ever returned back to the CPU. This ends up wasting energy and time to read (and subsequently write back) bits which are used rarely. Traditionally, an open-page policy has been used for uni-processor systems and it has worked well because of spatial and temporal locality in the access stream. In future multi-core processors, the possibly independent access streams of each core are interleaved, thus destroying the available locality and significantly under-utilizing the contents of the row buffer. In this work, we attempt to improve row-buffer utilization for future multi-core systems.
The schemes presented here are motivated by our observations that a large number of accesses within heavily accessed OS pages are to small, contiguous "chunks" of cache blocks. Thus, the co-location of chunks (from different OS pages) in a row-buffer will improve the overall utilization of the row buffer contents, and consequently reduce memory energy consumption and access time. Such co-location can be achieved in many ways, notably involving a reduction in OS page size and software or hardware assisted migration of data within DRAM. We explore these mechanisms and discuss the trade-offs involved along with energy and performance improvements from each scheme. On average, for applications with room for improvement, our best performing scheme increases performance by 9% (max. 18%) and reduces memory energy consumption by 15% (max. 70%).
- STREAM -- Sustainable Memory Bandwidth in High Performance Computers. http://www.cs.virginia.edu/stream/.Google Scholar
- Virtutech Simics Full System Simulator. http://www.virtutech.com.Google Scholar
- Java Server Benchmark, 2005. Available at http://www.spec.org/jbb2005/.Google Scholar
- K. Albayraktaroglu, A. Jaleel, X. Wu, M. Franklin, B. Jacob, C.-W. Tseng, and D. Yeung. BioBench: A Benchmark Suite of Bioinformatics Applications. In Proceedings of ISPASS, 2005. Google Scholar
Digital Library
- K. Asanovic and et. al. The Landscape of Parallel Computing Research: A View from Berkeley. Technical report, EECS Department, University of California, Berkeley, 2006.Google Scholar
- M. Awasthi, K. Sudan, R. Balasubramonian, and J. Carter. Dynamic Hardware-Assisted Software-Controlled Page Placement to Manage Capacity Allocation and Sharing within Large Caches. In Proceedings of HPCA, 2009.Google Scholar
Cross Ref
- D. Bailey, E. Barszcz, J. Barton, D. Browning, R. Carter, D. Dagum, R. Fatoohi, P. Frederickson, T. Lasinski, R. Schreiber, H. Simon, V. Venkatakrishnan, and S. K. Weeratunga. The NAS Parallel Benchmarks. International Journal of Supercomputer Applications, 5(3): 63.73, Fall 1991.Google Scholar
- L. Barroso and U. Holzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Morgan & Claypool, 2009. Google Scholar
Digital Library
- C. Benia, S. Kumar, J. P. Singh, and K. Li. The PARSEC Benchmark Suite: Characterization and Architectural Implications. Technical report, Department of Computer Science, Princeton University, 2008.Google Scholar
- B. Bershad, B. Chen, D. Lee, and T. Romer. Avoiding Conflict Misses Dynamically in Large Direct-Mapped Caches. In Proceedings of ASPLOS, 1994. Google Scholar
Digital Library
- J. Carter, W. Hsieh, L. Stroller, M. Swanson, L. Zhang, E. Brunvand, A. Davis, C.-C. Kuo, R. Kuramkote, M. Parker, L. Schaelicke, and T. Tateyama. Impulse: Building a Smarter Memory Controller. In Proceedings of HPCA, 1999. Google Scholar
Digital Library
- R. Chandra, S. Devine, B. Verghese, A. Gupta, and M. Rosenblum. Scheduling and Page Migration for Multiprocessor Compute Servers. In Proceedings of ASPLOS, 1994. Google Scholar
Digital Library
- M. Chaudhuri. PageNUCA: Selected Policies For Page-Grain Locality Management In Large Shared Chip-Multiprocessor Caches. In Proceedings of HPCA, 2009.Google Scholar
- S. Cho and L. Jin. Managing Distributed, Shared L2 Caches through OS-Level Page Allocation. In Proceedings of MICRO, 2006. Google Scholar
Digital Library
- J. Corbalan, X. Martorell, and J. Labarta. Page Migration with Dynamic Space-Sharing Scheduling Policies: The case of SGI 02000. International Journal of Parallel Programming, 32(4), 2004. Google Scholar
Digital Library
- R. Crisp. Direct Rambus Technology: The New Main Memory Standard. In Proceedings of MICRO, 1997.Google Scholar
Digital Library
- V. Cuppu and B. Jacob. Concurrency, Latency, or System Overhead: Which Has the Largest Impact on Uniprocessor DRAM-System Performance. In Proceedings of ISCA, 2001. Google Scholar
Digital Library
- V. Cuppu, B. Jacob, B. Davis, and T. Mudge. A Performance Comparison of Contemporary DRAM Architectures. In Proceedings of ISCA, 1999. Google Scholar
Digital Library
- V. Delaluz, M. Kandemir, N. Vijaykrishnan, A. Sivasubramaniam, and M. Irwin. DRAM Energy Management Using Software and Hardware Directed Power Mode Control. In Proceedings of HPCA, 2001. Google Scholar
Digital Library
- X. Ding, D. S. Nikopoulosi, S. Jiang, and X. Zhang. MESA: Reducing Cache Conflicts by Integrating Static and Run-Time Methods. In Proceedings of ISPASS, 2006.Google Scholar
- X. Fan, H. Zeng, and C. Ellis. Memory Controller Policies for DRAM Power Management. In Proceedings of ISLPED, 2001. Google Scholar
Digital Library
- Z. Fang, L. Zhang, J. Carter, S. McKee, and W. Hsieh. Online Superpage Promotion Revisited (Poster Session). SIGMETRICS Perform. Eval. Rev., 2000. Google Scholar
Digital Library
- N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki. Reactive NUCA: Near-Optimal Block Placement And Replication In Distributed Caches. In Proceedings of ISCA, 2009. Google Scholar
Digital Library
- J. L. Henning. SPEC CPU2006 Benchmark Descriptions. In Proceedings of ACM SIGARCH Computer Architecture News, 2005. Google Scholar
Digital Library
- H. Huang, P. Pillai, and K. G. Shin. Design And Implementation Of Power-Aware Virtual Memory. In Proceedings Of The Annual Conference On Usenix Annual Technical Conference, 2003. Google Scholar
Digital Library
- H. Huang, K. Shin, C. Lefurgy, and T. Keller. Improving Energy Efficiency by Making DRAM Less Randomly Accessed. In Proceedings of ISLPED, 2005. Google Scholar
Digital Library
- Intel 845G/845GL/845GV Chipset Datasheet: Intel 82845G/82845GL/82845GV Graphics and Memory Controller Hub (GMCH). Intel Corporation, 2002. http://download.intel.com/design/chipsets/datashts/29074602.pdf.Google Scholar
- ITRS. International Technology Roadmap for Semiconductors, 2007 Edition. http://www.itrs.net/Links/2007ITRS/Home2007.htm.Google Scholar
- B. Jacob, S.W. Ng, and D. T.Wang. Memory Systems -- Cache, DRAM, Disk. Elsevier, 2008. Google Scholar
Digital Library
- JEDEC. JESD79: Double Data Rate (DDR) SDRAM Specification. JEDEC Solid State Technology Association, Virginia, USA, 2003.Google Scholar
- N. Jouppi. Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers. In Proceedings of ISCA-17, pages 364.373, May 1990. Google Scholar
Digital Library
- R. E. Kessler and M. D. Hill. Page Placement Algorithms for Large Real-Indexed Caches. ACM Trans. Comput. Syst., 10(4), 1992. Google Scholar
Digital Library
- D. E. Knuth. The Art of Computer Programming: Fundamental Algorithms, volume 1. Addison-Wesley, third edition, 1997. Google Scholar
Digital Library
- R. LaRowe and C. Ellis. Experimental Comparison of Memory Management Policies for NUMA Multiprocessors. Technical report, 1990. Google Scholar
Digital Library
- R. LaRowe and C. Ellis. Page Placement policies for NUMA multiprocessors. J. Parallel Distrib. Comput., 11(2), 1991. Google Scholar
Digital Library
- R. LaRowe, J. Wilkes, and C. Ellis. Exploiting Operating System Support for Dynamic Page Placement on a NUMA Shared Memory Multiprocessor. In Proceedings of PPOPP, 1991. Google Scholar
Digital Library
- K. Lim, J. Chang, T. Mudge, P. Ranganathan, S. Reinhardt, and T. Wenisch. Disaggregated Memory for Expansion and Sharing in Blade Servers. In Proceedings of ISCA, 2009. Google Scholar
Digital Library
- K. Lim, P. Ranganathan, J. Chang, C. Patel, T. Mudge, and S. Reinhardt. Understanding and Designing New Server Architectures for Emerging Warehouse--Computing Environments. In Proceedings of ISCA, 2008. Google Scholar
Digital Library
- W. Lin, S. Reinhardt, and D. Burger. Designing a Modern Memory Hierarchy with Hardware Prefetching. In Proceedings of IEEE Transactions on Computers, 2001. Google Scholar
Digital Library
- P. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A Full System Simulation Platform. IEEE Computer, 35(2):50.58, February 2002. Google Scholar
Digital Library
- Micron DDR2 SDRAM Part MT47H64M8. Micron Technology Inc., 2004.Google Scholar
- R. Min and Y. Hu. Improving Performance of Large Physically Indexed Caches by Decoupling Memory Addresses from Cache Addresses. IEEE Trans. Comput., 50(11), 2001. Google Scholar
Digital Library
- N. Muralimanohar, R. Balasubramonian, and N. Jouppi. Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0. In Proceedings of MICRO, 2007. Google Scholar
Digital Library
- O. Mutlu and T. Moscibroda. Stall--Time Fair Memory Access Scheduling for Chip Multiprocessors. In Proceedings of MICRO, 2007. Google Scholar
Digital Library
- O. Mutlu and T. Moscibroda. Parallelism-Aware Batch Scheduling: Enhancing Both Performance and Fairness of Shared DRAM Systems. In Proceedings of ISCA, 2008. Google Scholar
Digital Library
- J. Navarro, S. Iyer, P. Druschel, and A. Cox. Practical, Transparent Operating SystemGoogle Scholar
- N. Rafique, W. Lim, and M. Thottethodi. Architectural Support for Operating System Driven CMP Cache Management. In Proceedings of PACT, 2006. Google Scholar
Digital Library
- S. Rixner, W. Dally, U. Kapasi, P. Mattson, and J. Owens. Memory Access Scheduling. In Proceedings of ISCA, 2000. Google Scholar
Digital Library
- T. Romer, W. Ohlrich, A. Karlin, and B. Bershad. Reducing TLB and Memory Overhead Using Online Superpage Promotion. In Proceedings of ISCA-22, 1995. Google Scholar
Digital Library
- T. Sherwood, B. Calder, and J. Emer. Reducing Cache Misses Using Hardware and Software Page Placement. In Proceedings of SC, 1999. Google Scholar
Digital Library
- A. Snavely, D. Tullsen, and G. Voelker. Symbiotic Jobscheduling with Priorities for a Simultaneous Multithreading Processor. In Proceedings of SIGMETRICS, 2002. Google Scholar
Digital Library
- M. Swanson, L. Stoller, and J. Carter. Increasing TLB Reach using Superpages Backed by Shadow Memory. In Proceedings of ISCA, 1998. Google Scholar
Digital Library
- M. Talluri and M. D. Hill. Surpassing the TLB Performance of Superpages with Less Operating System Support. In Proceedings of ASPLOS-VI, 1994. Google Scholar
Digital Library
- S. Thoziyoor, N. Muralimanohar, and N. Jouppi. CACTI 5.0. Technical report, HP Laboratories, 2007.Google Scholar
- B. Verghese, S. Devine, A. Gupta, and M. Rosenblum. Operating system support for improving data locality on CC-NUMA compute servers. SIGPLAN Not., 31(9), 1996. Google Scholar
Digital Library
- D. Wallin, H. Zeffer, M. Karlsson, and E. Hagersten. VASA: A Simulator Infrastructure with Adjustable Fidelity. In Proceedings of IASTED International Conference on Parallel and Distributed Computing and Systems, 2005.Google Scholar
- D. Wang, B. Ganesh, N. Tuaycharoen, K. Baynes, A. Jaleel, and B. Jacob. DRAMsim: A Memory-System Simulator. In SIGARCH Computer Architecture News, volume 33, September 2005. Google Scholar
Digital Library
- X. Zhang, S. Dwarkadas, and K. Shen. Hardware Execution Throttling for Multi-core Resource Management. In Proceedings of USENIX, 2009. Google Scholar
Digital Library
- Z. Zhang, Z. Zhu, and X. Zhand. A Permutation-Based Page Interleaving Scheme to Reduce Row--Buffer Conflicts and Exploit Data Locality. In Proceedings of MICRO, 2000. Google Scholar
Digital Library
- H. Zheng, J. Lin, Z. Zhang, E. Gorbatov, H. David, and Z. Zhu. Mini-Rank: Adaptive DRAM Architecture For Improving Memory Power Efficiency. In Proceedings of MICRO, 2008. Google Scholar
Digital Library
- H. Zheng, J. Lin, Z. Zhang, and Z. Zhu. Decoupled DIMM: Building High-Bandwidth Memory System from Low-Speed DRAM Devices. In Proceedings of ISCA, 2009. Google Scholar
Digital Library
- Z. Zhu and Z. Zhang. A Performance Comparison of DRAM Memory System Optimizations for SMT Processors. In Proceedings of HPCA, 2005. Google Scholar
Digital Library
- Z. Zhu, Z. Zhang, and X. Zhang. Fine-grain Priority Scheduling on Multi-channel Memory Systems. In Proceedings of HPCA, 2002 Google Scholar
Digital Library
Index Terms
Micro-pages: increasing DRAM efficiency with locality-aware data placement
Recommendations
Micro-pages: increasing DRAM efficiency with locality-aware data placement
ASPLOS '10Power consumption and DRAM latencies are serious concerns in modern chip-multiprocessor (CMP or multi-core) based compute systems. The management of the DRAM row buffer can significantly impact both power consumption and latency. Modern DRAM systems ...
Micro-pages: increasing DRAM efficiency with locality-aware data placement
ASPLOS '10Power consumption and DRAM latencies are serious concerns in modern chip-multiprocessor (CMP or multi-core) based compute systems. The management of the DRAM row buffer can significantly impact both power consumption and latency. Modern DRAM systems ...
Health-Binning: Maximizing the Performance and the Endurance of Consumer-Level NAND Flash
SYSTOR '16: Proceedings of the 9th ACM International on Systems and Storage ConferenceIn recent years, the adoption of NAND flash in enterprise storage systems has been progressing rapidly. Todays all-flash storage arrays exhibit excellent I/O throughput, latency, storage density, and energy efficiency. However, the advancements in NAND ...








Comments