Research article · ASPLOS Conference Proceedings · DOI: 10.1145/1736020.1736045

Micro-pages: increasing DRAM efficiency with locality-aware data placement

Published: 13 March 2010

ABSTRACT

Power consumption and DRAM latency are serious concerns in modern chip-multiprocessor (CMP, or multi-core) compute systems. Management of the DRAM row buffer can significantly impact both power consumption and latency. Modern DRAM systems read data from cell arrays and populate a row buffer as large as 8 KB on a memory request, but only a small fraction of these bits is ever returned to the CPU. Energy and time are thus wasted reading (and subsequently writing back) bits that are rarely used. Traditionally, an open-page policy has been used for uni-processor systems, and it has worked well because of spatial and temporal locality in the access stream. In future multi-core processors, the possibly independent access streams of each core are interleaved, destroying the available locality and significantly under-utilizing the contents of the row buffer. In this work, we attempt to improve row-buffer utilization for future multi-core systems.
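The effect described above can be illustrated with a toy model (not from the paper): a single sequential stream enjoys a high open-page hit rate, but interleaving a second, independent stream at the memory controller forces a row activation on nearly every access. The addresses and row size below are illustrative.

```python
# Toy open-page row-buffer model (illustrative sketch, not the paper's simulator).
def row_buffer_hits(addresses, row_size=8192):
    """Count open-page hits: an access hits if it falls in the currently open row."""
    open_row = None
    hits = 0
    for addr in addresses:
        row = addr // row_size
        if row == open_row:
            hits += 1
        else:
            open_row = row  # row miss: close the old row, activate the new one
    return hits

# One core streaming sequentially through 64-byte cache blocks: high locality.
stream_a = [i * 64 for i in range(256)]               # spans rows 0 and 1
stream_b = [10_000_000 + i * 64 for i in range(256)]  # a distant region

solo_hits = row_buffer_hits(stream_a)

# Two cores' streams interleaved at the memory controller: each access evicts
# the other stream's open row, so the available locality is destroyed.
interleaved = [a for pair in zip(stream_a, stream_b) for a in pair]
mixed_hits = row_buffer_hits(interleaved)

print(solo_hits, mixed_hits)  # prints: 254 0
```

The solo stream misses only on its first access and at the one row boundary it crosses; the interleaved trace never hits, which is the under-utilization the abstract targets.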

The schemes presented here are motivated by our observation that a large number of accesses within heavily accessed OS pages are to small, contiguous "chunks" of cache blocks. Thus, co-locating chunks from different OS pages in a row buffer improves the overall utilization of the row-buffer contents and consequently reduces memory energy consumption and access time. Such co-location can be achieved in many ways, notably by reducing the OS page size and by software- or hardware-assisted migration of data within DRAM. We explore these mechanisms and discuss the trade-offs involved, along with the energy and performance improvements from each scheme. On average, for applications with room for improvement, our best-performing scheme increases performance by 9% (max. 18%) and reduces memory energy consumption by 15% (max. 70%).
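The co-location idea can be sketched as a small indirection table that redirects hot chunks from different OS pages into one reserved DRAM row, so a single activation serves all of them. Everything here is a hypothetical illustration under assumed granularities (1 KB chunks, 8 KB rows, a made-up reserved region); it is not the paper's actual hardware mechanism.

```python
# Hedged sketch of chunk co-location: hot "chunks" (small runs of cache blocks)
# from different OS pages are remapped into one reserved DRAM row. The table,
# chunk size, and reserved base address are illustrative assumptions.
CHUNK = 1024   # assumed chunk size: 16 contiguous 64 B cache blocks
ROW = 8192     # row-buffer size from the abstract

remap = {}                    # chunk-aligned address -> slot in the reserved row
RESERVED_BASE = 0x4000_0000   # hypothetical reserved DRAM region

def colocate(hot_chunks):
    """Assign each hot chunk a slot in one reserved row (ROW // CHUNK slots)."""
    for slot, chunk_addr in enumerate(hot_chunks[: ROW // CHUNK]):
        remap[chunk_addr] = RESERVED_BASE + slot * CHUNK

def translate(addr):
    """Redirect accesses to remapped chunks; all others pass through unchanged."""
    base = addr - addr % CHUNK
    if base in remap:
        return remap[base] + addr % CHUNK
    return addr

# Hot chunks drawn from four different OS pages now share a single row.
colocate([0x1000, 0x23400, 0x87800, 0xABC00])
rows = {translate(a) // ROW for a in (0x1010, 0x23440, 0x87880, 0xABC3F)}
print(len(rows))  # 1 -- one row activation now covers all four hot chunks
```

Without the remapping, those four accesses would open four distinct rows; with it, they fall into one open row, which is the utilization gain the abstract quantifies.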

