Abstract
We propose translation-enabled memory prefetching optimizations, or TEMPO, a low-overhead hardware mechanism that boosts memory performance by exploiting the operating system's (OS) virtual memory subsystem. We are the first to make the following observations: (1) a substantial fraction (20-40%) of DRAM references in modern big-data workloads are devoted to accessing page tables; and (2) when a memory reference requires a page table lookup in DRAM, the subsequent data access also goes to DRAM in the vast majority of cases (98%+). TEMPO exploits these observations to prefetch the data that page tables point to into the DRAM row buffer and on-chip caches. TEMPO requires trivial changes to the memory controller (under 3% additional area), no OS or application changes, and improves performance by 10-30% and energy by 1-14%.
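The idea in the abstract can be illustrated with a toy model. The sketch below is not the paper's design; it is a minimal single-bank memory controller simulation, assuming 4 KiB pages, 64-byte cache lines, 8 KiB DRAM rows, and page table entries that keep flag bits in their low 12 bits. The class `ToyMemoryController`, the `is_leaf_pte` tag, and the `page_offset` hint passed along with the page-walk request are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

PAGE_SHIFT = 12   # assumption: 4 KiB pages
LINE_SHIFT = 6    # assumption: 64-byte cache lines
ROW_SHIFT = 13    # assumption: 8 KiB DRAM rows, single bank


@dataclass
class ToyMemoryController:
    """Hypothetical single-bank controller with translation-triggered prefetch."""
    memory: dict = field(default_factory=dict)        # physical address -> PTE value
    open_row: Optional[int] = None                    # row currently in the row buffer
    prefetched_lines: set = field(default_factory=set)
    row_hits: int = 0                                 # demand accesses that hit the open row
    row_misses: int = 0                               # demand accesses that had to open a row

    def read(self, paddr, is_leaf_pte=False, page_offset=0):
        """Serve a demand read; if it is a leaf PTE, prefetch what it points to."""
        row = paddr >> ROW_SHIFT
        if row == self.open_row:
            self.row_hits += 1
        else:
            self.row_misses += 1
            self.open_row = row
        value = self.memory.get(paddr, 0)
        if is_leaf_pte:
            # Strip the flag bits to get the base of the physical frame,
            # then add the offset of the data the program is about to touch.
            frame_base = value & ~((1 << PAGE_SHIFT) - 1)
            self.prefetch(frame_base | page_offset)
        return value

    def prefetch(self, paddr):
        """Open the target row and record the line as sent toward the caches."""
        self.open_row = paddr >> ROW_SHIFT
        self.prefetched_lines.add(paddr >> LINE_SHIFT)


# Usage: a page walk whose leaf PTE lives in DRAM, followed by the data access.
mc = ToyMemoryController()
pte_addr = 0x1000_0000
mc.memory[pte_addr] = 0x42000 | 0x3          # frame base 0x42000 plus two flag bits
mc.read(pte_addr, is_leaf_pte=True, page_offset=0x1C0)
data_addr = 0x42000 | 0x1C0
mc.read(data_addr)                           # finds its row already open
print(mc.row_misses, mc.row_hits)
```

In this toy run the page table read misses the row buffer, while the following data read hits it because the prefetch already opened the row. A real controller would have multiple banks and channels, and would need to decide when a prefetch is worth the bandwidth; none of that is modeled here.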
Translation-Triggered Prefetching
ASPLOS '17: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems