Abstract
Power efficiency has become one of the most important design constraints for high-performance systems. In this paper, we revisit the design of low-power virtually-addressed caches. While virtually-addressed caches enable significant power savings by obviating the need for Translation Lookaside Buffer (TLB) lookups, they suffer from several challenging design issues that curtail their widespread commercial adoption. We focus on one of these challenges: cache flushes due to virtual page remappings. We use detailed studies on an ARM many-core server to show that this problem degrades performance by up to 25 percent for a mix of multi-programmed and multi-threaded workloads. Interestingly, we observe that many of these flushes are spurious, caused by an indiscriminate invalidation broadcast on the ARM architecture. In response, we propose a low-overhead and readily implementable hardware mechanism that uses Bloom filters to reduce spurious invalidations and mitigate their ill effects.
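The abstract does not describe the filter's organization, so the following is only a minimal software sketch of the general idea, under stated assumptions: each core keeps a Bloom filter of the virtual page numbers it has cached, and a broadcast invalidation triggers a flush only when the filter reports a possible match. The names `PageBloomFilter` and `VirtualCacheCore`, the filter size, the hash construction, and the choice to clear the filter after a flush are all hypothetical and are not taken from the paper.

```python
import hashlib


class PageBloomFilter:
    """Small Bloom filter over virtual page numbers (illustrative only)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # Sizes are arbitrary assumptions, not values from the paper.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _indexes(self, vpn):
        # Derive num_hashes bit positions from one SHA-256 digest of the VPN.
        digest = hashlib.sha256(vpn.to_bytes(8, "little")).digest()
        for i in range(self.num_hashes):
            chunk = digest[4 * i: 4 * i + 4]
            yield int.from_bytes(chunk, "little") % self.num_bits

    def insert(self, vpn):
        for idx in self._indexes(vpn):
            self.bits[idx] = 1

    def may_contain(self, vpn):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[idx] for idx in self._indexes(vpn))

    def clear(self):
        self.bits = bytearray(self.num_bits)


class VirtualCacheCore:
    """Hypothetical per-core model: counts flushes taken and flushes avoided."""

    def __init__(self):
        self.filter = PageBloomFilter()
        self.flushes = 0
        self.filtered = 0

    def on_fill(self, vpn):
        # Record that a line from this virtual page entered the cache.
        self.filter.insert(vpn)

    def on_invalidation(self, vpn):
        # Flush only if the page may be cached here; otherwise the
        # broadcast is spurious for this core and can be ignored.
        if self.filter.may_contain(vpn):
            self.flushes += 1
            self.filter.clear()  # assumption: a flush empties the cache
            return True
        self.filtered += 1
        return False


core = VirtualCacheCore()
core.on_fill(0x1000)
flushed_first = core.on_invalidation(0x1000)   # page was cached: flush
flushed_second = core.on_invalidation(0x1000)  # filter now empty: skipped
```

Because a Bloom filter has no false negatives, a core in this sketch never skips an invalidation for a page it actually holds. A false positive only costs an unnecessary flush, which is the behavior the unfiltered broadcast already has.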