research-article

TLB Shootdown Mitigation for Low-Power Many-Core Servers with L1 Virtual Caches

Published: 01 January 2018

Abstract

Power efficiency has become one of the most important design constraints for high-performance systems. In this paper, we revisit the design of low-power virtually-addressed caches. While virtually-addressed caches enable significant power savings by obviating the need for Translation Lookaside Buffer (TLB) lookups, they suffer from several challenging design issues that curtail their widespread commercial adoption. We focus on one of these challenges: cache flushes due to virtual page remappings. We use detailed studies on an ARM many-core server to show that this problem degrades performance by up to 25 percent for a mix of multi-programmed and multi-threaded workloads. Interestingly, we observe that many of these flushes are spurious and are caused by an indiscriminate invalidation broadcast on the ARM architecture. In response, we propose a low-overhead and readily implementable hardware mechanism that uses Bloom filters to reduce spurious invalidations and mitigate their ill effects.
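The abstract does not spell out the hardware design, so the following is only a minimal software sketch of the general idea: each core keeps a Bloom filter of the virtual pages that may be resident in its virtual L1 cache, and consults it when an invalidation broadcast arrives. All names (`BloomFilter`, `CoreInvalidationFilter`), the 4 KiB page size, the filter dimensions, and the hashing scheme are assumptions made for illustration, not details taken from the paper.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # Sizes are illustrative assumptions, not values from the paper.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # Python int used as a bit vector

    def _positions(self, key):
        # Derive num_hashes bit positions from salted BLAKE2b digests.
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(
                key.to_bytes(8, "little"), salt=bytes([i])
            ).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

    def clear(self):
        self.bits = 0


class CoreInvalidationFilter:
    """Hypothetical per-core filter in front of a virtual L1 cache."""

    PAGE_SHIFT = 12  # assumes 4 KiB pages

    def __init__(self):
        self.filter = BloomFilter()
        self.flushes = 0   # invalidations that caused a cache flush
        self.filtered = 0  # invalidations ignored as spurious

    def on_cache_fill(self, vaddr):
        # Record the virtual page of every line brought into the cache.
        self.filter.add(vaddr >> self.PAGE_SHIFT)

    def on_invalidation_broadcast(self, vaddr):
        """Return True if the cache must be flushed, False if filtered."""
        vpn = vaddr >> self.PAGE_SHIFT
        if not self.filter.might_contain(vpn):
            self.filtered += 1  # page was never cached here: spurious
            return False
        # Possible hit: flush the cache and reset the filter with it.
        self.filter.clear()
        self.flushes += 1
        return True
```

Because a Bloom filter has no false negatives, a core never skips an invalidation for a page it actually cached; a false positive only costs an unnecessary flush, which is the behavior the unfiltered broadcast would have had anyway. A plain Bloom filter cannot delete entries, so this sketch resets it only when the cache is flushed.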



Published in: IEEE Computer Architecture Letters, Volume 17, Issue 1, January 2018, 99 pages

Copyright © 2018

Publisher: IEEE Computer Society, United States
