research-article

LATR: Lazy Translation Coherence

Published: 19 March 2018

Abstract

We propose LATR (lazy TLB coherence), a software-based TLB shootdown mechanism that alleviates the overhead of the synchronous TLB shootdown mechanism in existing operating systems. By handling TLB coherence lazily, LATR avoids both the expensive inter-processor interrupts (IPIs) required to deliver a shootdown signal to remote cores and the performance overhead of the associated interrupt handlers. Virtual memory operations such as free and page migration therefore benefit significantly from LATR's mechanism. For example, LATR improves the latency of munmap() by 70.8% on a 2-socket machine, a widely used configuration in modern data centers. Real-world, performance-critical applications such as web servers also benefit: without any application-level changes, LATR improves Apache by 59.9% compared to Linux, and by 37.9% compared to ABIS, a highly optimized, state-of-the-art TLB coherence technique.
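
To make the contrast between synchronous and lazy shootdown concrete, the toy model below counts the interrupts each approach would send. It is only an illustrative sketch and not LATR's implementation: the abstract does not describe the mechanism's internals, so every name here (ToyMachine, unmap_sync, unmap_lazy, tick) is hypothetical. Two further points are assumptions: that remote cores pick up pending invalidations at a later point of their own (modeled by an explicit tick call), and that a page is reclaimed only after every core has dropped its entry.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    """One CPU core with a private TLB (virtual page -> physical frame)."""
    cid: int
    tlb: dict = field(default_factory=dict)


@dataclass
class PendingInvalidation:
    """A deferred invalidation that remote cores apply on their own."""
    vpn: int
    waiting: set  # ids of cores that have not yet dropped the entry


class ToyMachine:
    """Hypothetical model; not the data structures used by LATR."""

    def __init__(self, n_cores):
        self.cores = [Core(i) for i in range(n_cores)]
        self.pending = []    # shared list of deferred invalidations
        self.ipis_sent = 0   # interrupts delivered to remote cores
        self.reclaimed = []  # pages whose frames are safe to reuse

    def map_page(self, vpn, pfn):
        # Pretend every core has touched the page and cached its translation.
        for core in self.cores:
            core.tlb[vpn] = pfn

    def unmap_sync(self, initiator, vpn):
        """Synchronous shootdown: interrupt every remote core right away."""
        for core in self.cores:
            if core.cid != initiator:
                self.ipis_sent += 1  # models one IPI plus its handler
            core.tlb.pop(vpn, None)
        self.reclaimed.append(vpn)   # no stale entries remain

    def unmap_lazy(self, initiator, vpn):
        """Lazy variant: invalidate locally and record the rest for later."""
        self.cores[initiator].tlb.pop(vpn, None)
        waiting = {c.cid for c in self.cores if c.cid != initiator}
        if waiting:
            self.pending.append(PendingInvalidation(vpn, waiting))
        else:
            self.reclaimed.append(vpn)

    def tick(self, cid):
        """Assumed later point at which core `cid` applies pending work."""
        for rec in self.pending:
            if cid in rec.waiting:
                self.cores[cid].tlb.pop(rec.vpn, None)
                rec.waiting.discard(cid)
        # Reclaim pages once no core can still hold a stale translation.
        self.reclaimed.extend(r.vpn for r in self.pending if not r.waiting)
        self.pending = [r for r in self.pending if r.waiting]
```

In this model, unmap_sync on a 4-core machine increments ipis_sent by three, while unmap_lazy sends no interrupts and leaves the page unreclaimed until each remote core has called tick. The trade-off the sketch makes visible is that the lazy path takes interrupts off the critical path of the unmapping core at the cost of delaying when the freed page can be reused.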

  • Published in

    ACM SIGPLAN Notices, Volume 53, Issue 2
    ASPLOS '18
    February 2018
    809 pages
    ISSN: 0362-1340
    EISSN: 1558-1160
    DOI: 10.1145/3296957

    ASPLOS '18: Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems
    March 2018
    827 pages
    ISBN: 9781450349116
    DOI: 10.1145/3173162

    Copyright © 2018 ACM

    Publisher

    Association for Computing Machinery
    New York, NY, United States
