Abstract
We propose LATR (lazy TLB coherence), a software-based TLB shootdown mechanism that alleviates the overhead of the synchronous TLB shootdown mechanism in existing operating systems. By handling TLB coherence lazily, LATR can avoid the expensive inter-processor interrupts (IPIs) required to deliver a shootdown signal to remote cores, as well as the performance overhead of the associated interrupt handlers. Virtual memory operations such as free and page migration therefore benefit significantly from LATR's mechanism. For example, LATR improves the latency of munmap() by 70.8% on a 2-socket machine, a configuration widely used in modern data centers. Real-world, performance-critical applications such as web servers also benefit: without any application-level changes, LATR improves Apache by 59.9% compared to Linux, and by 37.9% compared to ABIS, a highly optimized, state-of-the-art TLB coherence technique.
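The contrast between synchronous and lazy shootdown can be illustrated with a toy model. The following is only a sketch under stated assumptions, not LATR's implementation: the `Core` class, the per-core `pending` queue, the `tick()` hook, and the `safe_to_reuse()` check are all hypothetical names introduced here, and the abstract does not specify when or how remote cores apply deferred invalidations.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Core:
    """Toy model of one CPU core with a private TLB (hypothetical)."""
    core_id: int
    tlb: set = field(default_factory=set)          # cached virtual page numbers
    pending: deque = field(default_factory=deque)  # deferred invalidations
    interrupts: int = 0                            # IPIs handled by this core

    def tick(self):
        """Assumed periodic local event: apply all deferred invalidations."""
        while self.pending:
            self.tlb.discard(self.pending.popleft())


def sync_shootdown(cores, initiator, page):
    """Synchronous baseline: every remote core is interrupted immediately."""
    for core in cores:
        if core is not initiator:
            core.interrupts += 1  # models the IPI and its handler
        core.tlb.discard(page)


def lazy_shootdown(cores, initiator, page):
    """Lazy variant: invalidate locally and queue the work for remote cores."""
    initiator.tlb.discard(page)
    for core in cores:
        if core is not initiator:
            core.pending.append(page)  # no interrupt is sent


def safe_to_reuse(cores, page):
    """The page may be reused only once no core can still translate it."""
    return all(page not in c.tlb and page not in c.pending for c in cores)
```

In this model, calling `lazy_shootdown` on four cores sends no interrupts, whereas `sync_shootdown` interrupts the three remote cores. The trade-off is that the freed page stays unavailable until every remote core has run `tick()`, which is why the sketch includes the `safe_to_reuse` check.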
LATR: Lazy Translation Coherence
ASPLOS '18: Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems