Abstract
Accelerators are increasingly recognized as one of the major drivers of future computational growth. For accelerators, shared virtual memory (VM) promises to simplify programming and provide safe data sharing with CPUs. Unfortunately, the overheads of virtual memory, which are high for general-purpose processors, are even higher for accelerators. Providing accelerators with direct access to physical memory (PM), in contrast, delivers high performance but is both unsafe and more difficult to program. We propose Devirtualized Memory (DVM) to combine the protection of VM with direct access to PM. By allocating memory such that physical and virtual addresses are almost always identical (VA==PA), DVM mostly replaces page-level address translation with faster region-level Devirtualized Access Validation (DAV). On read accesses, DAV can optionally be overlapped with the data fetch to hide VM overheads. DVM requires modest OS and IOMMU changes and is transparent to the application. Implemented in Linux 4.10, DVM reduces VM overheads in a graph-processing accelerator to just 1.6% on average. DVM also improves performance by 2.1X over an optimized conventional VM implementation, while consuming 3.9X less dynamic energy for memory management. We further discuss DVM's potential to extend beyond accelerators to CPUs, where it reduces VM overheads to 5% on average, down from 29% for conventional VM.
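The core idea, validating an access against a region instead of translating it page by page, can be illustrated with a small software model. This is a minimal sketch under stated assumptions, not the paper's hardware or OS design: the `Region` record, the `access` function, the dictionary used as a page table, and the 4 KiB page size are all hypothetical names and choices introduced here for illustration. The optional overlap of validation with the data fetch on reads is not modeled.

```python
from dataclasses import dataclass


class AccessFault(Exception):
    """Raised when an access fails validation (hypothetical name)."""


@dataclass(frozen=True)
class Region:
    start: int              # first byte of the region (inclusive)
    end: int                # one past the last byte (exclusive)
    perms: str              # allowed operations, a subset of "rw"
    identity: bool = True   # True when VA == PA across the whole region


def access(regions, page_table, va, op, page_size=4096):
    """Validate one access and return the physical address to use."""
    for region in regions:
        if region.start <= va < region.end:
            if op not in region.perms:
                raise AccessFault(f"{op!r} not permitted at {va:#x}")
            if region.identity:
                # Identity-mapped: validation alone suffices, PA is VA.
                return va
            # Fallback: conventional page-level translation.
            vpn, offset = divmod(va, page_size)
            if vpn not in page_table:
                raise AccessFault(f"no mapping for {va:#x}")
            return page_table[vpn] * page_size + offset
    raise AccessFault(f"{va:#x} is not in any region")


# Illustrative data: one identity-mapped region, one that is not.
regions = [
    Region(0x10000, 0x20000, "rw"),
    Region(0x40000, 0x41000, "r", identity=False),
]
page_table = {0x40: 0x99}  # virtual page 0x40 -> physical frame 0x99

print(hex(access(regions, page_table, 0x10008, "w")))  # identity path
print(hex(access(regions, page_table, 0x40010, "r")))  # translated path
```

In this model the common case touches only the region list, which is far smaller than a per-page table; the page table is consulted only for the rare region where VA and PA differ.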
Devirtualizing Memory in Heterogeneous Systems
ASPLOS '18: Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems