Abstract
To harness a heterogeneous memory hierarchy, it is advantageous to let application knowledge guide frequent memory moves, i.e., replicating or migrating virtual memory regions. To this end, we present memif, a protected OS service for asynchronous, hardware-accelerated memory move. Compared to the state of the art, page migration in Linux, memif incurs lower overhead and lower latency. To achieve this, it not only redefines the semantics of the kernel interface but also overhauls the underlying mechanisms, including request/completion management, race handling, and DMA engine configuration. We implement memif in Linux for a server-class system-on-chip that features heterogeneous memories. Compared to current Linux page migration, memif reduces CPU usage by up to 15% for small pages and by up to 38x for large pages; when continuously serving requests, memif needs no request batching and reduces latency by up to 63%. By crafting a small runtime atop memif, we improve the throughput of a set of streaming workloads by up to 33%. Overall, memif opens the door to software management of heterogeneous memory.
memif: Towards Programming Heterogeneous Memory Asynchronously
ASPLOS '16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems