research-article

memif: Towards Programming Heterogeneous Memory Asynchronously

Published: 25 March 2016

Abstract

To harness a heterogeneous memory hierarchy, it is advantageous to let application knowledge guide frequent memory moves, i.e., the replication or migration of virtual memory regions. To this end, we present memif, a protected OS service for asynchronous, hardware-accelerated memory move. Compared to the state of the art, page migration in Linux, memif incurs lower overhead and lower latency. To achieve this, it not only redefines the semantics of the kernel interface but also overhauls the underlying mechanisms, including request/completion management, race handling, and DMA engine configuration. We implement memif in Linux for a server-class system-on-chip that features heterogeneous memories. Compared to current Linux page migration, memif reduces CPU usage by up to 15% for small pages and by up to 38x for large pages; when continuously serving requests, memif needs no request batching and reduces latency by up to 63%. By building a small runtime atop memif, we improve the throughput of a set of streaming workloads by up to 33%. Overall, memif opens the door to software management of heterogeneous memory.


    • Published in

      ACM SIGPLAN Notices, Volume 51, Issue 4
      ASPLOS '16
      April 2016
      774 pages
      ISSN: 0362-1340
      EISSN: 1558-1160
      DOI: 10.1145/2954679
      Editor: Andy Gill

      ASPLOS '16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems
      March 2016
      824 pages
      ISBN: 9781450340915
      DOI: 10.1145/2872362
      General Chair: Tom Conte
      Program Chair: Yuanyuan Zhou

      Copyright © 2016 ACM

      Publisher: Association for Computing Machinery, New York, NY, United States

