research-article

Octopus+: An RDMA-Enabled Distributed Persistent Memory File System

Published: 16 August 2021

Abstract

Non-volatile memory and remote direct memory access (RDMA) provide extremely high performance in storage and network hardware. However, existing distributed file systems strictly isolate the file system and network layers, and their heavily layered software designs leave this high-speed hardware under-exploited.

In this article, we propose Octopus+, an RDMA-enabled distributed persistent memory file system that redesigns file system internal mechanisms by closely coupling non-volatile memory and RDMA features. For data operations, Octopus+ directly accesses a shared persistent memory pool to reduce memory copying overhead, and lets clients actively fetch and push data themselves to rebalance the load between the server and the network. For metadata operations, Octopus+ introduces self-identified remote procedure calls for immediate notification between the file system and the network, and an efficient distributed transaction mechanism for consistency. Octopus+ also supports replication to provide better availability. Evaluations on Intel Optane DC Persistent Memory Modules show that Octopus+ achieves nearly the raw bandwidth for large I/Os and performs orders of magnitude better than existing distributed file systems.

Published in

ACM Transactions on Storage, Volume 17, Issue 3
August 2021
227 pages
ISSN: 1553-3077
EISSN: 1553-3093
DOI: 10.1145/3477268
Editor: Sam H. Noh

Copyright © 2021 Association for Computing Machinery.

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 16 August 2021
• Accepted: 1 January 2021
• Revised: 1 November 2020
• Received: 1 July 2020

Qualifiers

• research-article
• Refereed
