Abstract
Non-volatile memory and remote direct memory access (RDMA) provide extremely high performance in storage and network hardware. However, existing distributed file systems strictly isolate file system and network layers, and the heavy layered software designs leave high-speed hardware under-exploited.
In this article, we propose Octopus+, an RDMA-enabled distributed persistent memory file system that redesigns file system internal mechanisms by closely coupling non-volatile memory and RDMA features. For data operations, Octopus+ directly accesses a shared persistent memory pool to reduce memory copying overhead, and lets clients actively fetch and push data to rebalance the load between the server and the network. For metadata operations, Octopus+ introduces self-identified remote procedure calls for immediate notification between the file system and the network layer, and an efficient distributed transaction mechanism for consistency. Octopus+ also supports replication to provide better availability. Evaluations on Intel Optane DC Persistent Memory Modules show that Octopus+ achieves nearly the raw bandwidth for large I/Os and orders of magnitude better performance than existing distributed file systems.
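The abstract does not spell out how a remote procedure call becomes "self-identified". One plausible reading, sketched below as a pure-Python simulation, is that each request carries the sender's identity in a small tag (for example, the 32-bit immediate value of an RDMA write-with-immediate), so the server can locate the message directly from the completion event instead of scanning every client's buffer. The bit layout, the `Server` class, and all function names here are hypothetical and chosen only for illustration; they are not taken from the Octopus+ implementation.

```python
NODE_BITS = 8    # assumption: high bits of the 32-bit tag hold the sender's node ID
SLOT_BITS = 24   # assumption: low bits hold the sender's message-slot index


def pack_imm(node_id: int, slot: int) -> int:
    """Pack a sender ID and slot index into one 32-bit tag."""
    if not 0 <= node_id < (1 << NODE_BITS):
        raise ValueError("node_id out of range")
    if not 0 <= slot < (1 << SLOT_BITS):
        raise ValueError("slot out of range")
    return (node_id << SLOT_BITS) | slot


def unpack_imm(imm: int) -> tuple[int, int]:
    """Recover (node_id, slot) from a 32-bit tag."""
    return imm >> SLOT_BITS, imm & ((1 << SLOT_BITS) - 1)


class Server:
    """Simulated server with one message buffer per client node."""

    def __init__(self, nodes: int, slots_per_node: int) -> None:
        self.buffers = [[None] * slots_per_node for _ in range(nodes)]

    def remote_write(self, imm: int, payload: bytes) -> int:
        # Stands in for a client's remote write: the payload lands in the
        # sender's slot, and the tag is returned as the completion event.
        node_id, slot = unpack_imm(imm)
        self.buffers[node_id][slot] = payload
        return imm

    def handle_completion(self, imm: int) -> tuple[int, bytes]:
        # The tag alone identifies the sender and the slot, so no scan
        # over the other clients' buffers is needed.
        node_id, slot = unpack_imm(imm)
        return node_id, self.buffers[node_id][slot]


server = Server(nodes=4, slots_per_node=16)
event = server.remote_write(pack_imm(node_id=3, slot=5), b"mkdir /a")
sender, request = server.handle_completion(event)
```

Under these assumptions, the server's lookup cost is independent of the number of connected clients, which is the property the notification mechanism is meant to provide.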
Octopus+: An RDMA-Enabled Distributed Persistent Memory File System