Abstract
Making logical copies, or clones, of files and directories is critical to many real-world applications and workflows, including backups, virtual machines, and containers. An ideal clone implementation meets the following performance goals: (1) creating the clone has low latency; (2) reads are fast in all versions (i.e., spatial locality is always maintained, even after modifications); (3) writes are fast in all versions; (4) the overall system is space efficient. Implementing a clone operation that realizes all four properties, which we call a nimble clone, is a long-standing open problem.
This article describes nimble clones in the Bε-tree File System (BetrFS), an open-source, full-path-indexed, write-optimized file system. The key observation behind our work is that standard copy-on-write heuristics can be too coarse to be space efficient or too fine-grained to preserve locality. By contrast, a write-optimized key-value store, such as a Bε-tree or a log-structured merge-tree (LSM-tree), can decouple the logical application of updates from the granularity at which data is physically copied. In our write-optimized clone implementation, data sharing among clones is broken only when a clone has changed enough to warrant making a copy, a policy we call copy-on-abundant-write.
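To make the copy-on-abundant-write policy concrete, here is a minimal toy sketch. It is not the BetrFS implementation, which applies the policy inside its Bε-tree rather than with per-node dictionaries. The names `Node`, `View`, `CopyOnAbundantWrite`, and the `threshold` parameter are hypothetical. A clone shares a physically contiguous node and buffers its changes as logical updates. It pays for one contiguous copy only once those buffered changes cover a sizable fraction of the node.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A physically contiguous node (think: one large on-disk leaf)."""
    data: dict
    refcount: int = 0


@dataclass
class View:
    """One clone's logical view: a possibly shared node plus buffered private updates."""
    node: Node
    pending: dict = field(default_factory=dict)

    def read(self, key):
        # Buffered updates shadow the shared node, so reads see this clone's latest value.
        if key in self.pending:
            return self.pending[key]
        return self.node.data.get(key)


class CopyOnAbundantWrite:
    """Toy policy (hypothetical): a clone keeps sharing its node until its buffered
    changes cover at least `threshold` of the node's entries."""

    def __init__(self, threshold=0.25):
        self.threshold = threshold
        self.copies_made = 0

    def new_view(self, data):
        return View(Node(dict(data), refcount=1))

    def clone(self, view):
        # Cloning does not copy node data: bump a refcount and copy only the small buffer.
        view.node.refcount += 1
        return View(view.node, dict(view.pending))

    def write(self, view, key, value):
        if view.node.refcount == 1:
            # Unshared node: fold any buffered updates and this write in place.
            view.node.data.update(view.pending)
            view.pending.clear()
            view.node.data[key] = value
            return
        # Shared node: record the change logically instead of copying immediately.
        view.pending[key] = value
        if len(view.pending) >= self.threshold * max(1, len(view.node.data)):
            self._break_sharing(view)

    def _break_sharing(self, view):
        # The clone has changed "abundantly": make one contiguous private copy now.
        fresh = Node(dict(view.node.data), refcount=1)
        fresh.data.update(view.pending)
        view.node.refcount -= 1
        view.node = fresh
        view.pending = {}
        self.copies_made += 1
```

With `threshold=0.25` and an 8-entry node, a clone's first write is only buffered and the node stays shared. The second write brings the changes to 2 of 8 entries and triggers exactly one contiguous copy. A threshold of 0 degenerates to ordinary copy-on-write, copying on the first change. A very large threshold keeps sharing indefinitely, so reads must merge an ever-growing buffer with the stale shared node.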
We demonstrate that the algorithmic work needed to batch and amortize the cost of BetrFS clone operations does not erode the performance advantages of baseline BetrFS; in a few cases, BetrFS performance even improves. BetrFS cloning is efficient. For example, when the clone operation is used for container creation, BetrFS outperforms a simple recursive copy by up to two orders of magnitude and outperforms file systems that have specialized Linux Containers (LXC) backends by 3–4×.
Copy-on-Abundant-Write for Nimble File System Clones