BetrFS: Write-Optimization in a Kernel File System

Published: 04 November 2015

Abstract

The Bε-tree File System, or BetrFS (pronounced “better eff ess”), is the first in-kernel file system to use a write-optimized data structure (WODS). WODS are promising building blocks for storage systems because they support both microwrites and large scans efficiently. Previous WODS-based file systems have shown promise but have been hampered in several ways, which BetrFS mitigates or eliminates altogether. For example, previous WODS-based file systems were implemented in user space using FUSE, which superimposes many reads on a write-intensive workload, reducing the effectiveness of the WODS. This article also contributes several techniques for exploiting write-optimization within existing kernel infrastructure. BetrFS dramatically improves the performance of certain types of large scans, such as recursive directory traversals, as well as the performance of arbitrary microdata operations, such as file creates, metadata updates, and small writes to files. BetrFS can make small, random updates within a large file two orders of magnitude faster than other local file systems. BetrFS is an ongoing prototype effort and requires additional data-structure tuning to match current general-purpose file systems on some operations, including deletes, directory renames, and large sequential writes. Nonetheless, many applications realize significant performance improvements on BetrFS. For instance, an in-place rsync of the Linux kernel source sees a speedup of roughly 1.6–22× over commodity file systems.


Published in

ACM Transactions on Storage, Volume 11, Issue 4
Special Issue USENIX FAST 2015
November 2015
141 pages
ISSN: 1553-3077
EISSN: 1553-3093
DOI: 10.1145/2836327

Copyright © 2015 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 4 November 2015
• Accepted: 1 June 2015
• Received: 1 May 2015

Qualifiers

• research-article
• Research
• Refereed
