Abstract
The Bε-tree File System, or BetrFS (pronounced “better eff ess”), is the first in-kernel file system to use a write-optimized data structure (WODS). WODS are promising building blocks for storage systems because they support both microwrites and large scans efficiently. Previous WODS-based file systems have shown promise but have been hampered in several ways, which BetrFS mitigates or eliminates altogether. For example, previous WODS-based file systems were implemented in user space using FUSE, which superimposes many reads on a write-intensive workload, reducing the effectiveness of the WODS. This article also contributes several techniques for exploiting write-optimization within existing kernel infrastructure. BetrFS dramatically improves the performance of certain types of large scans, such as recursive directory traversals, as well as the performance of arbitrary microdata operations, such as file creates, metadata updates, and small writes to files. BetrFS can make small, random updates within a large file two orders of magnitude faster than other local file systems. BetrFS is an ongoing prototype effort and requires additional data-structure tuning to match current general-purpose file systems on some operations, including deletes, directory renames, and large sequential writes. Nonetheless, many applications realize significant performance improvements on BetrFS. For instance, an in-place rsync of the Linux kernel source sees a speedup of roughly 1.6--22× over commodity file systems.
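To make the idea of write-optimization concrete, the sketch below shows a toy dictionary that absorbs small writes in a buffer and applies them to leaves in batches, while still answering point queries and ordered range scans. It is an illustration only, not the Bε-tree or the BetrFS implementation: the class name `ToyBufferedDict`, its parameters, and the `leaf_writes` counter (a stand-in for leaf I/Os) are all hypothetical, and a real Bε-tree has buffers at every internal node rather than a single root buffer.

```python
from collections import defaultdict


class ToyBufferedDict:
    """Toy write-optimized dictionary: one root buffer over fixed key-range leaves.

    Hypothetical illustration; not the BetrFS or Bε-tree implementation.
    """

    def __init__(self, num_leaves=4, key_space=1024, buffer_capacity=64):
        self.leaf_width = key_space // num_leaves
        self.leaves = [dict() for _ in range(num_leaves)]
        self.buffer = {}                 # pending key -> value messages
        self.buffer_capacity = buffer_capacity
        self.leaf_writes = 0             # simulated count of leaf I/Os

    def _leaf_index(self, key):
        return min(key // self.leaf_width, len(self.leaves) - 1)

    def put(self, key, value):
        # A small write only touches the buffer; leaves are untouched for now.
        self.buffer[key] = value
        if len(self.buffer) >= self.buffer_capacity:
            self.flush()

    def flush(self):
        # Group pending messages by destination leaf, then write each leaf once.
        batches = defaultdict(dict)
        for key, value in self.buffer.items():
            batches[self._leaf_index(key)][key] = value
        for index, batch in batches.items():
            self.leaves[index].update(batch)
            self.leaf_writes += 1
        self.buffer.clear()

    def get(self, key):
        # The buffer holds the newest values, so it is checked first.
        if key in self.buffer:
            return self.buffer[key]
        return self.leaves[self._leaf_index(key)].get(key)

    def scan(self, lo, hi):
        # Range scan: merge leaf contents with pending messages, in key order.
        merged = {}
        for leaf in self.leaves:
            merged.update(leaf)
        merged.update(self.buffer)
        return [(k, merged[k]) for k in sorted(merged) if lo <= k < hi]


store = ToyBufferedDict()
for i in range(256):
    store.put((i * 37) % 1024, i)        # 256 distinct, scattered keys

print("leaf writes:", store.leaf_writes)
print("first entries:", store.scan(0, 64)[:3])
```

With these assumed parameters, the 256 scattered writes trigger four flushes of at most four leaf writes each, so the counter stays at or below 16, whereas updating a leaf on every write would cost 256. The trade-off is visible in `get` and `scan`, which must consult pending messages as well as the leaves.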
Index Terms
BetrFS: Write-Optimization in a Kernel File System