Abstract
Full-path indexing can improve I/O efficiency for workloads that operate on data organized using traditional, hierarchical directories, because data is placed on persistent storage in scan order. Prior results indicate, however, that renames in a local file system with full-path indexing are prohibitively expensive.
This article shows how to use full-path indexing in a file system to realize fast directory scans, writes, and renames. The article introduces a range-rename mechanism for efficient key-space changes in a write-optimized dictionary. This mechanism is encapsulated in the key-value Application Programming Interface (API) and simplifies the overall file system design.
We implemented this mechanism in the Bε-tree File System (BetrFS), an in-kernel, local file system for Linux. This new version, BetrFS 0.4, performs recursive greps 1.5x faster and random writes 1.2x faster than BetrFS 0.3, while its renames are competitive with indirection-based file systems for a range of sizes. BetrFS 0.4 outperforms BetrFS 0.3, as well as traditional file systems, such as ext4, Extents File System (XFS), and Z File System (ZFS), across a variety of workloads.
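To make the idea of full-path indexing and a range rename concrete, the following is a minimal sketch and not the BetrFS implementation. It assumes a toy in-memory store whose keys are full file paths kept in sorted order, so that a directory's contents form one contiguous key range. The class `FullPathIndex` and its methods are hypothetical names chosen for illustration. The sketch rewrites each key individually, which is the naive cost the article's range-rename mechanism is designed to avoid inside a write-optimized dictionary; it only shows what a range rename does to the key space.

```python
import bisect


class FullPathIndex:
    """Toy full-path-indexed store (hypothetical, for illustration only).

    Keys are full file paths held in sorted order, so all files under a
    directory occupy one contiguous range of keys.
    """

    def __init__(self):
        self._keys = []    # sorted list of full paths
        self._values = {}  # full path -> value

    def put(self, path, value):
        if path not in self._values:
            bisect.insort(self._keys, path)
        self._values[path] = value

    def get(self, path):
        return self._values.get(path)

    def _subtree_bounds(self, directory):
        # Every key under `directory` starts with "directory/".
        prefix = directory.rstrip("/") + "/"
        lo = bisect.bisect_left(self._keys, prefix)
        # "0" is the character immediately after "/" in ASCII, so
        # "directory0" is an exclusive upper bound for the prefix range.
        hi = bisect.bisect_left(self._keys, prefix[:-1] + "0")
        return prefix, lo, hi

    def scan(self, directory):
        """Return all paths under `directory` in key (scan) order."""
        _, lo, hi = self._subtree_bounds(directory)
        return self._keys[lo:hi]

    def range_rename(self, src, dst):
        """Move every key with prefix src/ to prefix dst/ (naive, O(n))."""
        src_prefix, lo, hi = self._subtree_bounds(src)
        dst_prefix = dst.rstrip("/") + "/"
        moved = self._keys[lo:hi]
        del self._keys[lo:hi]
        for old in moved:
            new = dst_prefix + old[len(src_prefix):]
            self.put(new, self._values.pop(old))
        return len(moved)


index = FullPathIndex()
index.put("/home/a/x.txt", 1)
index.put("/home/a/sub/y.txt", 2)
index.put("/home/ab/z.txt", 3)
index.put("/home/b/w.txt", 4)

before = index.scan("/home/a")
moved_count = index.range_rename("/home/a", "/home/c")
after = index.scan("/home/c")
```

Because the keys are sorted by full path, the scan of `/home/a` touches one contiguous slice and does not pick up the sibling `/home/ab`. The rename changes only the shared prefix of the keys in that slice, which is why it can be described as a single operation on a key range.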
Efficient Directory Mutations in a Full-Path-Indexed File System