Abstract
BTRFS is a Linux filesystem that has been adopted as the default filesystem in some popular Linux distributions. It is based on copy-on-write, which allows for efficient snapshots and clones, and it uses B-trees as its main on-disk data structure. The design goal is to work well across many use cases and workloads. To this end, much effort has gone into keeping performance steady as the filesystem ages, rather than optimizing for a narrow benchmark use case.
Linux filesystems are installed on smartphones as well as enterprise servers. This entails challenges on many different fronts.
- Scalability. The filesystem must scale in many dimensions: disk space, memory, and CPUs.
- Data integrity. Losing data is not an option, and much effort is expended to safeguard the content. This includes checksums, metadata duplication, and RAID support built into the filesystem.
- Disk diversity. The system should work well with both SSDs and hard disks. It is also expected to be able to use an array of differently sized disks, which poses challenges to the RAID and striping mechanisms.
This article describes the core ideas, data structures, and algorithms of this filesystem. It sheds light on the challenges posed by defragmentation in the presence of snapshots, and the tradeoffs required to maintain even performance in the face of a wide spectrum of workloads.
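As a rough illustration of why copy-on-write makes snapshots cheap, the sketch below uses a plain binary search tree as a simplified stand-in for a B-tree. It is not the BTRFS on-disk format or algorithm, and the names `Node`, `insert`, and `lookup` are hypothetical. An update copies only the nodes on the path from the root to the changed key, so an older root keeps describing the older state while sharing every untouched subtree.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """Immutable tree node (assumption: a binary tree standing in for a B-tree)."""
    key: int
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root: Optional[Node], key: int, value: str) -> Node:
    """Return a new root; nodes off the search path are shared, never modified."""
    if root is None:
        return Node(key, value)
    if key < root.key:
        return Node(root.key, root.value, insert(root.left, key, value), root.right)
    if key > root.key:
        return Node(root.key, root.value, root.left, insert(root.right, key, value))
    # Same key: replace the value in a fresh copy of this node.
    return Node(key, value, root.left, root.right)


def lookup(root: Optional[Node], key: int) -> Optional[str]:
    """Standard search; works on any root, live or snapshot."""
    while root is not None:
        if key == root.key:
            return root.value
        root = root.left if key < root.key else root.right
    return None


live = None
for k, v in [(50, "a"), (30, "b"), (70, "c")]:
    live = insert(live, k, v)

snapshot = live                # taking a snapshot is just keeping the old root
live = insert(live, 30, "b2")  # copies only the path from the root to key 30
```

After the update, `lookup(snapshot, 30)` still sees the old value while `lookup(live, 30)` sees the new one, and the subtree holding key 70 is the same object in both trees.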
Index Terms
BTRFS: The Linux B-Tree Filesystem