Abstract
Oracle File Storage Service (FSS) is an elastic filesystem provided as a managed NFS service. A pipelined Paxos implementation underpins a scalable block store that provides linearizable multipage limited-size transactions. Above the block store, a scalable B-tree holds filesystem metadata and provides linearizable multikey limited-size transactions. Self-validating B-tree nodes and housekeeping operations performed as separate transactions allow each key in a B-tree transaction to require only one page in the underlying block transaction. The filesystem provides snapshots by using versioned key-value pairs. The system is programmed using a nonblocking lock-free programming style. Presentation servers maintain no persistent local state, which makes them scalable and easy to fail over. A non-scalable Paxos-replicated hash table holds the configuration information required to bootstrap the system. An additional B-tree provides conversational multikey minitransactions for control-plane information. The system throughput can be predicted by comparing an estimate of the network bandwidth needed for replication to the network bandwidth provided by the hardware. Latency on an unloaded system is about 4 times higher than that of a Linux NFS server backed by NVMe, reflecting the cost of replication. FSS has been in production since January 2018 and holds tens of thousands of customer file systems comprising many petabytes of data.
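The abstract's remark that snapshots are built from versioned key-value pairs can be illustrated with a small sketch. This is not the FSS implementation: the class `VersionedKV`, its method names, and the in-memory dictionaries are all hypothetical, chosen only to show the general idea that a snapshot can be a remembered version number rather than a copy of the data.

```python
import bisect


class VersionedKV:
    """Toy versioned key-value store (hypothetical, not FSS's design).

    Every write is stamped with the current version number. Taking a
    snapshot records that number and advances it, so later writes never
    disturb what the snapshot sees.
    """

    def __init__(self):
        self._versions = {}  # key -> ascending list of versions at which it was written
        self._values = {}    # key -> values parallel to _versions (None marks a delete)
        self._current = 0    # version stamped on new writes

    def put(self, key, value):
        vers = self._versions.setdefault(key, [])
        vals = self._values.setdefault(key, [])
        if vers and vers[-1] == self._current:
            vals[-1] = value  # overwrite within the same version
        else:
            vers.append(self._current)
            vals.append(value)

    def delete(self, key):
        self.put(key, None)  # a tombstone hides the key from this version on

    def snapshot(self):
        snap = self._current
        self._current += 1  # later writes get a newer version
        return snap

    def get(self, key, snapshot=None):
        version = self._current if snapshot is None else snapshot
        vers = self._versions.get(key, [])
        # Find the newest write whose version is <= the requested version.
        i = bisect.bisect_right(vers, version)
        return self._values[key][i - 1] if i else None
```

In this sketch, reading through a snapshot returns the value as of the moment the snapshot was taken, while a read without a snapshot returns the latest value. Taking a snapshot costs a single counter increment; old versions would need separate garbage collection, which the sketch omits.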
Everyone Loves File: Oracle File Storage Service