Abstract
Existing storage stacks are top-heavy and expect little from block storage. As a result, new high-level storage abstractions, and new designs for existing abstractions, are difficult to realize: developers must implement complex functionality such as failure atomicity and fine-grained concurrency control from scratch. In this article, we argue that pushing transactional isolation into the block store (in addition to atomicity and durability) is both viable and broadly useful, resulting in simpler high-level storage systems that provide strong semantics without sacrificing performance. We present Isotope, a new block store that supports ACID transactions over block reads and writes. Internally, Isotope uses a new multiversion concurrency control protocol that exploits fine-grained, sub-block parallelism in workloads and offers both strict serializability and snapshot isolation guarantees. We implemented several high-level storage systems over Isotope, including two key-value stores that implement the LevelDB API over a hash table and a B-tree, respectively, and a POSIX file system. We show that Isotope's block-level transactions enable systems that are simple (hundreds of lines of code), robust (i.e., providing ACID guarantees), and fast (e.g., 415 MB/s for random file writes). We also show that these systems can be composed using Isotope, providing applications with transactions that span different high-level constructs such as files, directories, and key-value pairs.
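To make the idea of transactions over block reads and writes more concrete, the following is a minimal in-memory sketch. It is not Isotope's API or its concurrency control protocol: the names `ToyBlockStore`, `ToyTransaction`, `begin`, `read`, `write`, `commit`, and `Conflict` are hypothetical, the store keeps only one version per block, and conflicts are detected at whole-block granularity with a simple optimistic check at commit time (Isotope, per the abstract, works at sub-block granularity and is multiversion).

```python
import threading


class Conflict(Exception):
    """Raised when a transaction fails validation at commit time."""


class ToyBlockStore:
    """Hypothetical in-memory block store; an illustration only."""

    def __init__(self, num_blocks, block_size=4096):
        self.block_size = block_size
        self.blocks = [bytes(block_size) for _ in range(num_blocks)]
        # Commit timestamp of the last write to each block.
        self.versions = [0] * num_blocks
        self.clock = 0
        self.lock = threading.Lock()

    def begin(self):
        # A transaction remembers the commit clock at the time it started.
        with self.lock:
            return ToyTransaction(self, self.clock)


class ToyTransaction:
    def __init__(self, store, start_ts):
        self.store = store
        self.start_ts = start_ts
        self.read_set = set()   # block addresses read from the store
        self.write_buf = {}     # block address -> buffered new contents

    def read(self, addr):
        # Reads see the transaction's own buffered writes first.
        if addr in self.write_buf:
            return self.write_buf[addr]
        self.read_set.add(addr)
        with self.store.lock:
            return self.store.blocks[addr]

    def write(self, addr, data):
        if len(data) != self.store.block_size:
            raise ValueError("writes must cover exactly one block")
        # Writes are buffered and become visible only at commit.
        self.write_buf[addr] = bytes(data)

    def commit(self):
        store = self.store
        with store.lock:
            # Abort if any block we read was overwritten after we started.
            for addr in self.read_set:
                if store.versions[addr] > self.start_ts:
                    raise Conflict(f"block {addr} changed since start")
            # Apply all buffered writes under one new commit timestamp.
            store.clock += 1
            for addr, data in self.write_buf.items():
                store.blocks[addr] = data
                store.versions[addr] = store.clock
```

In this sketch, two transactions that both read block 0 and then write it cannot both commit: the second one to call `commit()` raises `Conflict` and can be retried by the caller. Transactions that touch disjoint blocks commit independently.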
Isotope: ACID Transactions for Block Storage