Abstract
Block storage provides virtual disks that can be mounted by virtual machines (VMs). Although erasure coding (EC) is widely used in cloud storage systems for its high storage efficiency and durability, current EC schemes cannot provide high-performance block storage for the cloud: they introduce significant overhead for small write operations (which update only part of an EC group), whereas cloud-oblivious applications running on VMs are often small-write-intensive. We identify the root cause of the poor partial-write performance of state-of-the-art EC schemes: for each partial write, they must perform a time-consuming write-after-read operation that reads the current value of the data, computes the parity delta, and journals it for later use in “patching” the parity during journal replay.
In this article, we present PARIX, a speculative partial-write scheme that supports fast small writes in erasure-coded storage systems. We transform the original parity-calculation formula so that journal replay computes the parities from data deltas (the differences between the new and original values of the data) instead of parity deltas. This allows PARIX to speculatively log only the new value of the data for each partial write, without reading its original value. For a series of n partial writes to the same data, PARIX performs pure writes (instead of write-after-reads) for the last n − 1 writes, introducing only the small penalty of one extra network round-trip time for the first one. Based on PARIX, we design and implement PARIX Block Storage (PBS), an efficient block storage system that provides high-performance virtual disk service for VMs running cloud-oblivious applications. PBS not only supports fast partial writes but also realizes efficient full writes, background journal replay, and fast failure recovery with strong consistency guarantees. Both microbenchmarks and trace-driven evaluations show that PBS provides efficient block storage and outperforms state-of-the-art EC-based systems by orders of magnitude.
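The intuition behind the data-delta transformation can be sketched with the simplest linear code, a single XOR parity (the paper uses general EC groups; this toy example, including all names in it, is ours, not the authors'). Because the code is linear, the final parity depends only on the original and latest values of an overwritten block, so a run of n overwrites can be logged as pure writes and the parity patched once at replay time:

```python
# Toy sketch (not the paper's implementation): one EC group with k = 3 data
# blocks and a single XOR parity. For any linear code,
#   p_final = p_initial XOR (d_original XOR d_latest),
# so n overwrites to the same block need no read of the old data or parity
# on the write path; only the original and latest values matter at replay.

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

d = [b"\x01" * 4, b"\x02" * 4, b"\x03" * 4]     # data blocks
parity = xor(xor(d[0], d[1]), d[2])             # initial parity

# Speculatively overwrite block 0 three times, logging only the new values.
d0_original = d[0]
for new_val in (b"\x10" * 4, b"\x20" * 4, b"\x30" * 4):
    d[0] = new_val                              # pure write: no read-before-write

# Journal replay: patch the parity once, using the data delta.
parity = xor(parity, xor(d0_original, d[0]))

# The patched parity matches a parity recomputed from scratch.
assert parity == xor(xor(d[0], d[1]), d[2])
```

With parity deltas, each of the three overwrites would have had to read the then-current value of block 0 first; with data deltas, only the first write to a block needs its original value, which is what permits the single extra round trip described above.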
PBS: An Efficient Erasure-Coded Block Storage System Based on Speculative Partial Writes