research-article

PBS: An Efficient Erasure-Coded Block Storage System Based on Speculative Partial Writes

Published: 28 February 2020

Abstract

Block storage provides virtual disks that can be mounted by virtual machines (VMs). Although erasure coding (EC) has been widely used in many cloud storage systems for its high efficiency and durability, current EC schemes cannot provide high-performance block storage for the cloud. This is because they introduce significant overhead to small write operations (which update only part of an EC group), whereas cloud-oblivious applications running on VMs are often small-write-intensive. We identify the root cause of the poor performance of partial writes in state-of-the-art EC schemes: for each partial write, they must perform a time-consuming write-after-read operation that reads the current value of the data and then computes and writes the parity delta, which will later be used to “patch” the parity during journal replay.

In this article, we present a speculative partial write scheme (called PARIX) that supports fast small writes in erasure-coded storage systems. We transform the original formula of parity calculation to use the data deltas (between the new and original data values), instead of the parity deltas, to compute the parities during journal replay. For each partial write, this allows PARIX to speculatively log only the new value of the data without reading its original value. For a series of n partial writes to the same data, PARIX performs pure writes (instead of write-after-read) for the last n-1 writes, while introducing only a small penalty of one extra network round-trip time for the first one. Based on PARIX, we design and implement PARIX Block Storage (PBS), an efficient block storage system that provides high-performance virtual disk service for VMs running cloud-oblivious applications. PBS not only supports fast partial writes but also realizes efficient full writes, background journal replay, and fast failure recovery with strong consistency guarantees. Both microbenchmarks and trace-driven evaluations show that PBS provides efficient block storage and outperforms state-of-the-art EC-based systems by orders of magnitude.
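The parity-update transformation described above can be illustrated with a minimal sketch. This is not the authors' code: it uses simple XOR parity as a stand-in for Reed-Solomon coefficients over GF(256), and the function and variable names are hypothetical. The point it demonstrates is the abstract's key identity: after n speculative writes to the same block, the parity can be patched once from just the original value (read once) and the latest logged value, so writes 2 through n need no read at all.

```python
# Hypothetical sketch: XOR parity stands in for Reed-Solomon coefficients.

def patch_parity_with_parity_delta(parity, d_old, d_new):
    # Classic write-after-read: every partial write must first read d_old,
    # then compute and log the parity delta (d_old ^ d_new).
    return parity ^ (d_old ^ d_new)

def patch_parity_with_data_delta(parity, d_original, d_latest):
    # PARIX-style replay: only the original value and the latest logged
    # value are needed, regardless of how many writes occurred in between,
    # because the intermediate deltas telescope away.
    return parity ^ (d_original ^ d_latest)

# An EC group of k = 3 data blocks and their XOR parity.
data = [0b1010, 0b0110, 0b1111]
parity = data[0] ^ data[1] ^ data[2]

# n partial writes to block 0; only the first needs the original value.
d_original = data[0]
for w in (0b0001, 0b0111, 0b1100):  # speculatively logged new values
    data[0] = w                     # pure writes, no read-before-write

# Journal replay patches the parity once, from original and latest values.
parity = patch_parity_with_data_delta(parity, d_original, data[0])
assert parity == data[0] ^ data[1] ^ data[2]
```

With real Reed-Solomon codes each parity P_j is updated as P_j += a_ij * (d_new - d_original) in the Galois field, so the same telescoping applies: the coefficient multiplies a single end-to-end data delta rather than n per-write parity deltas.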



• Published in

  ACM Transactions on Storage, Volume 16, Issue 1
  ATC 2019 Special Section and Regular Papers
  February 2020, 155 pages
  ISSN: 1553-3077
  EISSN: 1553-3093
  DOI: 10.1145/3386184
  Editor: Sam H. Noh

          Copyright © 2020 ACM

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 28 February 2020
          • Accepted: 1 October 2019
          • Revised: 1 June 2019
          • Received: 1 December 2018

          Qualifiers

          • research-article
          • Research
          • Refereed
