Protocol-Aware Recovery for Consensus-Based Distributed Storage

Published: 03 October 2018

Abstract

We introduce protocol-aware recovery (Par), a new approach that exploits protocol-specific knowledge to correctly recover from storage faults in distributed systems. We demonstrate the efficacy of Par through the design and implementation of corruption-tolerant replication (Ctrl), a Par mechanism specific to replicated state machine (RSM) systems. We experimentally show that the Ctrl versions of two systems, LogCabin and ZooKeeper, safely recover from storage faults and provide high availability, while the unmodified versions can lose data or become unavailable. We also show that the Ctrl versions achieve this reliability with little performance overhead.

Published in

ACM Transactions on Storage, Volume 14, Issue 3
Special Issue on FAST 2018 and Regular Papers
August 2018
210 pages
ISSN: 1553-3077
EISSN: 1553-3093
DOI: 10.1145/3282875
Editor: Sam H. Noh

Copyright © 2018 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 3 October 2018
• Accepted: 1 July 2018
• Received: 1 May 2018

Qualifiers

• research-article
• Research
• Refereed
