Abstract
We introduce protocol-aware recovery (PAR), a new approach that exploits protocol-specific knowledge to correctly recover from storage faults in distributed systems. We demonstrate the efficacy of PAR through the design and implementation of corruption-tolerant replication (CTRL), a PAR mechanism specific to replicated state machine (RSM) systems. We experimentally show that the CTRL versions of two systems, LogCabin and ZooKeeper, safely recover from storage faults and provide high availability, while the unmodified versions can lose data or become unavailable. We also show that the CTRL versions achieve this reliability with little performance overhead.
Protocol-Aware Recovery for Consensus-Based Distributed Storage