Abstract
We analyze how modern distributed storage systems behave in the presence of file-system faults such as data corruption and read and write errors. We characterize eight popular distributed storage systems and uncover numerous problems related to file-system fault tolerance. We find that modern distributed systems do not consistently use redundancy to recover from file-system faults: a single file-system fault can cause catastrophic outcomes such as data loss, corruption, and unavailability. We also find that these outcomes stem from fundamental problems in file-system fault handling that are common across many systems. Our results have implications for the design of next-generation fault-tolerant distributed and cloud storage systems.