research-article

Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to File-System Faults

Published: 28 September 2017

Abstract

We analyze how modern distributed storage systems behave in the presence of file-system faults such as data corruption and read and write errors. We characterize eight popular distributed storage systems and uncover numerous problems related to file-system fault tolerance. We find that modern distributed systems do not consistently use redundancy to recover from file-system faults: a single file-system fault can cause catastrophic outcomes such as data loss, corruption, and unavailability. We also find that the above outcomes arise due to fundamental problems in file-system fault handling that are common across many systems. Our results have implications for the design of next-generation fault-tolerant distributed and cloud storage systems.

Published in

ACM Transactions on Storage, Volume 13, Issue 3
Special Issue on FAST 2017 and Regular Papers
August 2017
265 pages
ISSN: 1553-3077
EISSN: 1553-3093
DOI: 10.1145/3141876
Editor: Sam H. Noh

Copyright © 2017 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 28 September 2017
• Accepted: 1 July 2017
• Received: 1 June 2017
Qualifiers

• research-article
• Research
• Refereed
