research-article

PREFAIL: a programmable tool for multiple-failure injection

Published: 22 October 2011

Abstract

As hardware failures are no longer rare in the era of cloud computing, cloud software systems must "prevail" against multiple, diverse failures that are likely to occur. Testing software against multiple failures raises the problem of combinatorial explosion: the number of possible failure combinations grows rapidly with the number of failures injected. To address this problem, we present PreFail, a programmable failure-injection tool that enables testers to write a wide range of policies to prune down the large space of multiple failures. We integrate PreFail with three cloud software systems (HDFS, Cassandra, and ZooKeeper), show a wide variety of useful pruning policies that we can write for them, and evaluate the speed-ups in testing time that we obtain by using the policies. In our experiments, our testing approach with appropriate policies found all the bugs that one can find using exhaustive testing while spending 10X to 200X less time than exhaustive testing.
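To make the idea of pruning a multiple-failure space concrete, the following is a minimal sketch and not PreFail's actual interface. The failure points, the two policies, and all function names are hypothetical and chosen only for illustration. It enumerates every pair of failures over a small set of injection points, then applies a filtering policy and a clustering policy to shrink the set of experiments.

```python
from itertools import combinations

# Hypothetical failure points: (node, operation) pairs where a crash
# could be injected. These names are illustrative assumptions.
FAILURE_POINTS = [
    ("node1", "write"), ("node1", "sync"),
    ("node2", "write"), ("node2", "sync"),
    ("node3", "write"), ("node3", "sync"),
]


def exhaustive(points, k):
    """Return every way to inject k failures: all k-subsets of the points."""
    return list(combinations(points, k))


def distinct_nodes_policy(combo):
    """Hypothetical filtering policy: keep only combinations that fail
    k different nodes."""
    nodes = [node for node, _ in combo]
    return len(set(nodes)) == len(nodes)


def one_per_operation_signature(combos):
    """Hypothetical clustering policy: treat combinations that fail the
    same multiset of operations as equivalent and keep one of each."""
    seen = set()
    kept = []
    for combo in combos:
        signature = tuple(sorted(op for _, op in combo))
        if signature not in seen:
            seen.add(signature)
            kept.append(combo)
    return kept


all_pairs = exhaustive(FAILURE_POINTS, 2)
filtered = [c for c in all_pairs if distinct_nodes_policy(c)]
pruned = one_per_operation_signature(filtered)

print(len(all_pairs), len(filtered), len(pruned))  # prints "15 12 3"
```

In this toy setting the policies cut 15 two-failure experiments down to 3. Whether such pruning preserves bug-finding ability depends on the policy and the system under test, which is the question the paper's evaluation addresses.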


Published in

ACM SIGPLAN Notices, Volume 46, Issue 10 (OOPSLA '11)
October 2011
1063 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2076021

OOPSLA '11: Proceedings of the 2011 ACM international conference on Object oriented programming systems languages and applications
October 2011
1104 pages
ISBN: 9781450309400
DOI: 10.1145/2048066

Copyright © 2011 ACM

Publisher

Association for Computing Machinery
New York, NY, United States
