Abstract
As hardware failures are no longer rare in the era of cloud computing, cloud software systems must "prevail" against multiple, diverse failures that are likely to occur. Testing software against multiple failures poses the problem of combinatorial explosion of failure combinations. To address this problem, we present PreFail, a programmable failure-injection tool that enables testers to write a wide range of policies to prune the large space of multiple failures. We integrate PreFail with three cloud software systems (HDFS, Cassandra, and ZooKeeper), show a wide variety of useful pruning policies that we can write for them, and evaluate the speed-ups in testing time that we obtain by using the policies. In our experiments, our testing approach with appropriate policies found all the bugs that exhaustive testing finds while spending 10X-200X less time.
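To make the idea of pruning concrete, the following is a minimal sketch of how a policy can shrink a space of two-failure combinations. It is not PreFail's API: the node names, operation names, and the functions `distinct_nodes_policy` and `dedupe_by_ops` are hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Hypothetical failure points: (node, I/O operation) pairs where a
# crash could be injected. These names are illustrative assumptions.
NODES = ["n1", "n2", "n3"]
OPS = ["read", "write", "sync", "connect"]
FAILURE_POINTS = [(node, op) for node in NODES for op in OPS]


def exhaustive(points, k):
    """Return every unordered combination of k failure points."""
    return list(combinations(points, k))


def distinct_nodes_policy(combo):
    """Filter policy: keep a combination only if each failure hits a different node."""
    nodes = [node for node, _ in combo]
    return len(set(nodes)) == len(nodes)


def dedupe_by_ops(combos):
    """Clustering policy: treat nodes as symmetric and keep one
    representative combination per multiset of failed operations."""
    seen = set()
    representatives = []
    for combo in combos:
        key = tuple(sorted(op for _, op in combo))
        if key not in seen:
            seen.add(key)
            representatives.append(combo)
    return representatives


all_pairs = exhaustive(FAILURE_POINTS, 2)
kept = [c for c in all_pairs if distinct_nodes_policy(c)]
representatives = dedupe_by_ops(kept)

print(len(all_pairs), len(kept), len(representatives))  # 66 48 10
```

In this toy setting, two simple policies reduce 66 candidate experiments to 10. Whether such pruning is safe depends on the system under test; the symmetry assumption here is an illustration, not a claim about HDFS, Cassandra, or ZooKeeper.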
PREFAIL: a programmable tool for multiple-failure injection
OOPSLA '11: Proceedings of the 2011 ACM international conference on Object oriented programming systems languages and applications