Abstract
A critical part of developing a reliable software system is testing its recovery code. This code is traditionally difficult to test in the lab and rarely gets to run in the field; yet, when it does run, it must execute flawlessly in order to recover the system from failure. In this article, we present a library-level fault injection engine that enables the productive use of fault injection for software testing. We describe automated techniques for reliably identifying the errors that applications may encounter when interacting with their environment, for automatically identifying high-value injection targets in program binaries, and for producing efficient injection test scenarios. We present a framework for writing precise triggers that inject desired faults, in the form of error return codes and corresponding side effects, at the boundary between applications and libraries. These techniques are embodied in LFI, a new fault injection engine that we are distributing at http://lfi.epfl.ch. This article includes a report of our initial experience using LFI. Most notably, LFI found 12 serious, previously unreported bugs in the MySQL database server, Git version control system, BIND name server, Pidgin IM client, and PBFT replication system, with no developer assistance and no access to source code. LFI also increased recovery-code coverage from virtually zero to as much as 60%, entirely automatically, without requiring new tests or human involvement.
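To make the idea of a trigger at the application/library boundary concrete, the following is a minimal sketch in Python. It is not LFI's trigger framework or API: the names `CallCountTrigger`, `make_stub`, `robust_read`, and `run_scenario` are hypothetical and introduced only for illustration. The sketch assumes the "library" is Python's `os.read`, and that an injected fault is an `OSError` carrying the chosen error code, standing in for an error return value plus the `errno` side effect a C library call would produce.

```python
import errno
import os


class CallCountTrigger:
    """Hypothetical trigger: fires on the n-th intercepted call (1-based)."""

    def __init__(self, n):
        self.n = n
        self.calls = 0

    def should_inject(self):
        self.calls += 1
        return self.calls == self.n


def make_stub(func, trigger, error_code):
    """Wrap a library function so that it fails when the trigger fires.

    The injected OSError carries error_code, which plays the role of the
    error return code and errno side effect of a failing C library call.
    """
    def stub(*args, **kwargs):
        if trigger.should_inject():
            raise OSError(error_code, os.strerror(error_code))
        return func(*args, **kwargs)  # otherwise pass through unchanged
    return stub


def robust_read(read_fn, fd, size, max_retries=3):
    """Application code under test: its recovery path retries on EINTR."""
    for _ in range(max_retries + 1):
        try:
            return read_fn(fd, size)
        except OSError as exc:
            if exc.errno != errno.EINTR:
                raise  # errors other than EINTR are not recoverable here
    raise OSError(errno.EINTR, "read interrupted too many times")


def run_scenario(error_code):
    """Run one injection scenario: fail the first read with error_code."""
    r, w = os.pipe()
    try:
        os.write(w, b"payload")
        stub = make_stub(os.read, CallCountTrigger(1), error_code)
        return robust_read(stub, r, 7)
    finally:
        os.close(r)
        os.close(w)
```

Calling `run_scenario(errno.EINTR)` exercises the retry path, which otherwise almost never runs in an ordinary test, while `run_scenario(errno.EIO)` checks that an unrecoverable error is propagated rather than silently swallowed.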