
Efficient Testing of Recovery Code Using Fault Injection

Published: 01 December 2011

Abstract

A critical part of developing a reliable software system is testing its recovery code. This code is traditionally difficult to test in the lab, and, in the field, it rarely gets to run; yet, when it does run, it must execute flawlessly in order to recover the system from failure. In this article, we present a library-level fault injection engine that enables the productive use of fault injection for software testing. We describe automated techniques for reliably identifying errors that applications may encounter when interacting with their environment, for automatically identifying high-value injection targets in program binaries, and for producing efficient injection test scenarios. We present a framework for writing precise triggers that inject desired faults, in the form of error return codes and corresponding side effects, at the boundary between applications and libraries. These techniques are embodied in LFI, a new fault injection engine that we are distributing at http://lfi.epfl.ch. This article includes a report of our initial experience using LFI. Most notably, LFI found 12 serious, previously unreported bugs in the MySQL database server, Git version control system, BIND name server, Pidgin IM client, and PBFT replication system, with no developer assistance and no access to source code. LFI also increased recovery-code coverage from virtually zero up to 60%, entirely automatically, without requiring new tests or human involvement.
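The idea of a trigger that injects an error code at the application-library boundary can be illustrated with a minimal Python sketch. This is not LFI's interface or implementation: the names `CallCountTrigger`, `inject_fault`, and `read_config` are hypothetical, and a raised `OSError` carrying an `errno` value stands in for the error return code and its side effect. LFI itself operates on program binaries; this sketch only wraps a Python callable.

```python
import errno
import os
from dataclasses import dataclass


@dataclass
class CallCountTrigger:
    """Hypothetical trigger: fire on exactly the n-th intercepted call."""
    n: int
    calls: int = 0

    def should_fire(self) -> bool:
        self.calls += 1
        return self.calls == self.n


def inject_fault(func, trigger, err):
    """Wrap func so that it fails with OSError(err) when the trigger fires."""
    def wrapper(*args, **kwargs):
        if trigger.should_fire():
            # Simulated fault: the exception stands in for an error return
            # code, and its errno attribute for the accompanying side effect.
            raise OSError(err, os.strerror(err))
        return func(*args, **kwargs)
    return wrapper


def read_config(path, opener=open):
    """Toy application code whose recovery path we want to exercise."""
    try:
        with opener(path) as f:
            return f.read()
    except OSError:
        # Recovery code: fall back to an empty configuration.
        return ""


# Fail only the second call to open(); all other calls pass through.
faulty_open = inject_fault(open, CallCountTrigger(n=2), errno.EMFILE)
```

Passing `faulty_open` to `read_config` lets a test reach the `except OSError` branch deterministically, which ordinary test inputs would rarely exercise. A call-count trigger is only one possible condition; it is used here because it is easy to reason about.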


Published in

ACM Transactions on Computer Systems, Volume 29, Issue 4
December 2011
116 pages
ISSN: 0734-2071
EISSN: 1557-7333
DOI: 10.1145/2063509

Copyright © 2011 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

Published: 1 December 2011
Revised: 1 August 2011
Accepted: 1 August 2011
Received: 1 December 2010
