Fine-grained fault tolerance using device checkpoints

Published: 16 March 2013

Abstract

Recovering from faults in drivers is harder than in other code because driver state is spread across both memory and a device. Existing driver fault-tolerance mechanisms either restart the driver and discard its state, which can break applications, or require an extensive logging mechanism to replay requests and recreate driver state. Even logging may be insufficient, though, if the semantics of requests are ambiguous. In addition, these systems either require large subsystems that must be kept up to date as the kernel changes, or require substantial rewriting of drivers.

We present a new driver fault-tolerance mechanism that provides fine-grained control over which code is protected. Fine-Grained Fault Tolerance (FGFT) isolates driver code at the granularity of a single entry point. It executes driver code as a transaction, allowing rollback if the driver fails. We develop a novel checkpointing mechanism to save and restore device state using existing power management code. Unlike past systems, FGFT can be incrementally deployed in a single driver without the need for a large kernel subsystem, but at the cost of small modifications to the driver.
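The idea of running one entry point as a transaction over a device checkpoint can be illustrated with a toy model. The sketch below is not FGFT's implementation: `ToyDevice`, `ToyDriver`, `suspend_save`, `resume_restore`, and `run_as_transaction` are hypothetical names, a dictionary stands in for device registers, and a deep copy stands in for whatever isolation the real system applies to driver memory. It only shows the control flow of checkpoint, execute, and roll back on failure.

```python
import copy


class ToyDevice:
    """Hypothetical device: a dict of register values stands in for hardware state."""

    def __init__(self):
        self.regs = {"ctrl": 0, "mtu": 1500}


class ToyDriver:
    """Hypothetical driver whose state is split between memory and the device."""

    def __init__(self, device):
        self.device = device
        self.state = {"mtu": 1500}  # in-memory driver state

    # Stand-ins for power-management style save/restore callbacks (assumed names).
    def suspend_save(self):
        return dict(self.device.regs)

    def resume_restore(self, saved):
        self.device.regs = dict(saved)

    def set_mtu(self, mtu):
        """Example entry point: writes the device first, then may fail."""
        self.device.regs["mtu"] = mtu
        if mtu > 9000:
            raise RuntimeError("simulated driver fault")
        self.state["mtu"] = mtu


def run_as_transaction(driver, entry_point, *args):
    """Checkpoint device and memory state, run one entry point, roll back on failure."""
    device_ckpt = driver.suspend_save()
    memory_ckpt = copy.deepcopy(driver.state)
    try:
        entry_point(*args)
        return True
    except Exception:
        # Undo both halves of the driver's state: memory and device.
        driver.state = memory_ckpt
        driver.resume_restore(device_ckpt)
        return False


dev = ToyDevice()
drv = ToyDriver(dev)
committed = run_as_transaction(drv, drv.set_mtu, 4000)
rolled_back = not run_as_transaction(drv, drv.set_mtu, 65536)
```

In this toy run the first call commits and the second call fails after it has already written the device register, so the restore step is what returns the device to its last good value. Only the chosen entry point is wrapped, which mirrors the abstract's point that protection can be applied to individual entry points rather than to a whole driver.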

In the evaluation, we show that FGFT can have almost zero runtime cost in many cases, and that checkpoint-based recovery can reduce the duration of a failure by 79% compared to restarting the driver. Finally, we show that applying FGFT to a driver requires little effort, and the majority of drivers in common classes already contain the power-management code needed for checkpoint/restore.

Published in

ACM SIGPLAN Notices, Volume 48, Issue 4 (ASPLOS '13), April 2013, 540 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2499368

ASPLOS '13: Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems, March 2013, 574 pages
ISBN: 9781450318709
DOI: 10.1145/2451116

Copyright © 2013 ACM

Publisher: Association for Computing Machinery, New York, NY, United States