Abstract
Recovering from faults in drivers is harder than in other code because driver state is spread across both memory and a device. Existing driver fault-tolerance mechanisms either restart the driver and discard its state, which can break applications, or require an extensive logging mechanism to replay requests and recreate driver state. Even logging may be insufficient if the semantics of requests are ambiguous. In addition, these systems require either large subsystems that must be kept up to date as the kernel changes or substantial rewriting of drivers.
We present a new driver fault-tolerance mechanism that provides fine-grained control over which code is protected. Fine-Grained Fault Tolerance (FGFT) isolates driver code at the granularity of a single entry point. It executes driver code as a transaction, allowing rollback if the driver fails. We develop a novel checkpointing mechanism that saves and restores device state using existing power-management code. Unlike past systems, FGFT can be incrementally deployed in a single driver without the need for a large kernel subsystem, at the cost of small modifications to the driver.
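The idea of running one entry point as a transaction over a device checkpoint can be illustrated with a toy simulation. This is a minimal sketch and not the FGFT implementation: `ToyDevice`, `ToyDriver`, `suspend_save`, `resume_restore`, and `run_as_transaction` are hypothetical names, the "device" is just a dictionary of registers, and an ordinary Python exception stands in for a detected driver failure.

```python
import copy


class ToyDevice:
    """Hypothetical stand-in for hardware: a few named registers."""

    def __init__(self):
        self.registers = {"ctrl": 0, "mtu": 1500}


class ToyDriver:
    """Hypothetical driver whose state lives in memory and on the device."""

    def __init__(self, device):
        self.device = device
        self.state = {"config_changes": 0}  # in-memory driver state
        self._saved_regs = None

    # Power-management-style callbacks, reused here as checkpoint/restore.
    def suspend_save(self):
        self._saved_regs = dict(self.device.registers)

    def resume_restore(self):
        self.device.registers = dict(self._saved_regs)

    # An entry point that modifies both kinds of state and may then fail.
    def set_mtu(self, mtu):
        self.device.registers["mtu"] = mtu
        self.state["config_changes"] += 1
        if mtu > 9000:
            raise ValueError("unsupported MTU")
        return mtu


def run_as_transaction(driver, entry_point, *args):
    """Checkpoint, run one entry point, and roll back if it raises."""
    driver.suspend_save()                           # device checkpoint
    memory_snapshot = copy.deepcopy(driver.state)   # memory checkpoint
    try:
        return True, entry_point(*args)
    except Exception:
        driver.resume_restore()                     # restore device state
        driver.state = memory_snapshot              # restore memory state
        return False, None
```

Only the call wrapped by `run_as_transaction` is protected, which mirrors the per-entry-point granularity described above; other entry points of the same toy driver run unmodified.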
In the evaluation, we show that FGFT can have almost zero runtime cost in many cases, and that checkpoint-based recovery can reduce the duration of a failure by 79% compared to restarting the driver. Finally, we show that applying FGFT to a driver requires little effort, and the majority of drivers in common classes already contain the power-management code needed for checkpoint/restore.
Fine-grained fault tolerance using device checkpoints
ASPLOS '13: Proceedings of the eighteenth international conference on Architectural Support for Programming Languages and Operating Systems