Abstract
With existing virtualized systems, hypervisor failures lead to overall system failure and the loss of all the work in progress of virtual machines (VMs) running on the system. We introduce ReHype, a mechanism for recovery from hypervisor failures by booting a new instance of the hypervisor while preserving the state of running VMs. VMs are stalled during the hypervisor reboot and resume normal execution once the new hypervisor instance is running. Hypervisor failures can lead to arbitrary state corruption and inconsistencies throughout the system. ReHype deals with the challenge of protecting the recovered hypervisor instance from such corrupted state and resolving inconsistencies between different parts of hypervisor state, between the hypervisor and VMs, and between the hypervisor and the hardware. We have implemented ReHype for the Xen hypervisor. The implementation was done incrementally, using results from fault injection experiments to identify the sources of dangerous state corruption and inconsistencies. The implementation of ReHype involved only 880 lines of code (LOC) added or modified in Xen. The memory space overhead of ReHype is only 2.1 MB for a pristine copy of the hypervisor code and static data plus a small reserved memory area. The fault injection campaigns used to evaluate the effectiveness of ReHype involved a system with multiple VMs running I/O- and hypercall-intensive benchmarks. Our experimental results show that the ReHype prototype can successfully recover from over 90% of detected hypervisor failures.
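The recovery flow summarized in the abstract can be illustrated with a toy model. This is only a sketch under stated assumptions, not the ReHype implementation: the names `VM`, `Hypervisor`, `PRISTINE_IMAGE`, `boot`, and `recover` are hypothetical, and the model leaves out the hard part the abstract emphasizes, namely resolving inconsistencies among hypervisor state, VMs, and hardware.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class VM:
    name: str
    memory: dict            # stands in for guest memory and CPU state
    running: bool = True


@dataclass
class Hypervisor:
    static_data: dict
    vms: dict = field(default_factory=dict)   # VM name -> VM
    failed: bool = False


# Hypothetical pristine copy of hypervisor code and static data,
# kept aside so that recovery never boots from corrupted state.
PRISTINE_IMAGE = {"version": "toy", "scheduler": "clean"}


def boot(image):
    """Boot a new hypervisor instance from a copy of the pristine image."""
    return Hypervisor(static_data=copy.deepcopy(image))


def recover(failed_hv, image):
    """Replace a failed hypervisor while keeping its VMs' state in place."""
    # 1. Stall every VM; guest state is left untouched.
    for vm in failed_hv.vms.values():
        vm.running = False
    # 2. Boot a fresh instance from the pristine image,
    #    not from the failed instance's static data.
    new_hv = boot(image)
    # 3. Re-attach the preserved VMs (the same objects, not copies).
    new_hv.vms = dict(failed_hv.vms)
    # 4. Resume normal execution of the VMs.
    for vm in new_hv.vms.values():
        vm.running = True
    return new_hv


if __name__ == "__main__":
    hv = boot(PRISTINE_IMAGE)
    hv.vms["guest1"] = VM("guest1", {"counter": 41})
    hv.static_data["scheduler"] = "corrupted"   # simulated fault
    hv.failed = True
    hv = recover(hv, PRISTINE_IMAGE)
    print(hv.failed, hv.static_data["scheduler"], hv.vms["guest1"].memory)
```

In this model the work in progress of each VM survives because the new instance adopts the existing VM objects rather than recreating them, while the corrupted static data is discarded in favor of the pristine copy.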
ReHype: enabling VM survival across hypervisor failures
VEE '11: Proceedings of the 7th ACM SIGPLAN/SIGOPS international conference on Virtual execution environments