research-article

ReHype: enabling VM survival across hypervisor failures

Published: 09 March 2011
Abstract

With existing virtualized systems, hypervisor failures lead to overall system failure and the loss of all the work in progress of virtual machines (VMs) running on the system. We introduce ReHype, a mechanism for recovery from hypervisor failures by booting a new instance of the hypervisor while preserving the state of running VMs. VMs are stalled during the hypervisor reboot and resume normal execution once the new hypervisor instance is running. Hypervisor failures can lead to arbitrary state corruption and inconsistencies throughout the system. ReHype deals with the challenge of protecting the recovered hypervisor instance from such corrupted state and resolving inconsistencies between different parts of hypervisor state as well as between the hypervisor and VMs and between the hypervisor and the hardware. We have implemented ReHype for the Xen hypervisor. The implementation was done incrementally, using results from fault injection experiments to identify the sources of dangerous state corruption and inconsistencies. The implementation of ReHype involved only 880 LOC added or modified in Xen. The memory space overhead of ReHype is only 2.1MB for a pristine copy of the hypervisor code and static data plus a small reserved memory area. The fault injection campaigns used to evaluate the effectiveness of ReHype involved a system with multiple VMs running I/O and hypercall-intensive benchmarks. Our experimental results show that the ReHype prototype can successfully recover from over 90% of detected hypervisor failures.

Published in

ACM SIGPLAN Notices, Volume 46, Issue 7
VEE '11
July 2011
231 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2007477

VEE '11: Proceedings of the 7th ACM SIGPLAN/SIGOPS international conference on Virtual execution environments
March 2011
250 pages
ISBN: 9781450306874
DOI: 10.1145/1952682

Copyright © 2011 ACM

Publisher

Association for Computing Machinery, New York, NY, United States
