Abstract
A crash dump, or core dump, is the typical way to save a memory image when a system crashes so that it can be debugged and analyzed offline later. However, on typical server machines with abundant memory, the time taken by the core dump can significantly increase the mean time to repair (MTTR) by delaying reboot-based recovery, while skipping the dump loses the failure context and risks recurring crashes from the same problem.
In this paper, we propose several optimization techniques for core dump in virtualized environments that shorten the MTTR of consolidated virtual machines after a crash. First, we run the crash dump in parallel with the reboot of the crashed VM by dynamically reclaiming memory from the crashed VM and allocating it to the newly spawned VM. Second, we use the virtual machine management layer to introspect critical data structures of the crashed VM and exclude unused memory from the dump. Finally, we apply disk I/O rate control between the core dump and the newly spawned VM, following a user-tuned policy, to balance crash dump time against quality of service in the recovery VM.
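As a rough illustration of the second and third techniques, the sketch below skips pages marked free and spaces out the remaining page writes to stay under a configured disk rate. It is a minimal simulation under stated assumptions, not the Vicover implementation: the names `pages_to_dump`, `DumpThrottle`, and `plan_dump` are hypothetical, the set of free page frame numbers is assumed to have already been obtained by introspecting the crashed VM, and time is simulated rather than measured. The memory reclaiming of the first technique is not modeled.

```python
from typing import Iterator, List, Set, Tuple

PAGE_SIZE = 4096  # bytes per page (assumed)


def pages_to_dump(total_pages: int, free_pfns: Set[int]) -> Iterator[int]:
    """Yield page frame numbers to save, skipping pages known to be unused."""
    for pfn in range(total_pages):
        if pfn not in free_pfns:
            yield pfn


class DumpThrottle:
    """Caps dump writes at a fixed byte rate (hypothetical user-tuned policy)."""

    def __init__(self, bytes_per_sec: float) -> None:
        self.bytes_per_sec = bytes_per_sec
        self.next_free = 0.0  # earliest simulated time the next write may start

    def schedule(self, now: float, nbytes: int) -> float:
        """Return the start time for a write of nbytes and book its budget."""
        start = max(now, self.next_free)
        self.next_free = start + nbytes / self.bytes_per_sec
        return start


def plan_dump(total_pages: int, free_pfns: Set[int],
              bytes_per_sec: float) -> List[Tuple[int, float]]:
    """Return (pfn, start_time) pairs for every page that will be dumped."""
    throttle = DumpThrottle(bytes_per_sec)
    return [(pfn, throttle.schedule(0.0, PAGE_SIZE))
            for pfn in pages_to_dump(total_pages, free_pfns)]


if __name__ == "__main__":
    # 8-page guest with 3 free pages, dump limited to 2 pages per second.
    for pfn, start in plan_dump(8, {1, 2, 5}, 2 * PAGE_SIZE):
        print(f"page {pfn} written at t={start:.1f}s")
```

Lowering `bytes_per_sec` leaves more disk bandwidth for the recovery VM at the cost of a longer dump, which is the trade-off the rate control policy is meant to tune.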
We have implemented a working prototype, Vicover, that optimizes the core dump on a system crash of a virtual machine in Xen, minimizing the MTTR of core dump and recovery as a whole. In our experiments on a virtualized TPC-W server, Vicover shortens the downtime caused by crash dump by around 5X.
Optimizing crash dump in virtualized environments
VEE '10: Proceedings of the 6th ACM SIGPLAN/SIGOPS international conference on Virtual execution environments