Abstract
Efficient deterministic replay of whole operating systems is feasible and useful, so why isn't replay a default part of the software stack? While implementing deterministic replay is hard, we argue that the main reason is the lack of general abstractions for understanding and addressing the significant engineering challenges involved in the development of a replay engine for a modern VMM. We present a design blueprint (a set of abstractions, general principles, and low-level implementation details) for efficient deterministic replay in a modern hypervisor. We build and evaluate our architecture in Xen, a full-featured hypervisor. Our architecture can be readily followed and adopted, enabling replay as a ubiquitous part of a modern virtualization stack.
Abstractions for Practical Virtual Machine Replay
VEE '16: Proceedings of the 12th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments