ABSTRACT
Deterministic replay systems record and reproduce the execution of a hardware or software system. While it is well known how to replay uniprocessor systems, replaying shared memory multiprocessor systems at low overhead on commodity hardware is still an open problem. This paper presents Respec, a new way to support deterministic replay of shared memory multithreaded programs on commodity multiprocessor hardware. Respec targets online replay in which the recorded and replayed processes execute concurrently.
Respec uses two strategies to reduce overhead while still ensuring correctness: speculative logging and externally deterministic replay. Speculative logging optimistically logs less information about shared memory dependencies than is needed to guarantee deterministic replay, then recovers and retries if the replayed process diverges from the recorded process. Externally deterministic replay relaxes the degree to which the two executions must match by requiring only that their system output and final program states match. We show that the combination of these two techniques results in low recording and replay overhead for the common case of data-race-free execution intervals and still ensures correct replay for execution intervals that have data races.
We modified the Linux kernel to implement our techniques. Our software system adds on average about 18% overhead to the execution time for recording and replaying programs with two threads and 55% overhead for programs with four threads.
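The speculate, check, and retry loop described above can be illustrated with a toy model. The sketch below is not the Respec implementation: it assumes execution is divided into intervals, that each interval starts from a shared checkpoint, and that recovery simply re-runs the replayed interval under the recorded interleaving. All names (`run_interval`, `record_and_replay`, the `schedule` argument) are hypothetical. The check compares only the interval's output and end state, which is the "externally deterministic" criterion.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, object]
# An interval mutates the state under a given thread interleaving (schedule)
# and returns the output it produced. Hypothetical model, not Respec's API.
Interval = Callable[[State, int], List[str]]


def race_free(state: State, schedule: int) -> List[str]:
    # Two threads update disjoint counters, so every interleaving ends
    # in the same state even though the internal order differs.
    for key in (("a", "b") if schedule == 0 else ("b", "a")):
        state[key] = int(state.get(key, 0)) + 1
    return [f"a={state['a']} b={state['b']}"]


def racy(state: State, schedule: int) -> List[str]:
    # Two threads write the same variable without synchronization;
    # the last writer wins, so the result depends on the interleaving.
    for thread in (("t1", "t2") if schedule == 0 else ("t2", "t1")):
        state["x"] = thread
    return [f"x={state['x']}"]


def run_interval(interval: Interval, checkpoint: State, schedule: int) -> Tuple[State, List[str]]:
    state = dict(checkpoint)  # work on a copy so the checkpoint survives
    return state, interval(state, schedule)


def record_and_replay(intervals, initial, rec_schedules, rep_schedules):
    checkpoint, outputs, rollbacks = dict(initial), [], 0
    for interval, rec_s, rep_s in zip(intervals, rec_schedules, rep_schedules):
        rec_state, rec_out = run_interval(interval, checkpoint, rec_s)
        rep_state, rep_out = run_interval(interval, checkpoint, rep_s)
        # External determinism: compare only output and end-of-interval state.
        if (rep_state, rep_out) != (rec_state, rec_out):
            rollbacks += 1
            # Assumed recovery: roll back and retry under the recorded schedule.
            rep_state, rep_out = run_interval(interval, checkpoint, rec_s)
            if (rep_state, rep_out) != (rec_state, rec_out):
                raise RuntimeError("replay still diverges after retry")
        outputs.extend(rec_out)  # output is released only after the check
        checkpoint = rec_state
    return checkpoint, outputs, rollbacks


final, outputs, rollbacks = record_and_replay(
    [race_free, racy, race_free], {}, rec_schedules=[0, 0, 0], rep_schedules=[1, 1, 1]
)
print(final, outputs, rollbacks)
```

In this toy run the race-free intervals pass the check even though the two executions interleave differently, and only the racy interval triggers a rollback.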
Respec: efficient online multiprocessor replay via speculation and external determinism (ASPLOS '10)