Abstract
Concurrency bugs that stem from schedule-dependent branches are hard to understand and debug, because their root causes involve not only different event orderings but also changes in control flow between failing and non-failing executions. We present Cortex, a system that helps expose and understand concurrency bugs that result from schedule-dependent branches, without relying on information from failing executions. Cortex preemptively exposes failing executions by perturbing the order of events and the control-flow behavior in non-failing schedules from production runs of a program. By leveraging this information from production runs, Cortex synthesizes executions to guide the search for failing schedules. Production-guided search helps cope with the large execution search space by targeting failing executions that are similar to observed non-failing executions. Evaluation on popular benchmarks shows that Cortex is able to expose failing schedules with only a few perturbations to non-failing executions, and that it does so in a practical amount of time.
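To make the idea of a schedule-dependent branch concrete, the following toy sketch may help. It is not Cortex's implementation: the two-thread program, the helper names `run` and `find_failing`, and the breadth-first search over adjacent swaps are all assumptions introduced here for illustration. It starts from an observed non-failing schedule and tries schedules that differ from it by an increasing number of perturbations until one fails.

```python
from collections import deque


def run(schedule):
    """Run a hypothetical two-thread program under `schedule`.

    `schedule` is a sequence of thread IDs (1 or 2); each entry executes
    the next step of that thread. Returns (failed, branch_taken).
    """
    shared = {"ptr": None}
    local = {"taken": False, "failed": False}

    def a0():  # thread 1 publishes a buffer
        shared["ptr"] = "buf"

    def a1():  # thread 1 releases the buffer
        shared["ptr"] = None

    def b0():  # thread 2: the branch outcome depends on the schedule
        local["taken"] = shared["ptr"] is not None

    def b1():  # thread 2 uses the buffer only if the branch was taken
        if local["taken"] and shared["ptr"] is None:
            local["failed"] = True  # use of a released buffer

    threads = {1: [a0, a1], 2: [b0, b1]}
    pc = {1: 0, 2: 0}
    for tid in schedule:
        threads[tid][pc[tid]]()
        pc[tid] += 1
    return local["failed"], local["taken"]


def find_failing(observed, max_swaps=3):
    """Breadth-first search over schedules near `observed`.

    A perturbation swaps two adjacent events of different threads, so
    per-thread program order is preserved. Returns
    (schedule, number_of_swaps, branch_taken) for the closest failing
    schedule found, or None if none exists within `max_swaps`.
    """
    start = tuple(observed)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        sched, dist = queue.popleft()
        failed, taken = run(sched)
        if failed:
            return list(sched), dist, taken
        if dist == max_swaps:
            continue
        for i in range(len(sched) - 1):
            if sched[i] != sched[i + 1]:
                nxt = sched[:i] + (sched[i + 1], sched[i]) + sched[i + 2:]
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
    return None


if __name__ == "__main__":
    # Observed non-failing run: thread 1 finishes before thread 2 starts.
    print(run([1, 1, 2, 2]))
    print(find_failing([1, 1, 2, 2]))
```

In this toy program the failing schedule differs from the observed one in both event order and branch outcome, which is the class of bug the abstract describes. Exhaustive enumeration is used only because the example is tiny; it says nothing about how Cortex itself searches the schedule space.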
Production-guided concurrency debugging
PPoPP '16: Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming