Abstract
Fixing concurrency bugs (or "crugs") is critical in modern software systems. Static analyses to find crugs such as data races and atomicity violations scale poorly, while dynamic approaches incur high run-time overheads. Crugs manifest only under specific execution interleavings that may not arise during in-house testing, thereby demanding a lightweight program monitoring technique that can be used post-deployment.
We present Cooperative Crug Isolation (CCI), a low-overhead instrumentation framework to diagnose production-run failures caused by crugs. CCI tracks specific thread interleavings at run-time, and uses statistical models to identify strong failure predictors among these. We offer a varied suite of predicates that represent different trade-offs between complexity and fault isolation capability. We also develop variant random sampling strategies that suit different types of predicates and help keep the run-time overhead low. Experiments with 9 real-world bugs in 6 non-trivial C applications show that these schemes span a wide spectrum of performance and diagnosis capabilities, each suitable for different usage scenarios.
- }}The Apache Software Foundation. Apache HTTP Server Project. http://httpd.apache.org/, 2009.Google Scholar
- }}C. Armour-Brown, J. Fitzhardinge, T. Hughes, N. Nethercote, P. Mackerras, D. Mueller, J. Seward, B. V. Assche, R. Walsh, and J. Weidendorfer. Valgrind User Manual. Valgrind project, 3.5.0 edition, Aug. 2009. http://valgrind.org/docs/manual/manual.html.Google Scholar
- }}P. Arumuga Nainar, T. Chen, J. Rosin, and B. Liblit. Statistical debugging using compound Boolean predicates. In ISSTA, 2007. Google Scholar
Digital Library
- }}E. D. Berger, T. Yang, T. Liu, and G. Novark. Grace: safe multithreaded programming for c/c++. In OOPSLA, 2009. Google Scholar
Digital Library
- }}M. D. Bond, K. E. Coons, and K. S. McKinley. Pacer: Proportional detection of data races. In PLDI, 2010. Google Scholar
Digital Library
- }}A. Bron, E. Farchi, Y. Magid, Y. Nir, and S. Ur. Applications of synchronization coverage. In PPOPP, 2005. Google Scholar
Digital Library
- }}J. Burnim and K. Sen. Asserting and checking determinism for multithreaded programs. In FSE, 2009. Google Scholar
Digital Library
- }}T. Chilimbi, B. Liblit, K. Mehra, A. V. Nori, and K. Vaswani. HOLMES: Effective statistical debugging via efficient path profiling. In ICSE, May 2009. Google Scholar
Digital Library
- }}J.-D. Choi, K. Lee, A. Loginov, R. O'Callahan, V. Sarkar, and M. Sridharan. Efficient and precise datarace detection for multithreaded object-oriented programs. In PLDI, 2002. Google Scholar
Digital Library
- }}G. Dunlap, D. Lucchetti, M. Fetterman, and P. Chen. Execution replay of multiprocessor virtual machines. In VEE, 2008. Google Scholar
Digital Library
- }}T. Elmas, S. Qadeer, and S. Tasiran. A calculus of atomic actions. In POPL, 2009. Google Scholar
Digital Library
- }}D. Engler and K. Ashcraft. RacerX: Effective, static detection of race conditions and deadlocks. SIGOPS Oper. Syst. Rev., 37(5), 2003. Google Scholar
Digital Library
- }}C. Flanagan and S. N. Freund. Atomizer: a dynamic atomicity checker for multithreaded programs. In POPL, 2004. Google Scholar
Digital Library
- }}C. Flanagan and S. N. Freund. Fasttrack: efficient and precise dynamic race detection. In PLDI, 2009. Google Scholar
Digital Library
- }}J. Gilchrist. PBZIP2: Parallel BZIP2 Data Compression Software. http://compression.ca/pbzip2/, 2009.Google Scholar
- }}P. Godefroid and N. Nagappan. Concurrency at Microsoft - an exploratory survey. In Workshop on Exploiting Concurrency Efficiently and Correctly, 2008.Google Scholar
- }}T. A. Henzinger, R. Jhala, and R. Majumdar. Race checking by context inference. In PLDI, 2004. Google Scholar
Digital Library
- }}M. Herlihy. Wait-free synchronization. ACM Trans. Program. Lang. Syst., 13(1), 1991. Google Scholar
Digital Library
- }}M. Hirzel and T. M. Chilimbi. Bursty tracing: A framework for low-overhead temporal profiling. In 4th ACM Workshop on Feedback-Directed and Dynamic Optimization, 2001.Google Scholar
- }}D. R. Hower, P. Montesinos, L. Ceze, M. D. Hill, and J. Torrellas. Two hardware-based approaches for deterministic multi-processor replay. CACM, 2009. Google Scholar
Digital Library
- }}V. Kahlon, Y. Yang, S. Sankaranarayanan, and A. Gupta. Fast and accurate static data-race detection for concurrent programs. In CAV, 2007. Google Scholar
Digital Library
- }}T. J. LeBlanc and J. M. Mellor-Crummey. Debugging parallel programs with instant replay. IEEE Trans. Comput., 36(4), 1987. Google Scholar
Digital Library
- }}B. Liblit, A. Aiken, A. X. Zheng, and M. I. Jordan. Bug isolation via remote program sampling. In PLDI, 2003. Google Scholar
Digital Library
- }}B. Liblit, M. Naik, A. X. Zheng, A. Aiken, and M. I. Jordan. Public deployment of Cooperative Bug Isolation. In RAMSS, 2004.Google Scholar
Cross Ref
- }}B. Liblit, M. Naik, A. X. Zheng, A. Aiken, and M. I. Jordan. Scalable statistical bug isolation. In PLDI, 2005. Google Scholar
Digital Library
- }}S. Lu, J. Tucek, F. Qin, and Y. Zhou. AVIO: Detecting atomicity violations via access-interleaving invariants. In ASPLOS, 2006. Google Scholar
Digital Library
- }}S. Lu, S. Park, E. Seo, and Y. Zhou. Learning from mistakes - a comprehensive study of real world concurrency bug characteristics. In ASPLOS, March 2008. Google Scholar
Digital Library
- }}B. Lucia, J. Devietti, L. Ceze, and K. Strauss. Atom-aid: Detecting and surviving atomicity violations. IEEE Micro, 29 (1), 2009. Google Scholar
Digital Library
- }}D. Marino, M. Musuvathi, and S. Narayanasamy. Effective sampling for lightweight data-race detection. In PLDI, 2009. Google Scholar
Digital Library
- }}M. Musuvathi and S. Qadeer. Iterative context bounding for systematic testing of multithreaded programs. In PLDI, 2007. Google Scholar
Digital Library
- }}S. Narayanasamy, Z. Wang, J. Tigani, A. Edwards, and B. Calder. Automatically classifying benign and harmful data races using replay analysis. In PLDI, 2007. Google Scholar
Digital Library
- }}S. Park, S. Lu, and Y. Zhou. CTrigger: exposing atomicity violation bugs from their hiding places. In ASPLOS, 2009. Google Scholar
Digital Library
- }}W. Pugh and N. Ayewah. Unit testing concurrent software. In ASE, 2007. Google Scholar
Digital Library
- }}S. Rajamani, G. Ramalingam, V.-P. Ranganath, and K. Vaswani. Isolator: dynamically ensuring isolation in comcurrent programs. In ASPLOS, 2009. Google Scholar
Digital Library
- }}P. Ratanaworabhan, M. Burtscher, D. Kirovski, B. Zorn, R. Nagpal, and K. Pattabiraman. Detecting and tolerating asymmetric races. In PPoPP, 2009. Google Scholar
Digital Library
- }}S. Savage, M. Burrows, G. Nelson, P. Sobalvarro, and T. Anderson. Eraser: A dynamic data race detector for multithreaded programs. ACM Transactions on Computer Systems, 15, 1997. Google Scholar
Digital Library
- }}SecurityFocus. Software bug contributed to blackout. http://www.securityfocus.com/news/8016, Feb. 2004.Google Scholar
- }}K. Sen. Race directed random testing of concurrent programs. In PLDI, 2008. Google Scholar
Digital Library
- }}Y. Shi, S. Park, Z. Yin, S. Lu, Y. Zhou, W. Chen, and W. Zheng. Do I Use the Wrong Definition? DefUse: Definition-Use Invariants for Detecting Concurrency and Sequential Bugs. In OOPSLA, 2010. Google Scholar
Digital Library
- }}A. Sussman and J. Trawick. Corrupt log lines at high volumes. https://issues.apache.org/bugzilla/show_bug.cgi?id=25520, 2003.Google Scholar
- }}A. Thakur, R. Sen, B. Liblit, and S. Lu. Cooperative Crug Isolation. In WODA, 2009. Google Scholar
Digital Library
- }}M. Vaziri, F. Tip, and J. Dolby. Associating synchronization constraints with data in an object-oriented language. In POPL, 2006. Google Scholar
Digital Library
- }}M. T. Vechev, E. Yahav, and G. Yorsh. Abstraction-guided synthesis of synchronization. In POPL, 2010. Google Scholar
Digital Library
- }}S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 programs: Characterization and methodological considerations. In ISCA, 1995. Google Scholar
Digital Library
- }}J. Yu and S. Narayanasamy. A case for an interleaving constrained shared-memory multi-processor. In ISCA, 2009. Google Scholar
Digital Library
- }}Y. Yu, T. Rodeheffer, and W. Chen. Racetrack: efficient detection of data race conditions via adaptive tracking. In SOSP, 2005. Google Scholar
Digital Library
Index Terms
Instrumentation and sampling strategies for cooperative concurrency bug isolation
Recommendations
Instrumentation and sampling strategies for cooperative concurrency bug isolation
OOPSLA '10: Proceedings of the ACM international conference on Object oriented programming systems languages and applicationsFixing concurrency bugs (or "crugs") is critical in modern software systems. Static analyses to find crugs such as data races and atomicity violations scale poorly, while dynamic approaches incur high run-time overheads. Crugs manifest only under ...
Cooperative crug isolation
WODA '09: Proceedings of the Seventh International Workshop on Dynamic AnalysisWith the widespread deployment of multi-core hardware, writing concurrent programs has become inescapable. This has made fixing concurrency bugs (or crugs) critical in modern software systems. Static analysis techniques to find crugs such as data races ...
Isolating and understanding concurrency errors using reconstructed execution fragments
PLDI '11: Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and ImplementationIn this paper we propose Recon, a new general approach to concurrency debugging. Recon goes beyond just detecting bugs, it also presents to the programmer short fragments of buggy execution schedules that illustrate how and why bugs happened. These ...







Comments