Abstract
Sequential and concurrency bugs are widespread in deployed software. They cause severe failures and huge financial losses during production runs, so tools that diagnose production-run failures with low overhead are needed. State-of-the-art diagnosis techniques use software instrumentation to sample program properties at run time and off-line statistical analysis to identify the properties most correlated with failures. Although promising, these techniques incur high run-time overhead for concurrency-bug failure diagnosis, sometimes over 100%, and hence are not suitable for production-run use.
We present PBI, a system that uses existing hardware performance counters to diagnose production-run failures caused by sequential and concurrency bugs with low overhead. PBI's design rests on three key observations. First, a few widely supported performance counter events can reflect a wide variety of common software bugs and can be monitored by hardware with almost no overhead. Second, the counter overflow interrupt supported by existing hardware and operating systems provides a natural and effective mechanism for event sampling at user level. Third, the noise and non-determinism in interrupt delivery complement statistical processing well.
We evaluate PBI using 13 real-world concurrency and sequential bugs from representative open-source server, client, and utility programs, and 10 bugs from a widely used software-testing benchmark. Quantitatively, PBI can effectively diagnose failures caused by these bugs with a small overhead that is never higher than 10%. Qualitatively, PBI does not require any change to software and presents a novel use of existing hardware performance counters.
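The off-line statistical step mentioned in the abstract can be illustrated with a minimal sketch. It is not PBI's actual statistical model, which the abstract does not specify. It assumes each run reports the set of sampled predicates it observed, represented here as hypothetical (program counter, event name) pairs, plus a pass/fail flag, and it ranks predicates by how much their presence raises the failure rate above the overall baseline. The function name, predicate encoding, and sample data are all illustrative assumptions.

```python
from collections import Counter


def rank_predicates(runs):
    """Rank sampled predicates by how strongly they correlate with failure.

    runs: iterable of (observed_predicates, failed) pairs, where
    observed_predicates is a set of hashable predicate identifiers.
    Score = P(fail | predicate observed) - P(fail) over all runs.
    This is a simplified stand-in, not the model used by PBI.
    """
    runs = list(runs)
    if not runs:
        return []
    baseline = sum(1 for _, failed in runs if failed) / len(runs)

    seen_in_all = Counter()
    seen_in_failed = Counter()
    for predicates, failed in runs:
        for p in set(predicates):
            seen_in_all[p] += 1
            if failed:
                seen_in_failed[p] += 1

    scores = {p: seen_in_failed[p] / seen_in_all[p] - baseline
              for p in seen_in_all}
    # Highest score first; ties broken by predicate name for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], str(kv[0])))


# Hypothetical sampled data: (pc, event) pairs are made up for illustration.
SUSPECT = ("0x4005d0", "event_a")
BENIGN = ("0x400610", "event_b")
example_runs = [
    ({SUSPECT}, True),
    ({SUSPECT, BENIGN}, True),
    ({BENIGN}, False),
    ({BENIGN}, False),
]
ranked = rank_predicates(example_runs)
for predicate, score in ranked:
    print(predicate, round(score, 3))
```

In this toy data set the predicate that appears only in failing runs ranks first, while the one seen mostly in passing runs gets a negative score. Because every run contributes only a sample of its events, a real deployment would need many runs before such scores become stable.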
Production-run software failure diagnosis via hardware performance counters
ASPLOS '13: Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems