Research article

Production-run software failure diagnosis via hardware performance counters

Published: 16 March 2013

Abstract

Sequential and concurrency bugs are widespread in deployed software. They cause severe failures and huge financial losses during production runs. Tools that diagnose production-run failures with low overhead are therefore needed. State-of-the-art diagnosis techniques use software instrumentation to sample program properties at run time, and off-line statistical analysis to identify the properties most correlated with failures. Although promising, these techniques suffer from high run-time overhead, sometimes over 100% for concurrency-bug failure diagnosis, and hence are not suitable for production-run usage.
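The off-line statistical step can be illustrated with a toy sketch (hypothetical data and a simplified score, not PBI's actual algorithm): given which sampled predicates each run observed and whether the run failed, rank predicates by how much observing them raises the failure probability, in the spirit of statistical bug isolation.

```python
# Toy statistical failure diagnosis: rank sampled program predicates
# by how strongly observing them correlates with run failure.
# Hypothetical data; this illustrates the idea, not PBI's exact scoring.

def rank_predicates(runs):
    """runs: list of (observed_predicates: set, failed: bool)."""
    predicates = set().union(*(obs for obs, _ in runs))
    total_failed = sum(failed for _, failed in runs)
    fail_overall = total_failed / len(runs)
    scores = {}
    for p in predicates:
        observed = [(obs, failed) for obs, failed in runs if p in obs]
        fail_given_p = sum(f for _, f in observed) / len(observed)
        # "Increase": how much observing p raises the failure probability.
        scores[p] = fail_given_p - fail_overall
    return sorted(scores.items(), key=lambda kv: -kv[1])

runs = [
    ({"A", "B"}, True),   # failing runs observe predicate B
    ({"B"},      True),
    ({"A"},      False),  # successful runs mostly observe only A
    ({"A"},      False),
    ({"A", "B"}, False),
]
ranking = rank_predicates(runs)
print(ranking[0][0])  # B is the most failure-correlated predicate
```

With these made-up runs, predicate B's failure rate (2/3) exceeds the overall rate (2/5), so B tops the ranking while A scores below zero.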

We present PBI, a system that uses existing hardware performance counters to diagnose production-run failures caused by sequential and concurrency bugs with low overhead. PBI's design is based on several key observations. First, a few widely supported performance-counter events can reflect a wide variety of common software bugs and can be monitored by hardware with almost no overhead. Second, the counter-overflow interrupt supported by existing hardware and operating systems provides a natural and effective mechanism for event sampling at user level. Third, the noise and non-determinism in interrupt delivery are well suited to statistical processing.
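The third observation can be illustrated with a toy simulation (hypothetical event names, addresses, and probabilities; real counter data and PBI's mechanism differ): even when each run's overflow-interrupt samples are sparse and the interrupt "skids" past the triggering instruction, aggregating samples across many runs still concentrates the anomalous event at the buggy instruction in failing runs only.

```python
import random

random.seed(0)

# Toy model of sparse, skid-prone counter sampling. In failing runs,
# instruction 0x42 triggers a remote-cache event (think of a remote
# invalidation hitting an atomicity-violating access); successful runs
# never do. Each sample is delivered with probability SAMPLE_P and may
# skid to the next instruction address.
SAMPLE_P = 0.3

def one_run(failing):
    samples = []
    for insn in range(0x40, 0x48):
        event = "REMOTE_HITM" if (failing and insn == 0x42) else "LOCAL_HIT"
        if random.random() < SAMPLE_P:          # sparse sampling
            skid = random.choice([0, 0, 0, 1])  # occasional skid
            samples.append((insn + skid, event))
    return samples

def aggregate(n_runs, failing):
    counts = {}
    for _ in range(n_runs):
        for insn, event in one_run(failing):
            counts[(insn, event)] = counts.get((insn, event), 0) + 1
    return counts

fail_counts = aggregate(200, failing=True)
succ_counts = aggregate(200, failing=False)  # for contrast: no anomalous event
# Despite per-run sparsity and skid, the anomalous event piles up at the
# buggy instruction when aggregated over failing runs.
suspect = max(fail_counts,
              key=lambda k: fail_counts[k] if k[1] == "REMOTE_HITM" else 0)
print(hex(suspect[0]))  # almost surely 0x42, the buggy access
```

The per-run noise (drops and skid) washes out in aggregate, which is why sampling noise complements rather than hinders statistical post-processing.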

We evaluate PBI using 13 real-world concurrency and sequential bugs from representative open-source server, client, and utility programs, and 10 bugs from a widely used software-testing benchmark. Quantitatively, PBI effectively diagnoses failures caused by these bugs with overhead that never exceeds 10%. Qualitatively, PBI requires no change to the software and presents a novel use of existing hardware performance counters.



Published in: ACM SIGPLAN Notices, Volume 48, Issue 4 (ASPLOS '13), April 2013, 540 pages. ISSN: 0362-1340. EISSN: 1558-1160. DOI: 10.1145/2499368.

Also in: ASPLOS '13: Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems, March 2013, 574 pages. ISBN: 9781450318709. DOI: 10.1145/2451116.

Copyright © 2013 ACM. Publisher: Association for Computing Machinery, New York, NY, United States.
