research-article
Open Access

WATCHER: in-situ failure diagnosis

Published: 13 November 2020
Abstract

Diagnosing software failures is important but notoriously challenging. Existing work either requires extensive manual effort, imposes a serious privacy concern (for in-production systems), or cannot report sufficient information for bug fixes. This paper presents a novel diagnosis system, named WATCHER, that can pinpoint root causes of program failures within the failing process ("in-situ"), eliminating the privacy concern. It combines identical record-and-replay, binary analysis, dynamic analysis, and hardware support to perform the diagnosis without human involvement. It further proposes two optimizations to reduce the diagnosis time and to diagnose failures with control flow hijacks. WATCHER can be easily deployed, without requiring custom hardware, a custom operating system, program modification, or recompilation. We evaluate WATCHER with 24 program failures in real-world deployed software, including large-scale applications such as Memcached, SQLite, and OpenJPEG. Experimental results show that WATCHER can accurately identify the root causes in only a few seconds.

Supplemental Material

Auxiliary Presentation Video

This is a presentation video of my talk at OOPSLA 2020 on our paper accepted in the research track.

Published in

Proceedings of the ACM on Programming Languages, Volume 4, Issue OOPSLA
November 2020
3108 pages
EISSN: 2475-1421
DOI: 10.1145/3436718

Copyright © 2020 ACM

Publisher

Association for Computing Machinery
New York, NY, United States
