Abstract
Diagnosing software failures is important but notoriously challenging. Existing work either requires extensive manual effort, imposes a serious privacy concern (for in-production systems), or cannot report sufficient information for bug fixes. This paper presents a novel diagnosis system, named WATCHER, that can pinpoint the root causes of program failures within the failing process ("in-situ"), eliminating the privacy concern. It combines identical record-and-replay, binary analysis, dynamic analysis, and hardware support to perform the diagnosis without human involvement. It further proposes two optimizations that reduce diagnosis time and enable the diagnosis of failures involving control-flow hijacks. WATCHER can be easily deployed: it requires no custom hardware or operating system, no program modification, and no recompilation. We evaluate WATCHER on 24 program failures in real-world deployed software, including large-scale applications such as Memcached, SQLite, and OpenJPEG. Experimental results show that WATCHER can accurately identify the root causes in only a few seconds.
Index Terms
WATCHER: in-situ failure diagnosis