FCatch: Automatically Detecting Time-of-fault Bugs in Cloud Systems

Published: 19 March 2018

Abstract

It is crucial for distributed systems to achieve high availability. Unfortunately, this is challenging because component failures (i.e., faults) are common. Developers often cannot anticipate all the timing conditions and system states under which a fault might occur, and so introduce time-of-fault (TOF) bugs that manifest only when a node crashes or a message drops at a special moment. Although challenging, detecting TOF bugs is fundamental to developing highly available distributed systems. Unlike previous work that relies on fault injection to expose TOF bugs, this paper carefully models TOF bugs as a new type of concurrency bug and develops FCatch to automatically predict TOF bugs by observing correct execution. Evaluation on representative cloud systems shows that FCatch is effective, accurately finding severe TOF bugs.

Published in

ACM SIGPLAN Notices, Volume 53, Issue 2 (ASPLOS '18), February 2018, 809 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/3296957

ASPLOS '18: Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, March 2018, 827 pages
ISBN: 9781450349116
DOI: 10.1145/3173162

Copyright © 2018 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Qualifiers: research-article