Abstract
High availability is crucial for distributed systems, yet it is hard to achieve because component failures (i.e., faults) are common. Developers often cannot anticipate all the timing conditions and system states under which a fault might occur, and so introduce time-of-fault (TOF) bugs that manifest only when a node crashes or a message is dropped at a particular moment. Although challenging, detecting TOF bugs is fundamental to developing highly available distributed systems. Unlike previous work that relies on fault injection to expose TOF bugs, this paper models TOF bugs as a new type of concurrency bug and develops FCatch, which automatically predicts TOF bugs by observing correct execution. Evaluation on representative cloud systems shows that FCatch is effective, accurately finding severe TOF bugs.
FCatch: Automatically Detecting Time-of-fault Bugs in Cloud Systems
ASPLOS '18: Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems






