DCatch: Automatically Detecting Distributed Concurrency Bugs in Cloud Systems

Published: 04 April 2017

Abstract

In the era of big data and cloud computing, the reliability of distributed systems is extremely important. Unfortunately, distributed concurrency bugs, referred to as DCbugs, are widespread. They hide in the large state space of distributed cloud systems and manifest non-deterministically, depending on the timing of distributed computation and communication. Effective techniques for detecting DCbugs are therefore needed. This paper presents DCatch, a pilot solution for DCbug detection. DCatch predicts DCbugs by analyzing correct executions of distributed systems. To build DCatch, we design a set of happens-before rules that model a wide variety of communication and concurrency mechanisms in real-world distributed cloud systems. We then build runtime tracing and trace analysis tools to effectively identify concurrent conflicting memory accesses in these systems. Finally, we design tools to help prune false positives and trigger DCbugs. We have evaluated DCatch on four representative open-source distributed cloud systems: Cassandra, Hadoop MapReduce, HBase, and ZooKeeper. By monitoring correct executions of seven workloads on these systems, DCatch reports 32 DCbugs, 20 of which are truly harmful.
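
The idea of using happens-before rules to find concurrent conflicting accesses in a trace can be illustrated with a small sketch. This is not DCatch's implementation: the event format, the edge list, and the names `find_candidates`, `EVENTS`, and `HB_EDGES` are all hypothetical, and only three kinds of ordering are modeled (program order within a thread, message send to receive, and thread fork to the child's first event). Two accesses to the same variable are reported when at least one is a write and neither happens before the other.

```python
from collections import defaultdict
from itertools import combinations


def transitive_closure(edges, nodes):
    """Map each node to the set of nodes reachable from it via HB edges."""
    succ = defaultdict(set)
    for src, dst in edges:
        succ[src].add(dst)
    closure = {}
    for node in nodes:
        seen, stack = set(), [node]
        while stack:
            for nxt in succ[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        closure[node] = seen
    return closure


def find_candidates(events, hb_edges):
    """Return pairs of conflicting accesses not ordered by happens-before.

    events:   trace records in observed order (hypothetical format).
    hb_edges: explicit ordering edges, e.g. (send, recv) or (fork, child start).
    """
    edges = list(hb_edges)
    last_in_thread = {}
    for ev in events:  # program order: consecutive events of one thread
        prev = last_in_thread.get(ev["thread"])
        if prev is not None:
            edges.append((prev, ev["id"]))
        last_in_thread[ev["thread"]] = ev["id"]

    closure = transitive_closure(edges, [ev["id"] for ev in events])
    accesses = [ev for ev in events if ev["kind"] in ("read", "write")]

    candidates = []
    for a, b in combinations(accesses, 2):
        if a["var"] != b["var"]:
            continue  # different memory locations never conflict
        if a["kind"] == "read" and b["kind"] == "read":
            continue  # two reads do not conflict
        if b["id"] in closure[a["id"]] or a["id"] in closure[b["id"]]:
            continue  # ordered by happens-before, so not concurrent
        candidates.append((a["id"], b["id"]))
    return candidates


# Hypothetical trace: node A sends a message handled on node B, while a
# worker thread on B (forked by B's init thread) reads the same state.
EVENTS = [
    {"id": "a_send", "thread": "A.main", "kind": "send", "var": None},
    {"id": "b_init_w", "thread": "B.init", "kind": "write", "var": "B.table"},
    {"id": "b_fork", "thread": "B.init", "kind": "fork", "var": None},
    {"id": "b_worker_r", "thread": "B.worker", "kind": "read", "var": "B.table"},
    {"id": "b_recv", "thread": "B.handler", "kind": "recv", "var": None},
    {"id": "b_handler_w", "thread": "B.handler", "kind": "write", "var": "B.state"},
    {"id": "b_worker_r2", "thread": "B.worker", "kind": "read", "var": "B.state"},
]
HB_EDGES = [("a_send", "b_recv"), ("b_fork", "b_worker_r")]

print(find_candidates(EVENTS, HB_EDGES))
```

In this toy trace the write and read of `B.table` are ordered through the fork edge and are not reported, while the handler's write and the worker's read of `B.state` have no ordering path and are flagged as a candidate. A real tool would need many more rules and a far more scalable reachability check than this quadratic pairwise scan.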

Published in

ACM SIGPLAN Notices, Volume 52, Issue 4
ASPLOS '17
April 2017
811 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/3093336

ASPLOS '17: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems
April 2017
856 pages
ISBN: 9781450344654
DOI: 10.1145/3037697

Copyright © 2017 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Qualifiers

research-article
