TaxDC: A Taxonomy of Non-Deterministic Concurrency Bugs in Datacenter Distributed Systems

Published: 25 March 2016

Abstract

We present TaxDC, the largest and most comprehensive taxonomy of non-deterministic concurrency bugs in distributed systems. We study 104 distributed concurrency (DC) bugs from four widely deployed cloud-scale datacenter distributed systems: Cassandra, Hadoop MapReduce, HBase, and ZooKeeper. We study DC-bug characteristics along several axes of analysis, such as the triggering timing condition and input preconditions, error and failure symptoms, and fix strategies, collectively stored as 2,083 classification labels in the TaxDC database. We discuss how our study can open up many new research directions in combating DC bugs.

Published in

ACM SIGPLAN Notices, Volume 51, Issue 4 (ASPLOS '16), April 2016, 774 pages. ISSN: 0362-1340. EISSN: 1558-1160. DOI: 10.1145/2954679. Editor: Andy Gill.

ASPLOS '16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, March 2016, 824 pages. ISBN: 9781450340915. DOI: 10.1145/2872362. General Chair: Tom Conte. Program Chair: Yuanyuan Zhou.

Copyright © 2016 ACM

Publisher: Association for Computing Machinery, New York, NY, United States
