research-article
Open Access

Testing consensus implementations using communication closure

Published: 13 November 2020

Abstract

Large-scale production distributed systems are difficult to design and test. Correctness must be ensured when processes run asynchronously, at arbitrary rates relative to each other, and in the presence of failures, e.g., process crashes or message losses. These conditions create a huge space of executions that is difficult to explore in a principled way. Current testing techniques focus on systematic or randomized exploration of all executions of an implementation while treating the implemented algorithms as black boxes. On the other hand, proofs of correctness of many of the underlying algorithms often exploit semantic properties that reduce reasoning about correctness to a subset of behaviors. For example, the communication-closure property, used in many proofs of distributed consensus algorithms, shows that every asynchronous execution of the algorithm is equivalent to a lossy synchronous execution, thus reducing the burden of proof to only that subset. In a lossy synchronous execution, processes execute in lock-step rounds, and messages are either received in the same round or lost forever; such executions form a small subset of all asynchronous ones.
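
To make the notion of a lossy synchronous execution concrete, the following is a minimal sketch rather than anything taken from the paper. It assumes a toy "flood-min" protocol in which every process broadcasts its value each round and keeps the minimum it receives. The names run_lossy_synchronous and dropped are hypothetical. A message is identified by (round, sender, receiver) and is either delivered in that same round or lost forever.

```python
from typing import List, Set, Tuple

# A message is identified by (round, sender, receiver). Hypothetical encoding.
Link = Tuple[int, int, int]


def run_lossy_synchronous(initial: List[int], rounds: int,
                          dropped: Set[Link]) -> List[int]:
    """Run a toy flood-min protocol in lock-step rounds.

    In each round every process broadcasts its current value. A message
    listed in `dropped` is lost forever; every other message is received
    in the same round in which it was sent.
    """
    state = list(initial)
    n = len(state)
    for r in range(rounds):
        # Send phase: snapshot the values broadcast in round r.
        sent = list(state)
        # Receive phase: each process keeps the minimum of what reached it.
        # A process always "hears" its own value (src == dst).
        state = [
            min(sent[src] for src in range(n)
                if src == dst or (r, src, dst) not in dropped)
            for dst in range(n)
        ]
    return state


if __name__ == "__main__":
    # No losses: one round suffices for all processes to hold the minimum.
    print(run_lossy_synchronous([3, 1, 2], 1, set()))
    # Losing the round-0 message from process 1 to process 0 leaves
    # process 0 with a different value after one round.
    print(run_lossy_synchronous([3, 1, 2], 1, {(0, 1, 0)}))
```

Because a lost message is never delivered late, the whole execution is described by the set of dropped (round, sender, receiver) triples, which is far smaller than the set of all asynchronous interleavings.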

We formulate the communication-closure hypothesis, which states that bugs in implementations of distributed consensus algorithms will already manifest in lossy synchronous executions, and present a testing algorithm based on this hypothesis. We prioritize the search space based on a bound on the number of failures in the execution and the rate at which these failures are recovered. We show that a random testing algorithm based on sampling lossy synchronous executions can empirically find a number of bugs (including previously unknown ones) in production distributed systems such as ZooKeeper, Cassandra, and Ratis, and also produce more understandable bug traces.

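
The idea of randomly sampling lossy synchronous executions under a failure bound can be sketched as follows. This is an illustrative assumption, not the paper's algorithm. It uses a toy flood-min protocol as the system under test, treats each dropped message as one failure, and ignores the recovery-rate parameter that the abstract mentions. All names (flood_min, sample_drops, find_disagreement, budget) are hypothetical.

```python
import random
from typing import List, Optional, Set, Tuple

Drop = Tuple[int, int, int]  # (round, sender, receiver), hypothetical encoding


def flood_min(initial: List[int], rounds: int, dropped: Set[Drop]) -> List[int]:
    """Toy system under test: broadcast values, keep the minimum received."""
    state = list(initial)
    n = len(state)
    for r in range(rounds):
        sent = list(state)
        state = [
            min(sent[s] for s in range(n) if s == d or (r, s, d) not in dropped)
            for d in range(n)
        ]
    return state


def sample_drops(rng: random.Random, n: int, rounds: int,
                 budget: int) -> Set[Drop]:
    """Sample a set of at most `budget` lost messages uniformly at random."""
    links = [(r, s, d) for r in range(rounds)
             for s in range(n) for d in range(n) if s != d]
    k = rng.randint(0, min(budget, len(links)))
    return set(rng.sample(links, k))


def find_disagreement(initial: List[int], rounds: int, budget: int,
                      trials: int, seed: int = 0) -> Optional[Set[Drop]]:
    """Return a drop set that breaks agreement, or None if none was sampled."""
    rng = random.Random(seed)
    for _ in range(trials):
        dropped = sample_drops(rng, len(initial), rounds, budget)
        if len(set(flood_min(initial, rounds, dropped))) > 1:
            return dropped  # Short trace: just the lost messages.
    return None


if __name__ == "__main__":
    print(find_disagreement([3, 1, 2], rounds=1, budget=2, trials=200))
```

A counterexample here is only the list of lost messages, which hints at why traces from lossy synchronous executions can be easier to read than arbitrary asynchronous interleavings.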

Supplemental Material

Auxiliary Presentation Video

This is a presentation video on our research paper "Testing Consensus Implementations using Communication Closure" at OOPSLA 2020. In this paper, we present a testing methodology that complements theoretical concepts from the distributed computing community with novel search prioritization and randomization techniques.
