Abstract
Large-scale production distributed systems are difficult to design and test. Correctness must be ensured when processes run asynchronously, at arbitrary rates relative to each other, and in the presence of failures such as process crashes or message losses. These conditions create a huge space of executions that is difficult to explore in a principled way. Current testing techniques focus on systematic or randomized exploration of all executions of an implementation while treating the implemented algorithms as black boxes. On the other hand, proofs of correctness of many of the underlying algorithms often exploit semantic properties that reduce reasoning about correctness to a subset of behaviors. For example, the communication-closure property, used in many proofs of distributed consensus algorithms, shows that every asynchronous execution of the algorithm is equivalent to a lossy synchronous execution, thus reducing the burden of proof to only that subset. In a lossy synchronous execution, processes execute in lock-step rounds, and messages are either received in the same round or lost forever; such executions form a small subset of all asynchronous ones.
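To make the notion of a lossy synchronous execution concrete, the following is a minimal sketch, not the paper's implementation. It assumes a toy flood-min algorithm (each process repeatedly adopts the smallest value it has heard), and the names `DropSchedule` and `run_lossy_synchronous` are hypothetical. A drop schedule lists, for each round, the sender-receiver links whose message is lost; every other message is delivered within the same round.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical representation: one set of dropped (sender, receiver) links per round.
DropSchedule = List[Set[Tuple[int, int]]]


def run_lossy_synchronous(initial: List[int], drops: DropSchedule) -> List[int]:
    """Run a toy flood-min algorithm in lock-step rounds under a drop schedule."""
    state = list(initial)
    n = len(state)
    for dropped in drops:
        # Send phase: every process broadcasts its current value to all others.
        inbox: Dict[int, List[int]] = {p: [] for p in range(n)}
        for sender in range(n):
            for receiver in range(n):
                if sender != receiver and (sender, receiver) not in dropped:
                    inbox[receiver].append(state[sender])
        # Receive phase: a message arrives in this round or is lost forever.
        state = [min([state[p]] + inbox[p]) for p in range(n)]
    return state


if __name__ == "__main__":
    # One round in which the message from process 1 to process 0 is lost.
    print(run_lossy_synchronous([3, 1, 2], [{(1, 0)}]))
```

Because an execution is fully determined by the initial values and the drop schedule, the space to explore is a set of per-round link subsets rather than all asynchronous interleavings.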
We formulate the communication-closure hypothesis, which states that bugs in implementations of distributed consensus algorithms will already manifest in lossy synchronous executions, and present a testing algorithm based on this hypothesis. We prioritize the search space based on a bound on the number of failures in the execution and the rate at which these failures are recovered. We show empirically that a random testing algorithm based on sampling lossy synchronous executions finds a number of bugs, including previously unknown ones, in production distributed systems such as ZooKeeper, Cassandra, and Ratis, and also produces more understandable bug traces.
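The idea of sampling lossy synchronous executions under a failure bound can be sketched as follows. This is an illustrative simplification under stated assumptions, not the algorithm from the paper: it bounds only the total number of dropped messages, does not model the rate at which failures are recovered, and the names `sample_schedule` and `random_test` are hypothetical. The system under test is abstracted as a callable `run` that executes one schedule, and `check` is the property to validate on its result.

```python
import random
from typing import Any, Callable, List, Optional, Set, Tuple

Link = Tuple[int, int]
Schedule = List[Set[Link]]  # dropped (sender, receiver) links, one set per round


def sample_schedule(n: int, rounds: int, max_drops: int,
                    rng: random.Random) -> Schedule:
    """Sample a drop schedule with at most max_drops lost messages in total."""
    slots = [(r, s, d) for r in range(rounds)
             for s in range(n) for d in range(n) if s != d]
    k = rng.randint(0, min(max_drops, len(slots)))
    schedule: Schedule = [set() for _ in range(rounds)]
    for r, s, d in rng.sample(slots, k):
        schedule[r].add((s, d))
    return schedule


def random_test(run: Callable[[Schedule], Any],
                check: Callable[[Any], bool],
                n: int, rounds: int, max_drops: int,
                trials: int, seed: int = 0) -> Optional[Schedule]:
    """Return the first sampled schedule whose result violates check, or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        schedule = sample_schedule(n, rounds, max_drops, rng)
        if not check(run(schedule)):
            return schedule  # this schedule doubles as a round-by-round bug trace
    return None
```

A violating schedule is a list of rounds with the links dropped in each, which is one reason traces from this style of testing are comparatively easy to read.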
Testing consensus implementations using communication closure