Abstract
Fault-tolerant distributed algorithms play an important role in many critical/high-availability applications. These algorithms are notoriously difficult to implement correctly, due to asynchronous communication and the occurrence of faults, such as the network dropping messages or computers crashing. We introduce PSync, a domain specific language based on the Heard-Of model, which views asynchronous faulty systems as synchronous ones with an adversarial environment that simulates asynchrony and faults by dropping messages. We define a runtime system for PSync that efficiently executes on asynchronous networks. We formalise the relation between the runtime system and PSync in terms of observational refinement. The high-level lockstep abstraction introduced by PSync simplifies the design and implementation of fault-tolerant distributed algorithms and enables automated formal verification. We have implemented an embedding of PSync in the Scala programming language with a runtime system for partially synchronous networks. We show the applicability of PSync by implementing several important fault-tolerant distributed algorithms and we compare the implementation of consensus algorithms in PSync against implementations in other languages in terms of code size, runtime efficiency, and verification.
- P. A. Abdulla, A. Collomb-Annichini, A. Bouajjani, and B. Jonsson. Using Forward Reachability Analysis for Verification of Lossy Channel Systems. FMSD, 25(1):39–65, 2004. Google Scholar
Digital Library
- Y. Afek and E. Gafni. Asynchrony from synchrony. In D. Frey, M. Raynal, S. Sarkar, R. K. Shyamasundar, and P. Sinha, editors, Distributed Computing and Networking, 14th International Conference, ICDCN 2013, Mumbai, India, January 3-6, 2013. Proceedings, pages 225–239, 2013.Google Scholar
- S. F. Allen, R. L. Constable, R. Eaton, C. Kreitz, and L. Lorigo. The nuprl open logical environment. In D. A. McAllester, editor, Automated Deduction - CADE-17, 17th International Conference on Automated Deduction, Pittsburgh, PA, USA, June 17-20, 2000, Proceedings, pages 170–176, 2000. Google Scholar
Digital Library
- P. Alvaro, T. Condie, N. Conway, J. M. Hellerstein, and R. Sears. I do declare: consensus in a logic language. Operating Systems Review, 43 (4):25–30, 2009. Google Scholar
Digital Library
- P. Alvaro, N. Conway, J. Hellerstein, and W. R. Marczak. Consistency analysis in bloom: a CALM and collected approach. In CIDR 2011, Fifth Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 9-12, 2011, Online Proceedings, pages 249–260, 2011.Google Scholar
- Y. Amir and J. Kirsch. Paxos for system builders: An overview. In Workshop on Large-Scale Distributed Systems and Middleware (LADIS 2008), Yorktown, NY, September 2008, 2008. Google Scholar
Digital Library
- K. R. Apt and D. Kozen. Limits for automatic verification of finitestate concurrent systems. Inf. Process. Lett., 22(6):307–309, 1986. Google Scholar
Digital Library
- J. L. Armstrong. The development of erlang. In S. L. P. Jones, M. Tofte, and A. M. Berman, editors, Proceedings of the 1997 ACM SIGPLAN International Conference on Functional Programming (ICFP ’97), Amsterdam, The Netherlands, June 9-11, 1997. Google Scholar
Digital Library
- , pages 196–203. ACM, 1997.Google Scholar
- M. P. Ashley-Rollman, P. Lee, S. C. Goldstein, P. Pillai, and J. Campbell. A language for large ensembles of independently executing nodes. In Logic Programming, 25th International Conference, ICLP 2009, Pasadena, CA, USA, July 14-17, 2009. Proceedings, pages 265– 280, 2009. Google Scholar
Digital Library
- S. Basu, T. Bultan, and M. Ouederni. Synchronizability for verification of asynchronously communicating systems. In V. Kuncak and A. Rybalchenko, editors, Verification, Model Checking, and Abstract Interpretation - 13th International Conference, VMCAI 2012, Philadelphia, PA, USA, January 22-24, 2012. Proceedings, pages 56–71, 2012. Google Scholar
Digital Library
- M. Ben-Or. Another advantage of free choice (extended abstract): Completely asynchronous agreement protocols. In PODC, pages 27– 30. ACM, 1983. Google Scholar
Digital Library
- M. Biely, B. Charron-Bost, A. Gaillard, M. Hutle, A. Schiper, and J. Widder. Tolerating corrupted communication. In PODC, pages 244–253, 2007. Google Scholar
Digital Library
- M. Biely, Z. Milosevic, N. Santos, and A. Schiper. S-paxos: Offloading the leader for high throughput state machine replication. In IEEE 31st Symposium on Reliable Distributed Systems, SRDS 2012, Irvine, CA, USA, October 8-11, 2012, pages 111–120, 2012. Google Scholar
Digital Library
- M. Biely, P. Delgado, Z. Milosevic, and A. Schiper. Distal: A framework for implementing fault-tolerant distributed algorithms. In 2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Budapest, Hungary, June 24-27, 2013, pages 1–8, 2013. Google Scholar
Digital Library
- D. Brand and P. Zafiropulo. On Communicating Finite-State Machines. J. ACM, 30(2):323–342, 1983. Google Scholar
Digital Library
- M. Burrows. The chubby lock service for loosely-coupled distributed systems. In OSDI, Berkeley, CA, USA, 2006. USENIX Association. ISBN 1-931971-47-1. Google Scholar
Digital Library
- C. Cachin, R. Guerraoui, and L. Rodrigues. Reliable broadcast. In Introduction to Reliable and Secure Distributed Programming, pages 73–135. Springer Berlin Heidelberg, 2011. ISBN 978-3-642-15259-7.Google Scholar
- M. Castro and B. Liskov. Practical byzantine fault tolerance. In M. I. Seltzer and P. J. Leach, editors, Proceedings of the Third USENIX Symposium on Operating Systems Design and Implementation (OSDI), New Orleans, Louisiana, USA, February 22-25, 1999, pages 173–186. USENIX Association, 1999.. Google Scholar
Digital Library
- T. D. Chandra and S. Toueg. Unreliable failure detectors for reliable distributed systems. J. ACM, 43(2):225–267, 1996. Google Scholar
Digital Library
- T. D. Chandra, R. Griesemer, and J. Redstone. Paxos made live: An engineering perspective. In Proceedings of the Twenty-sixth Annual ACM Symposium on Principles of Distributed Computing, PODC ’07, pages 398–407, New York, NY, USA, 2007. ACM. ISBN 978-1- 59593-616-5. Google Scholar
Digital Library
- M. Chaouch-Saad, B. Charron-Bost, and S. Merz. A reduction theorem for the verification of round-based distributed algorithms. In RP, volume 5797 of LNCS, pages 93–106, 2009. Google Scholar
Digital Library
- B. Charron-Bost and S. Merz. Formal verification of a consensus algorithm in the heard-of model. Int. J. Software and Informatics, 3(2-3):273–303, 2009.Google Scholar
- B. Charron-Bost and A. Schiper. The heard-of model: computing in distributed systems with benign faults. Distributed Computing, 22(1): 49–71, 2009.Google Scholar
Digital Library
- B. Charron-Bost, H. Debrat, and S. Merz. Formal verification of consensus algorithms tolerating malicious faults. In SSS, pages 120– 134. Springer, 2011. ISBN 978-3-642-24549-7. Google Scholar
Digital Library
- S. Chaudhuri. More choices allow more faults: Set consensus problems in totally asynchronous systems. Information and Computation, 105(1):132 – 158, 1993. ISSN 0890-5401. Google Scholar
Digital Library
- A. Clement, E. L. Wong, L. Alvisi, M. Dahlin, and M. Marchetti. Making byzantine fault tolerant systems tolerate byzantine faults. In J. Rexford and E. G. Sirer, editors, Proceedings of the 6th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2009, April 22-24, 2009, Boston, MA, USA, pages 153–168. USENIX Association, 2009. Google Scholar
Digital Library
- H. Debrat and S. Merz. Verifying fault-tolerant distributed algorithms in the heard-of model. Archive of Formal Proofs, 2012.Google Scholar
- A. Desai, P. Garg, and P. Madhusudan. Natural proofs for asynchronous programs using almost-synchronous reductions. In A. P. Black and T. D. Millstein, editors, Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications, OOPSLA 2014, part of SPLASH 2014, Portland, OR, USA, October 20-24, 2014, pages 709––725, 2014. Google Scholar
Digital Library
- Dialyzer. http://www.erlang.org/doc/man/dialyzer.html.Google Scholar
- D. Dolev and E. N. Hoch. Byzantine self-stabilizing pulse in a bounded-delay model. In T. Masuzawa and S. Tixeuil, editors, Stabilization, Safety, and Security of Distributed Systems, 9th International Symposium, SSS 2007, Paris, France, November 14-16, 2007, Proceedings, pages 234–252. Springer, 2007.. Google Scholar
Digital Library
- D. Dolev, C. Dwork, and L. Stockmeyer. On the minimal synchronism needed for distributed consensus. In Foundations of Computer Science, 1983, 24th Annual Symposium on, pages 393–402, 1983.. Google Scholar
Digital Library
- E. D’Osualdo, J. Kochems, and C. L. Ong. Automatic verification of erlang-style concurrency. In Static Analysis - 20th International Symposium, SAS 2013, Seattle, WA, USA, June 20-22, 2013. Proceedings, pages 454–476, 2013.Google Scholar
- C. Dragoi, T. A. Henzinger, H. Veith, J. Widder, and D. Zufferey. A logic-based framework for verifying consensus algorithms. In K. L. McMillan and X. Rival, editors, VMCAI, pages 161–181. Springer, 2014.Google Scholar
- C. Dwork, N. Lynch, and L. Stockmeyer. Consensus in the presence of partial synchrony. JACM, 35(2):288–323, Apr. 1988. Google Scholar
Digital Library
- T. Elrad and N. Francez. Decomposition of distributed programs into communication-closed layers. Sci. Comput. Program., 2(3):155–173, 1982.Google Scholar
Cross Ref
- J. M. Falerio, S. K. Rajamani, K. Rajan, G. Ramalingam, and K. Vaswani. Generalized lattice agreement. In ACM Symposium on Principles of Distributed Computing, PODC ’12, Funchal, Madeira, Portugal, July 16-18, 2012, pages 125–134, 2012. Google Scholar
Digital Library
- I. Filipovic, P. W. O’Hearn, N. Rinetzky, and H. Yang. Abstraction for concurrent objects. Theor. Comput. Sci., 411(51-52):4379–4398, 2010.. Google Scholar
Digital Library
- M. J. Fischer, N. A. Lynch, and M. S. Paterson. Impossibility of distributed consensus with one faulty process. J. ACM, 32(2):374– 382, Apr. 1985. Google Scholar
Digital Library
- L. Fredlund and H. Svensson. Mcerlang: a model checker for a distributed functional programming language. In Proceedings of the 12th ACM SIGPLAN International Conference on Functional Programming, ICFP 2007, Freiburg, Germany, October 1-3, 2007, pages 125– 136, 2007. Google Scholar
Digital Library
- E. Gafni. Round-by-round fault detectors: Unifying synchrony and asynchrony (extended abstract). In B. A. Coan and Y. Afek, editors, Proceedings of the Seventeenth Annual ACM Symposium on Principles of Distributed Computing, PODC ’98, Puerto Vallarta, Mexico, June 28 - July 2, 1998, pages 143–152, 1998. Google Scholar
Digital Library
- J. Gray. Notes on data base operating systems. In R. Bayer, R. Graham, and G. Seegmüller, editors, Operating Systems, volume 60 of Lecture Notes in Computer Science, pages 393–481. Springer Berlin Heidelberg, 1978. ISBN 978-3-540-08755-7. Google Scholar
Digital Library
- R. Guerraoui and L. Rodrigues. Introduction to Reliable Distributed Programming. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006. ISBN 3540288457. Google Scholar
Digital Library
- P. Haller and M. Odersky. Scala actors: Unifying thread-based and event-based programming. Theor. Comput. Sci., 410(2-3):202–220, 2009.. Google Scholar
Digital Library
- C. Hewitt, P. Bishop, and R. Steiger. A universal modular ACTOR formalism for artificial intelligence. In Proceedings of the 3rd International Joint Conference on Artificial Intelligence. Standford, CA, August 1973, pages 235–245, 1973. Google Scholar
Digital Library
- C. A. R. Hoare. Communicating sequential processes. Commun. ACM, 21(8):666–677, 1978. Google Scholar
Digital Library
- M. Hutle and A. Schiper. Communication predicates: A high-level abstraction for coping with transient and dynamic faults. In DSN, pages 92–101, 2007. Google Scholar
Digital Library
- A. John, I. Konnov, U. Schmid, H. Veith, and J. Widder. Parameterized model checking of fault-tolerant distributed algorithms by abstraction. In FMCAD, pages 201–209, 2013.Google Scholar
Cross Ref
- F. P. Junqueira, B. C. Reed, and M. Serafini. Zab: High-performance broadcast for primary-backup systems. In Proceedings of the 2011 IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2011, Hong Kong, China, June 27-30 2011, pages 245– 256. IEEE, 2011. Google Scholar
Digital Library
- C. E. Killian, J. W. Anderson, R. Braud, R. Jhala, and A. Vahdat. Mace: language support for building distributed systems. In SIGPLAN Conference on Programming Language Design and Implementation, pages 179–188, 2007. Google Scholar
Digital Library
- A. D. Kshemkalyani and M. Singhal. Distributed Computing: Principles, Algorithms, and Systems. Cambridge University Press, 2011. Google Scholar
Digital Library
- L. Lamport. The part-time parliament. ACM Trans. Comput. Syst., 16 (2):133–169, May 1998. ISSN 0734-2071. Google Scholar
Digital Library
- L. Lamport. Distributed algorithms in TLA (abstract). In G. Neiger, editor, Proceedings of the Nineteenth Annual ACM Symposium on Principles of Distributed Computing, July 16-19, 2000, Portland, Oregon, USA. ACM, 2000. Google Scholar
Digital Library
- L. Lamport. The pluscal algorithm language. In Theoretical Aspects of Computing - ICTAC 2009, 6th International Colloquium, Kuala Lumpur, Malaysia, August 16-20, 2009. Proceedings, pages 36–60, 2009. Google Scholar
Digital Library
- Y. A. Liu, S. D. Stoller, B. Lin, and M. Gorbovitski. From clarity to efficiency for distributed algorithms. In G. T. Leavens and M. B. Dwyer, editors, Proceedings of the 27th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2012, part of SPLASH 2012, Tucson, AZ, USA, October 21-25, 2012, pages 395–410, 2012. Google Scholar
Digital Library
- N. Lynch. Distributed Algorithms. Morgan Kaufman, 1996. Google Scholar
Digital Library
- N. A. Lynch and M. R. Tuttle. Hierarchical correctness proofs for distributed algorithms. In F. B. Schneider, editor, Proceedings of the Sixth Annual ACM Symposium on Principles of Distributed Computing, Vancouver, British Columbia, Canada, August 10-12, 1987, pages 137–151, 1987. Google Scholar
Digital Library
- Y. Mao, F. P. Junqueira, and K. Marzullo. Mencius: Building efficient replicated state machine for wans. In R. Draves and R. van Renesse, editors, 8th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2008, December 8-10, 2008, San Diego, California, USA, Proceedings, pages 369–384. USENIX Association, 2008. Google Scholar
Digital Library
- O. Mari´c, C. Sprenger, and D. Basin. Consensus refined. In J. Karlsson and Y. Amir, editors, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2015. Google Scholar
Digital Library
- V. R. Mark Bickford, Robert L. Constable. Logic of events, a framework to reason about distributed systems. In 2012 Languages for Distributed Algorithms Workshop, Philadelphia, PA, 2012.Google Scholar
- H. Miller, P. Haller, E. Burmako, and M. Odersky. Instant pickles: generating object-oriented pickler combinators for fast and extensible serialization. In A. L. Hosking, P. T. Eugster, and C. V. Lopes, editors, OOPSLA, pages 183–202, 2013. Google Scholar
Digital Library
- R. Milner, J. Parrow, and D. Walker. A calculus of mobile processes, I. Inf. Comput., 100(1):1–40, 1992. Google Scholar
Digital Library
- R. Milner, J. Parrow, and D. Walker. A calculus of mobile processes, II. Inf. Comput., 100(1):41–77, 1992. Google Scholar
Digital Library
- I. Moraru, D. G. Andersen, and M. Kaminsky. There is more consensus in egalitarian parliaments. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP ’13, pages 358–372, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2388- 8. Google Scholar
Digital Library
- L. Moura and N. Bjorner. Z3: An efficient SMT solver. In C. Ramakrishnan and J. Rehof, editors, TACAS, volume 4963 of Lecture Notes in Computer Science, pages 337–340. Springer Berlin Heidelberg, 2008. ISBN 978-3-540-78799-0. Google Scholar
Digital Library
- P. Parvedy, M. Raynal, and C. Travers. Early-stopping k-set agreement in synchronous systems prone to any number of process crashes. In V. Malyshkin, editor, Parallel Computing Technologies, volume 3606 of Lecture Notes in Computer Science, pages 49–58. Springer Berlin Heidelberg, 2005. ISBN 978-3-540-28126-9. Google Scholar
Digital Library
- N. Santoro and P. Widmayer. Agreement in synchronous networks with ubiquitous faults. Theor. Comput. Sci., 384(2-3):232–249, 2007. Google Scholar
Digital Library
- N. Schiper, V. Rahli, R. van Renesse, M. Bickford, and R. L. Constable. Developing correctly replicated databases using formal tools. In 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2014, Atlanta, GA, USA, June 23-26, 2014, pages 395–406, 2014. Google Scholar
Digital Library
- P. Schnoebelen. Revisiting Ackermann-Hardness for Lossy Counter Machines and Reset Petri Nets. In Mathematical Foundations of Computer Science 2010, 35th International Symposium, MFCS 2010, Brno, Czech Republic, August 23-27, 2010. Proceedings, pages 616– 628, 2010. Google Scholar
Digital Library
- The Netty project. http://netty.io/.Google Scholar
- S. Torstendahl. Open telecom platform. Ericsson Review, 1, 1997.Google Scholar
- T. Tsuchiya and A. Schiper. Model checking of consensus algorithms. In 26th IEEE Symposium on Reliable Distributed Systems (SRDS 2007), Beijing, China, October 10-12, 2007, pages 137–148. IEEE Computer Society, 2007. Google Scholar
Digital Library
- T. Tsuchiya and A. Schiper. Using bounded model checking to verify consensus algorithms. In G. Taubenfeld, editor, Distributed Computing, 22nd International Symposium, DISC 2008, Arcachon, France, September 22-24, 2008. Proceedings, pages 466–480, 2008. Google Scholar
Digital Library
- T. Tsuchiya and A. Schiper. Verification of consensus algorithms using satisfiability solving. Distributed Computing, 23(5-6):341–358, 2011.Google Scholar
Digital Library
- J. Widder and U. Schmid. The Theta-Model: Achieving synchrony without clocks. Distributed Computing, 22(1):29–47, Apr. 2009.Google Scholar
Digital Library
- J. R. Wilcox, D. Woos, P. Panchekha, Z. Tatlock, X. Wang, M. D. Ernst, and T. E. Anderson. Verdi: a framework for implementing and formally verifying distributed systems. In D. Grove and S. Blackburn, editors, Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation, Portland, OR, USA, June 15-17, 2015, pages 357–368. ACM, 2015. Google Scholar
Digital Library
Index Terms
PSync: a partially synchronous language for fault-tolerant distributed algorithms
Recommendations
Pretend synchrony: synchronous verification of asynchronous distributed programs
We present pretend synchrony, a new approach to verifying distributed systems, based on the observation that while distributed programs must execute asynchronously, we can often soundly treat them as if they were synchronous when verifying their ...
PSync: a partially synchronous language for fault-tolerant distributed algorithms
POPL '16: Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming LanguagesFault-tolerant distributed algorithms play an important role in many critical/high-availability applications. These algorithms are notoriously difficult to implement correctly, due to asynchronous communication and the occurrence of faults, such as the ...
Synchrony and time in fault-tolerant distributed algorithms
FORMATS'10: Proceedings of the 8th international conference on Formal modeling and analysis of timed systemsIn 1985, Fischer, Lynch and Paterson published their celebrated impossibility of solving distributed agreement (consensus) in purely asynchronous distributed systems with crash failures. Synchrony requirements, i.e., constraints on the occurrence of ...






Comments