Randomized testing of distributed systems with probabilistic guarantees

Published: 24 October 2018
Abstract

Several recently proposed randomized testing tools for concurrent and distributed systems come with theoretical guarantees on their success. The key to these guarantees is a notion of bug depth—the minimum length of a sequence of events sufficient to expose the bug—and a characterization of d-hitting families of schedules—a set of schedules guaranteed to cover every bug of given depth d. Previous results show that in certain cases the size of a d-hitting family can be significantly smaller than the total number of possible schedules. However, these results either assume shared-memory multithreading, or that the underlying partial ordering of events is known statically and has special structure. These assumptions are not met by distributed message-passing applications.
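To make the notion of a d-hitting family concrete, the following minimal sketch checks the property by brute force on a toy partial order. The two-process program, the event names, and the helper functions are hypothetical and chosen only for illustration. The sketch assumes the usual reading from the hitting-families literature: a schedule hits a d-tuple of events if it executes them in that order, and a family is d-hitting if every d-tuple that some schedule can hit is hit by a member of the family.

```python
from itertools import permutations

# Hypothetical toy program: two processes, each executing two events in order.
EVENTS = ["a1", "a2", "b1", "b2"]
ORDER = {("a1", "a2"), ("b1", "b2")}  # (x, y) means x must happen before y


def is_schedule(seq):
    """A schedule is an ordering of all events that respects ORDER."""
    pos = {e: i for i, e in enumerate(seq)}
    return all(pos[x] < pos[y] for x, y in ORDER)


def all_schedules():
    return [s for s in permutations(EVENTS) if is_schedule(s)]


def hits(schedule, tup):
    """True if the schedule executes the events of tup in the given order."""
    pos = {e: i for i, e in enumerate(schedule)}
    return all(pos[tup[i]] < pos[tup[i + 1]] for i in range(len(tup) - 1))


def is_d_hitting(family, d):
    """Brute-force check: every admissible d-tuple is hit by the family."""
    schedules = all_schedules()
    for tup in permutations(EVENTS, d):
        admissible = any(hits(s, tup) for s in schedules)
        if admissible and not any(hits(s, tup) for s in family):
            return False
    return True


family = [("a1", "a2", "b1", "b2"), ("b1", "b2", "a1", "a2")]
print(len(all_schedules()))     # 6
print(is_d_hitting(family, 2))  # True
print(is_d_hitting(family, 3))  # False
```

In this toy case two schedules out of six already cover every bug of depth 2, but they miss depth-3 orderings such as (a1, b1, a2). The exhaustive check is exponential and is meant only to illustrate the definition, not as a practical procedure.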

In this paper, we present a randomized scheduling algorithm for testing distributed systems. In contrast to previous approaches, our algorithm works for arbitrary partially ordered sets of events revealed online as the program is being executed. We show that for partial orders of width at most w and size at most n (both statically unknown), our algorithm is guaranteed to sample from at most $w^2 n^{d-1}$ schedules, for every fixed bug depth d. Thus, our algorithm discovers a bug of depth d with probability at least $1 / (w^2 n^{d-1})$. As a special case, our algorithm recovers a previous randomized testing algorithm for multithreaded programs. Our algorithm is simple to implement, but the correctness arguments depend on difficult combinatorial results about online dimension and online chain partitioning of partially ordered sets.
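As a rough illustration of what this bound means in practice, the sketch below evaluates the lower bound 1 / (w^2 * n^(d-1)) and estimates how many test runs would be needed to reach a target confidence. The function names and the example parameters are hypothetical. The run count additionally assumes that each run draws fresh, independent randomness, which goes beyond what the abstract states; under that assumption, k runs find the bug with probability at least 1 - (1 - p)^k.

```python
import math


def success_lower_bound(w, n, d):
    """Per-run lower bound 1 / (w^2 * n^(d-1)) on hitting a depth-d bug."""
    return 1.0 / (w ** 2 * n ** (d - 1))


def runs_for_confidence(w, n, d, confidence):
    """Smallest k with 1 - (1 - p)^k >= confidence, assuming independent runs."""
    p = success_lower_bound(w, n, d)
    if p >= 1.0:
        return 1
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))


# Hypothetical workload: width 4, at most 100 events, bug depth 2.
p = success_lower_bound(4, 100, 2)
k = runs_for_confidence(4, 100, 2, 0.95)
print(f"per-run bound: {p}, runs for 95% confidence: {k}")
```

Because the bound is a worst-case guarantee, the number of runs computed this way is an upper estimate; a bug may well be found much sooner.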

We have implemented our algorithm in a randomized testing tool for distributed message-passing programs. We show that our algorithm can find bugs in distributed systems such as Zookeeper and Cassandra, and empirically outperforms naive random exploration while providing theoretical guarantees.
