Abstract
Several recently proposed randomized testing tools for concurrent and distributed systems come with theoretical guarantees on their success. The key to these guarantees is a notion of bug depth—the minimum length of a sequence of events sufficient to expose the bug—and a characterization of d-hitting families of schedules—a set of schedules guaranteed to cover every bug of given depth d. Previous results show that in certain cases the size of a d-hitting family can be significantly smaller than the total number of possible schedules. However, these results either assume shared-memory multithreading, or that the underlying partial ordering of events is known statically and has special structure. These assumptions are not met by distributed message-passing applications.
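To make the notion of a d-hitting family concrete, the following minimal sketch checks the property by brute force on a toy example. It rests on assumptions not stated in the abstract: the events are mutually unordered (so every ordering is a valid schedule), and a schedule "hits" a d-tuple of events when those events occur in it in the tuple's relative order. The names `hits` and `is_d_hitting` are hypothetical helpers, not part of any tool described here.

```python
from itertools import permutations


def hits(schedule, events):
    """Return True if `events` occur in `schedule` in the given relative order."""
    positions = [schedule.index(e) for e in events]
    return all(p < q for p, q in zip(positions, positions[1:]))


def is_d_hitting(family, all_events, d):
    """Brute-force check: every ordered d-tuple of distinct events is hit
    by at least one schedule in `family`.

    Assumption: the events are pairwise unordered, so every d-tuple is
    admissible. A real partial order would rule some tuples out.
    """
    return all(
        any(hits(schedule, target) for schedule in family)
        for target in permutations(all_events, d)
    )


events = ("a", "b", "c")
# Two of the six possible schedules: one ordering and its reverse.
family = [("a", "b", "c"), ("c", "b", "a")]

print(is_d_hitting(family, events, 2))  # every ordered pair is covered
print(is_d_hitting(family, events, 3))  # e.g. (b, a, c) is not covered
```

In this toy case two schedules out of six already cover every ordered pair of events, which illustrates how a hitting family can be much smaller than the full set of schedules.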
In this paper, we present a randomized scheduling algorithm for testing distributed systems. In contrast to previous approaches, our algorithm works for arbitrary partially ordered sets of events revealed online as the program is being executed. We show that for partial orders of width at most $w$ and size at most $n$ (both statically unknown), our algorithm is guaranteed to sample from at most $w^2 n^{d-1}$ schedules, for every fixed bug depth $d$. Thus, our algorithm discovers a bug of depth $d$ with probability at least $1 / (w^2 n^{d-1})$. As a special case, our algorithm recovers a previous randomized testing algorithm for multithreaded programs. Our algorithm is simple to implement, but the correctness arguments depend on difficult combinatorial results about online dimension and online chain partitioning of partially ordered sets.
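As a rough illustration of what the stated bound means in practice, the sketch below computes the bound on the number of schedules (w squared times n to the power d-1) and estimates how many test runs would be needed to hit a depth-d bug with a chosen confidence. The run-count calculation assumes that runs are independent and that each run succeeds with exactly the lower-bound probability; both are assumptions made here for illustration, and the function names are hypothetical.

```python
import math


def schedule_bound(w, n, d):
    """Upper bound w^2 * n^(d-1) on the number of schedules sampled from,
    for width at most w, size at most n, and bug depth d."""
    return w ** 2 * n ** (d - 1)


def runs_needed(w, n, d, confidence):
    """Smallest k with 1 - (1 - p)^k >= confidence, where p = 1 / bound.

    Assumption: runs are independent and each finds the bug with
    probability exactly p (the abstract only guarantees at least p).
    """
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    p = 1 / schedule_bound(w, n, d)
    if p >= 1:
        return 1
    return math.ceil(math.log(1 - confidence) / math.log(1 - p))


# Hypothetical parameters: width 3, at most 10 events, bug depth 2.
print(schedule_bound(3, 10, 2))     # 3^2 * 10^1 = 90
print(runs_needed(3, 10, 2, 0.95))  # runs for 95% confidence under the assumptions
```

Because the guarantee is a lower bound on the per-run success probability, the run count computed this way is a conservative estimate.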
We have implemented our algorithm in a randomized testing tool for distributed message-passing programs. We show that our algorithm can find bugs in distributed systems such as Zookeeper and Cassandra, and empirically outperforms naive random exploration while providing theoretical guarantees.
Index Terms
Randomized testing of distributed systems with probabilistic guarantees