Abstract
Random testing has proven to be an effective way to catch bugs in distributed systems in the presence of network partition faults. This is surprising, as the space of potentially faulty executions is enormous, and the bugs depend on a subtle interplay between sequences of operations and faults.
We provide a theoretical justification of the effectiveness of random testing in this context. First, we give a general construction, based on the probabilistic method from combinatorics, showing that whenever a random test covers a fixed coverage goal with sufficiently high probability, a small randomly chosen set of tests achieves full coverage with high probability. In particular, we show that our construction can yield test sets exponentially smaller than those produced by systematic enumeration. Second, based on an empirical study of many bugs found by random testing in production distributed systems, we introduce notions of test coverage relating to network partition faults that are effective in finding bugs. Finally, we show using combinatorial arguments that, for these notions of test coverage, we can derive a lower bound on the probability that a random test covers a given goal. Our general construction then explains why random testing tools achieve good coverage, and hence find bugs, quickly.
While we formulate our results in terms of network partition faults, our construction provides a step towards a rigorous analysis of random testing algorithms and may be applicable in other scenarios.
Index Terms
Why is random testing effective for partition tolerance bugs?