research-article
Open Access

Why is random testing effective for partition tolerance bugs?

Published: 27 December 2017

Abstract

Random testing has proven to be an effective way to catch bugs in distributed systems in the presence of network partition faults. This is surprising, as the space of potentially faulty executions is enormous, and the bugs depend on a subtle interplay between sequences of operations and faults.

We provide a theoretical justification of the effectiveness of random testing in this context. First, using the probabilistic method from combinatorics, we give a general construction showing that whenever a random test covers a fixed coverage goal with sufficiently high probability, a small randomly chosen set of tests achieves full coverage with high probability. In particular, the construction can yield test sets exponentially smaller than those obtained by systematic enumeration. Second, based on an empirical study of many bugs found by random testing in production distributed systems, we introduce notions of test coverage for network partition faults that are effective in finding bugs. Finally, we show using combinatorial arguments that, for these notions of coverage, the probability that a random test covers a given goal admits a lower bound. Our general construction then explains why random testing tools achieve good coverage, and hence find bugs, quickly.

While we formulate our results in terms of network partition faults, our construction provides a step towards a rigorous analysis of random testing algorithms and may be applicable in other scenarios.
