research-article
Open Access

Trace aware random testing for distributed systems

Published: 10 October 2019

Abstract

Distributed and concurrent applications often have subtle bugs that only get exposed under specific schedules. While these schedules may be found by systematic model checking techniques, in practice, model checkers do not scale to large systems. On the other hand, naive random exploration techniques often require a very large number of runs to find the specific interactions needed to expose a bug. In recent years, several random testing algorithms have been proposed that, on the one hand, exploit state-space reduction strategies from model checking and, on the other, provide guarantees on the probability of hitting bugs of certain kinds.

These existing techniques exploit two orthogonal strategies to reduce the state space: partial-order reduction and bug depth. Testing algorithms based on partial-order techniques, such as RAPOS or POS, ensure non-redundant exploration of independent interleavings among system events by imposing an equivalence relation on schedules and ideally exploring only one schedule from each equivalence class. Techniques based on bug depth, such as PCT, exploit the empirical observation that many bugs are exposed by the clever scheduling of a small number of key events. They bias the sample space of schedules to cover only the executions of small depth, rather than the much larger space of all schedules. At this point, there is no random testing algorithm that combines the power of both approaches.
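The equivalence relation on schedules mentioned above can be illustrated with a minimal sketch. It assumes each event is a uniquely identified `(receiver, message)` pair and uses a hypothetical dependency relation, `same_receiver`, under which two events are dependent when they are delivered to the same node. Neither the event encoding nor the relation is taken from the paper. Under these assumptions, two schedules are equivalent exactly when they contain the same events and order every dependent pair the same way.

```python
from itertools import combinations, permutations


def same_receiver(a, b):
    """Hypothetical dependency relation: events at the same node are dependent."""
    return a[0] == b[0]


def equivalent(s1, s2, dependent):
    """Return True if schedules s1 and s2 order all dependent events identically.

    Assumes every event occurs at most once in a schedule.
    """
    if sorted(s1) != sorted(s2):
        return False
    pos2 = {event: i for i, event in enumerate(s2)}
    # combinations() yields pairs (a, b) with a before b in s1.
    for a, b in combinations(s1, 2):
        if dependent(a, b) and pos2[a] > pos2[b]:
            return False
    return True


def count_classes(events, dependent):
    """Count equivalence classes among all orderings of the given events."""
    representatives = []
    for perm in permutations(events):
        if not any(equivalent(perm, rep, dependent) for rep in representatives):
            representatives.append(perm)
    return len(representatives)


s1 = [("A", "m1"), ("B", "m2"), ("A", "m3")]
s2 = [("B", "m2"), ("A", "m1"), ("A", "m3")]  # only moves the independent event
s3 = [("A", "m3"), ("B", "m2"), ("A", "m1")]  # swaps two dependent events

print(equivalent(s1, s2, same_receiver))  # True
print(equivalent(s1, s3, same_receiver))  # False
print(count_classes(s1, same_receiver))   # 2 classes among 6 orderings
```

In this toy example, a tester that explores one schedule per class needs 2 runs instead of 6. The brute-force enumeration is for illustration only and does not scale beyond a handful of events.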

In this paper, we provide such an algorithm. Our algorithm, trace-aware PCT (taPCTCP), extends and unifies several different algorithms in the random testing literature. It samples the space of low-depth executions by constructing a schedule online, while taking dependencies among events into account. Moreover, the algorithm comes with a theoretical guarantee on the probability of sampling a trace of low depth: the lower bound on this probability decreases exponentially with the depth but only polynomially with the number of racy events explored. We further show that the guarantee is optimal among a large class of techniques.
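For orientation, the following is a simplified sketch of the classic PCT scheduler (Burckhardt et al., 2010) that the algorithm above builds on. It is not taPCTCP and does not track dependencies among events. The thread model (lists of operation labels, every unfinished thread always enabled) and the function name `pct_schedule` are assumptions made for illustration. For n threads, k steps, and depth d, PCT is known to hit a bug of depth d with probability at least 1/(n·k^(d-1)).

```python
import random


def pct_schedule(threads, depth, rng):
    """Build one schedule online using PCT-style priorities (simplified sketch).

    threads: list of non-empty lists of operation labels, one list per thread.
    depth:   bug depth d >= 1; the scheduler uses d - 1 priority change points.
    rng:     a random.Random instance, passed in for reproducibility.
    """
    n = len(threads)
    k = sum(len(ops) for ops in threads)

    # Initial priorities are a random permutation of d, d + 1, ..., d + n - 1.
    priorities = list(range(depth, depth + n))
    rng.shuffle(priorities)

    # Each of the d - 1 change points is a step number drawn uniformly from 1..k.
    change_points = [rng.randint(1, k) for _ in range(depth - 1)]

    next_op = [0] * n
    schedule = []
    for step in range(1, k + 1):
        enabled = [t for t in range(n) if next_op[t] < len(threads[t])]
        # Always run the enabled thread with the highest priority.
        t = max(enabled, key=lambda i: priorities[i])
        schedule.append((t, threads[t][next_op[t]]))
        next_op[t] += 1
        # At the i-th change point, drop the running thread's priority to i,
        # which is below every initial priority.
        for i, cp in enumerate(change_points, start=1):
            if cp == step:
                priorities[t] = i
    return schedule


threads = [["a1", "a2"], ["b1", "b2", "b3"]]
print(pct_schedule(threads, depth=2, rng=random.Random(0)))
```

With `depth=1` there are no change points, so the threads simply run to completion one after another in priority order. Each additional unit of depth adds one random point at which the running thread can be preempted.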

We empirically compare our algorithm with several state-of-the-art random testing approaches for concurrent software on two large-scale distributed systems, Zookeeper and Cassandra, and show that our approach is effective in uncovering subtle bugs and usually outperforms related random testing algorithms.

Supplemental Material

a180-ozcan

Presentation at OOPSLA '19
