
Predicting and preventing inconsistencies in deployed distributed systems

Published: 04 August 2010

Abstract

We propose a new approach for developing and deploying distributed systems, in which nodes predict distributed consequences of their actions and use this information to detect and avoid errors. Each node continuously runs a state exploration algorithm on a recent consistent snapshot of its neighborhood and predicts possible future violations of specified safety properties. We describe a new state exploration algorithm, consequence prediction, which explores causally related chains of events that lead to property violation.
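To make the idea of exploring states from a snapshot concrete, here is a minimal sketch in Python. It is a plain bounded breadth-first search, not the consequence prediction algorithm itself, which the abstract only names. The function `predict_violation`, the toy two-node lock system in `toy_events`, and the property `mutual_exclusion` are all hypothetical names introduced for illustration.

```python
from collections import deque
from typing import Callable, Hashable, Iterable, List, Optional, Tuple

State = Hashable
Event = str


def predict_violation(
    snapshot: State,
    enabled_events: Callable[[State], Iterable[Tuple[Event, State]]],
    is_safe: Callable[[State], bool],
    max_depth: int,
) -> Optional[List[Event]]:
    """Bounded breadth-first search starting at `snapshot`.

    Returns a shortest event sequence that reaches a state violating
    `is_safe`, or None if no violation is found within `max_depth` events.
    """
    if not is_safe(snapshot):
        return []
    visited = {snapshot}
    frontier = deque([(snapshot, [])])
    while frontier:
        state, path = frontier.popleft()
        if len(path) >= max_depth:
            continue  # depth bound reached; do not expand further
        for event, nxt in enabled_events(state):
            if nxt in visited:
                continue
            visited.add(nxt)
            new_path = path + [event]
            if not is_safe(nxt):
                return new_path
            frontier.append((nxt, new_path))
    return None


# Hypothetical toy system: nodes A and B each may hold a lock.
# The (deliberately buggy) acquire handler does not check the other node.
def toy_events(state):
    a, b = state
    if a:
        yield "A.release", (False, b)
    else:
        yield "A.acquire", (True, b)
    if b:
        yield "B.release", (a, False)
    else:
        yield "B.acquire", (a, True)


def mutual_exclusion(state):
    # Safety property: at most one node holds the lock.
    return not (state[0] and state[1])


trace = predict_violation((False, False), toy_events, mutual_exclusion, max_depth=2)
print(trace)
```

In this toy setting the search reports a two-event chain that ends with both nodes holding the lock. A real deployment would start from a snapshot of a node's neighborhood and use the system's own event handlers in place of `toy_events`.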

This article describes the design and implementation of this approach, termed CrystalBall. We evaluate CrystalBall on RandTree, BulletPrime, Paxos, and Chord distributed system implementations. We identified new bugs in mature Mace implementations of three systems. Furthermore, we show that if the bug is not corrected during system development, CrystalBall is effective in steering the execution away from inconsistent states at runtime.
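Steering execution away from inconsistent states can be pictured as a filter placed in front of event handlers. The sketch below is an assumption-laden illustration, not CrystalBall's mechanism: the names `steer`, `apply_event`, and `violates` are hypothetical, the state is the same kind of toy two-node lock, and the prediction step is reduced to checking the immediate successor state.

```python
from typing import Callable, Tuple

State = Tuple[bool, bool]  # toy state: (A holds lock, B holds lock)


def apply_event(state: State, event: str) -> State:
    """Toy handlers; the acquire handlers are buggy and never check the peer."""
    a, b = state
    if event == "A.acquire":
        return (True, b)
    if event == "A.release":
        return (False, b)
    if event == "B.acquire":
        return (a, True)
    if event == "B.release":
        return (a, False)
    raise ValueError(f"unknown event: {event}")


def violates(state: State) -> bool:
    # Inconsistent state: both nodes hold the lock at once.
    return state[0] and state[1]


def steer(
    state: State,
    event: str,
    predicts_violation: Callable[[State], bool],
) -> Tuple[State, bool]:
    """Handle `event` only if the resulting state is not predicted to be unsafe.

    Returns (new_state, handled). A filtered event leaves the state unchanged.
    """
    candidate = apply_event(state, event)
    if predicts_violation(candidate):
        return state, False  # event filtered; the buggy handler never runs
    return candidate, True


state = (True, False)                      # A already holds the lock
print(steer(state, "B.acquire", violates))  # filtered
print(steer(state, "A.release", violates))  # handled
```

Passing the predictor in as a parameter keeps the filter independent of how far ahead the prediction looks; a deeper search could be substituted for `violates` without changing `steer`.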

