Abstract
We propose a new approach for developing and deploying distributed systems, in which nodes predict the distributed consequences of their actions and use this information to detect and avoid errors. Each node continuously runs a state exploration algorithm on a recent consistent snapshot of its neighborhood and predicts possible future violations of specified safety properties. We describe a new state exploration algorithm, consequence prediction, which explores causally related chains of events that lead to property violations.
This article describes the design and implementation of this approach, termed CrystalBall. We evaluate CrystalBall on implementations of the RandTree, BulletPrime, Paxos, and Chord distributed systems. We identified new bugs in mature Mace implementations of three systems. Furthermore, we show that if a bug is not corrected during system development, CrystalBall is effective in steering the execution away from inconsistent states at runtime.
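The idea of exploring forward from a snapshot and steering away from predicted violations can be illustrated with a minimal sketch. It is not CrystalBall's consequence prediction algorithm: it uses a plain bounded breadth-first search, whereas consequence prediction follows only causally related chains of events. The toy system (a root node that must never hold more than two children, with a handler that accepts joins without checking capacity) and every name in the code (`explore`, `should_block`, `MAX_CHILDREN`, and so on) are assumptions made for illustration.

```python
from collections import deque

MAX_CHILDREN = 2  # hypothetical safety bound for the toy system


def enabled_events(state):
    """Events that can fire in a state: delivery of each pending join."""
    _children, pending = state
    return [("join", node) for node in sorted(pending)]


def apply_event(state, event):
    """Toy (buggy) handler: accepts a join without checking capacity."""
    children, pending = state
    _kind, node = event
    return (children | {node}, pending - {node})


def is_safe(state):
    """Safety property: the root never has more than MAX_CHILDREN children."""
    children, _pending = state
    return len(children) <= MAX_CHILDREN


def explore(snapshot, max_depth):
    """Bounded BFS from a snapshot.

    Returns the first event sequence found that reaches an unsafe state,
    or None if no violation is reachable within max_depth events.
    """
    frontier = deque([(snapshot, ())])
    seen = {snapshot}
    while frontier:
        state, path = frontier.popleft()
        if not is_safe(state):
            return path
        if len(path) == max_depth:
            continue
        for event in enabled_events(state):
            nxt = apply_event(state, event)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + (event,)))
    return None


def should_block(snapshot, event, max_depth):
    """Steering sketch: block an event if firing it leads to a predicted
    violation within the remaining exploration depth."""
    after = apply_event(snapshot, event)
    return explore(after, max_depth - 1) is not None


# Snapshot: the root already has two children and one join is in flight.
full = (frozenset({"a", "b"}), frozenset({"c"}))
print(explore(full, max_depth=3))
print(should_block(full, ("join", "c"), max_depth=3))
```

In this sketch the snapshot is a pair of frozen sets so that states are hashable and can be deduplicated during the search. A real deployment would also have to gather a consistent snapshot from neighboring nodes, which the sketch takes as given.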