Abstract
Nondeterminism complicates the development and management of distributed systems, and arises from two main sources: the local behavior of each individual node as well as the behavior of the network connecting them. Taming nondeterminism effectively requires dealing with both sources.
This paper proposes DDOS, a system that leverages prior work on deterministic multithreading to offer: 1) space-efficient record/replay of distributed systems; and 2) fully deterministic distributed behavior. Because each node behaves deterministically, its outgoing messages are strictly a function of its explicit inputs. This allows us to record the system by logging only each message's arrival time, not its contents. Going further, we propose and implement an algorithm that makes all communication between nodes deterministic by scheduling communication onto a global logical timeline.
We implement both algorithms in DDOS and evaluate the system with parallel scientific applications, an HTTP/memcached system, and a distributed microbenchmark with a high volume of peer-to-peer communication. Our results show a reduction of up to two orders of magnitude in record/replay log size, and that distributed systems can be made deterministic with an order of magnitude of overhead.
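The record/replay idea in the abstract can be illustrated with a toy sketch. This is not the DDOS implementation: the `Node` class, the `run` function, the ping-pong topology, and the random delay model are all assumptions made for illustration. The sketch shows only that when each node is deterministic, logging message arrival times (plus the initial external input) is enough to regenerate every message's contents on replay.

```python
import random


class Node:
    """Hypothetical deterministic node: its output depends only on the
    sequence of (arrival_time, payload) pairs it has received."""

    def __init__(self):
        self.state = 0

    def on_message(self, arrival_time, payload):
        # Deterministic state update; the return value is the outgoing message.
        self.state = (31 * self.state + payload + arrival_time) % 1_000_003
        return self.state


def run(first_payload, rounds, arrival_times=None, rng=None):
    """Ping-pong between two nodes.

    Record mode (arrival_times is None): network delays are drawn from rng,
    and only the resulting arrival times are logged.
    Replay mode: arrival times come from the log, and message contents are
    regenerated by re-executing the deterministic nodes.
    """
    nodes = [Node(), Node()]
    log = []    # arrival times only, never message contents
    trace = []  # message contents, kept here only to compare the two runs
    payload, clock = first_payload, 0
    for i in range(rounds):
        receiver = i % 2
        if arrival_times is None:
            clock += rng.randint(1, 10)  # nondeterministic network delay
        else:
            clock = arrival_times[i]     # replay the logged arrival time
        log.append(clock)
        payload = nodes[receiver].on_message(clock, payload)
        trace.append(payload)
    return log, trace


# Record with real nondeterminism, then replay from the arrival-time log.
recorded_log, recorded_trace = run(7, 6, rng=random.Random())
_, replayed_trace = run(7, 6, arrival_times=recorded_log)
assert replayed_trace == recorded_trace
```

In this sketch the log holds one integer per message regardless of payload size, which is the intuition behind the log-size reduction the abstract reports. The initial payload is an explicit external input, so it is passed to both runs.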
DDOS: taming nondeterminism in distributed systems
ASPLOS '13: Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems