Abstract
Transient faults that arise in large-scale software systems can often be repaired by re-executing the code in which they occur. Ascribing a meaningful semantics for safe re-execution in multi-threaded code is not obvious, however. For a thread to correctly re-execute a region of code, it must ensure that all other threads that have witnessed its unwanted effects within that region are also reverted to a meaningful earlier state. If this is not done properly, data inconsistencies and other undesirable behavior may result. However, automatically determining what constitutes a consistent global checkpoint is not straightforward, since thread interactions are a dynamic property of the program.

In this paper, we present a safe and efficient checkpointing mechanism for Concurrent ML (CML) that can be used to recover from transient faults. We introduce a new linguistic abstraction called stabilizers that permits the specification of per-thread monitors and the restoration of globally consistent checkpoints. Safe global states are computed through lightweight monitoring of communication events among threads (e.g., message-passing operations or updates to shared variables).

Our experimental results on several realistic, multi-threaded, server-style CML applications, including a web server and a windowing toolkit, show that the overhead of using stabilizers is small, and lead us to conclude that stabilizers are a viable mechanism for defining safe checkpoints in concurrent functional programs.
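To make the idea of computing a safe global state from monitored communication events more concrete, the following is a minimal Python sketch, not the paper's CML implementation. It assumes a simplified, one-directional model in which a receiver "witnesses" the effects of a sender; the names `CommunicationMonitor`, `record_send`, and `rollback_set` are hypothetical and chosen only for illustration. Given a faulting thread, it computes the set of threads that would have to be reverted together with it.

```python
from collections import defaultdict, deque


class CommunicationMonitor:
    """Hypothetical monitor: records which threads witnessed whose effects."""

    def __init__(self):
        # witnesses[t] holds the threads that received a message (or read a
        # shared update) produced by t since t's last checkpoint.
        self.witnesses = defaultdict(set)

    def record_send(self, sender, receiver):
        """Log one communication event from sender to receiver."""
        self.witnesses[sender].add(receiver)

    def rollback_set(self, faulting_thread):
        """Return every thread that must revert along with faulting_thread.

        This is the transitive closure of the 'witnessed' relation,
        computed with a breadth-first traversal.
        """
        to_revert = {faulting_thread}
        pending = deque([faulting_thread])
        while pending:
            current = pending.popleft()
            for witness in self.witnesses.get(current, ()):
                if witness not in to_revert:
                    to_revert.add(witness)
                    pending.append(witness)
        return to_revert


monitor = CommunicationMonitor()
monitor.record_send("t1", "t2")  # t2 witnesses t1's effects
monitor.record_send("t2", "t3")  # t3 witnesses them indirectly via t2
monitor.record_send("t4", "t1")  # t4 only sent to t1; it saw nothing of t1
print(sorted(monitor.rollback_set("t1")))
```

In this sketch, `t4` is left untouched because it never observed anything produced by `t1`. A real mechanism would also have to account for synchronous communication, where both parties learn of the exchange, and for restoring each thread's saved state; both are omitted here.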
Stabilizers: a modular checkpointing abstraction for concurrent functional programs
ICFP '06: Proceedings of the eleventh ACM SIGPLAN international conference on Functional programming