Abstract
To avoid having to restart a job from the beginning after a random failure, it is standard practice to periodically save enough information to restart the job from the most recent point at which information was saved. Such points are called checkpoints, and saving the information at these points is called checkpointing [1].
[1] Jasper, David P. A discussion of checkpoint/restart. Software Age (Oct. 1969), 9-14.
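The checkpoint/restart mechanism described in the abstract can be illustrated with a small sketch. This is not the paper's method and does not compute an optimum interval. It only shows a job that saves its state every few steps and resumes from the last saved state after a failure. The names `run_job`, `checkpoint_every`, `fail_at`, and `demo` are hypothetical and introduced purely for illustration, and the failure is simulated rather than random.

```python
import json
import os
import tempfile


def run_job(total_steps, checkpoint_every, path, fail_at=None):
    """Sum the integers 0..total_steps-1, saving state periodically.

    If a checkpoint file already exists at `path`, the job resumes from it.
    `fail_at` is a hypothetical hook that simulates a failure at that step.
    """
    state = {"step": 0, "acc": 0}
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)  # restart from the last checkpoint

    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            # Write to a temporary file, then rename, so that a crash
            # during the write cannot corrupt the previous checkpoint.
            tmp = path + ".tmp"
            with open(tmp, "w") as f:
                json.dump(state, f)
            os.replace(tmp, path)
    return state["acc"]


def demo():
    """Fail partway through, then restart from the saved checkpoint."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "job.ckpt")
        try:
            run_job(10, 3, path, fail_at=7)
        except RuntimeError:
            pass  # the job died; only the checkpoint file survives
        with open(path) as f:
            resumed_from = json.load(f)["step"]
        return resumed_from, run_job(10, 3, path)
```

In this sketch the work done between the last checkpoint and the failure is lost and must be repeated on restart, while each checkpoint adds its own saving cost. That trade-off is what makes the choice of checkpoint interval a question worth optimizing.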
Index Terms
A first order approximation to the optimum checkpoint interval




