Checkpointing models for tasks of different types

A server subject to random breakdowns and repairs offers services to incoming jobs whose lengths are highly variable. A checkpointing policy is in operation, aiming to protect against possibly lengthy recovery periods by backing up the current state at periodic checkpoints. The problem of how to choose a checkpointing interval in order to optimize performance is addressed by analysing a general queueing model which includes breakdowns, repairs, back-ups and recoveries. Exact solutions are obtained under both Markovian and non-Markovian assumptions. Numerical experiments illustrate the conditions where checkpoints are useful and where they are not, and in the former case, quantify the achievable benefits.


INTRODUCTION
Checkpointing is an important and useful crash-tolerant technique that involves storing process state during normal operation and restoring the recorded state to speed up recovery after a failure. It is cost-effective compared to hardware redundancy techniques, especially when the storage system for checkpointing data is reliable (Elnozahy et al. [18]). Not surprisingly, many commercial systems and public libraries, such as BlueGene/L (Adiga et al. [1]), IRIX OS (Tuthill et al. [38]) and Unix (Wang et al. [39]), have emerged to provide convenient APIs to facilitate its implementation.
The history of checkpointing goes back more than four decades, to the early days of transaction processing. Traditionally, the technique has involved keeping an 'audit trail' of transactions executed since the last checkpoint. Those transactions would be re-run in the event of a breakdown. The main question of interest would be how to choose the checkpoint frequency so as to minimize some appropriate cost function. In answering that question, the actual lengths of individual transactions were either ignored, or they were all assumed to have the same characteristics (e.g., exponentially distributed with the same mean).
We are interested in studying a checkpointing policy under a more realistic scenario where job processing times are random variables with a large coefficient of variation. That is, most of the jobs requiring service are short, but a few are very long. Exactly such a pattern of demand has been observed by monitoring a real-life cluster; see Chen et al. [9]. Under those conditions, the purpose of the policy would be to shorten the execution time of the long jobs by reducing the recovery period following a breakdown, without at the same time adding significantly to the processing of the short jobs. These considerations would govern the choice of the checkpoint interval.
An example of such a 'mixed' workload is the HTAP (hybrid transaction analytics processing) workload, which typically contains short transactions but also long-running analytical style queries. An example of the latter is PageRank. Processing such a query typically involves many data accesses and intermediate states before the end-result is returned to the user (see Chung [10]). Checkpointing some of those states would certainly shorten the run time in the event of a crash, while for short transactions no checkpoints would be needed.
Widely varying task execution times have been acknowledged to pose major challenges in both cloud-based systems (Chen et al. [8]) and stream processing systems with high availability requirements (Cardellini et al. [5]). The alternative to checkpointing for ensuring fault tolerance, namely replicated processing, incurs high system overhead and also energy costs.
Modern stream processing platforms like Apache Flink have a built-in checkpointing mechanism. However, they use it in an ad-hoc fashion. Flink provides an arbitrarily chosen default value for the checkpointing interval and allows users to override it. The user is then left to carry out their own optimization (see [20] for configuring checkpoints in the latest Flink version released in May 2023). Other stream processing systems, such as Scabbard (Theodorakis et al. [37]) on a single host and Carbone et al. [4] on distributed servers, also let the user choose the checkpointing intervals.
Clearly, algorithms that enable the evaluation of performance as a function of demand, breakdown and checkpointing parameters would help a system designer to make intelligent choices. Our aim is to provide such algorithms.
The contribution of this paper is to analyse a queue served by a single unreliable server, operating a checkpoint policy in a mixed workload environment. That server is likely to be part of some distributed system. The incoming jobs are typically submitted by another server. Processing a request may involve one or more database accesses whereby data is cached locally. All checkpoint back-ups take place on a database or possibly on the local disk.
A server breakdown may occur while (i) a job is being served, (ii) a checkpoint is being established, i.e. the current process state is being backed up, (iii) a recovery from a previous breakdown is in progress, or (iv) the queue is empty and the server is idle. Following a breakdown event, the server does nothing for a random interval which will be referred to as the 'repair period'. After that, in cases (i), (ii), or (iii), it performs a recovery operation consisting of going back to the last checkpoint if there is one, or to the beginning of the job if not, and redoing the work done since then. In case (iv), after being repaired the server takes the first waiting job, if any, and starts a new service.
Note that the repair period may not in fact involve an actual repair or reboot of the server. It may consist of disconnecting the primary server and replacing it with a secondary one that had been kept in reserve and in receipt of checkpoints from the primary (see Güler and Özkasap [26], Oliveira et al. [32]). As far as the model is concerned, the exact nature of the operation is immaterial; of importance is only the distribution of the resulting inoperative interval.
The start of a job's execution plays the role of an initial checkpoint. Further checkpoints may be inserted at intervals as the run progresses. When the execution is completed, the job commits all its updates and departs from the queue. Then the next waiting job, if any, starts its service. Thus, the checkpoint policy can be designed so as to leave short jobs largely unaffected, while reducing the run time of long jobs by shortening their recovery periods following a server breakdown.
Such a model has not, to our knowledge, been analyzed before. The objective of the analysis is to determine a performance measure such as the average response time or the average number of jobs present. This will enable the evaluation of the trade-offs between the costs incurred in backing up the current process state, and the benefits derived from faster recovery operations. We start by solving the model under Markovian assumptions, but later generalize it to allow non-exponential distributions and also multiple servers.

Related work
Models of checkpointing policies have been studied quite extensively over the years, under a variety of assumptions and application contexts. Research advances on checkpointing have also been 'checkpointed' by surveys at regular intervals. Worth mentioning are the surveys by Chandy [7], Nicola [31], Elnozahy et al. [18] and Marzouk and Jmaiel [29]. The Elnozahy et al. survey focuses exclusively on long-running computations, while Chandy mainly surveys checkpointing of streams of short transactions.
A large body of literature deals with long-running computations (sometimes referred to as 'infinite horizon'), motivated by scientific workloads which might typically take hours or days to complete. Those papers are not interested in performance metrics related to customers (e.g., average latency). The optimization criterion is the fraction of time that the server is doing useful work. Examples of such studies are Coffman and Gilbert [11], Liu et al. [28], Grassi et al. [25], Bruno and Coffman [3], Plank and Thomason [34], Subasi et al. [36] and Gelenbe et al. [24] (the last paper also aims to minimize the energy used). A more recent long-running application that has received some attention in the literature is the speculative parallel discrete-event simulation, where incremental checkpoints are established during a long simulation run. Carnà et al. [6] have proposed and evaluated a hardware-assisted technique for speeding up the checkpointing process.
A different application of checkpointing was studied by Dimitriou [15], who analysed a model where jobs finding a busy server are not queued, but retry after a random period. Jobs are assumed to consist of a fixed number of tasks, with a checkpoint at the end of each task. Such a policy may be appropriate in the field of communications, but it would not be implemented in a transaction processing system. It does not discriminate between short jobs and long jobs, and its use of service capacity is inefficient: the server may remain operative and idle while there are jobs requiring service.
Models where jobs are queued have also been studied. Gelenbe [23] derived an expression for the optimal checkpoint interval. Baccelli [2] developed a numerical procedure for computing the average response time, while Dohi et al. [16] generalized the checkpoint policy by making it age-dependent. All those authors obtained their results by assuming that during operative periods the system behaves like an M/M/1 queue. An audit trail is maintained, keeping track of the jobs that would have to be re-run in the event of a breakdown. The implicit assumption is that the audit trail survives the breakdown, i.e. it had been somehow backed up continuously. Then the recovery following a breakdown is treated simply as a period during which jobs continue to arrive but none are served. The duration of that period is a linear function of the operative time elapsed since the last checkpoint.
In our model, there is no need for an audit trail because a breakdown only affects the job currently served, not the ones already completed.
More distantly related to our work are studies of various aspects of checkpointing that do not involve queueing. The optimization criterion is usually related to the cost of taking a checkpoint and recovering from a breakdown. Examples of such studies are Ling et al. [27], Liu et al. [28], Ozaki et al. [33], Shin et al. [35] and Subasi et al. [36]. By not having to consider the queue, those authors are able to tackle features such as real-time tasks or general distributions of intervals between failures. Several different maintenance and repair models were examined by de Souza e Silva and Gail [14], again without considering the jobs.
Server breakdowns are instances of service interruptions, also referred to as server vacations. The literature on service interruptions is quite extensive, although it is generally not concerned with the problems of checkpointing. For example, Fiems et al. [19] have examined systems where services are interrupted at random and are then either resumed from the point of interruption or are repeated in their entirety.
A preliminary and considerably shorter version of this paper was presented at the European Performance Engineering Workshop in Santa Pola, 2022. It omitted a number of the results and proofs that are included here.

Summary of assumptions, analysis and results
The modelling approach is based on the concept of a job's 'effective service time', which has been used before in different contexts. In our case, this is the random interval consisting of the job's required service time, plus any interruptions due to checkpoints, breakdowns, repairs and recoveries. Thus, each job occupies the server for its effective service time. The system is then modelled as a special kind of M/G/1 queue.
Most of the analysis is devoted to determining the characteristics of the effective service time, in particular its first and second moments. However, the standard M/G/1 results are not directly applicable because breakdowns may occur while the server is idle and jobs arriving into an empty queue do not necessarily start service immediately. This situation is known in the literature as 'server with vacations': a server encountering an empty queue goes away for a random period, resuming service when it returns and finds jobs present.
Fuhrmann and Cooper [21] have obtained a decomposition result for queues where the server takes a vacation as soon as it becomes idle. That result relates the performance of the M/G/1 queue with vacations to that of the standard M/G/1 queue without vacations.
When an idle server breaks down, we treat the following repair period as a vacation. Defined in that way, vacations are sometimes taken and sometimes not. We derive our own decomposition result by analysing the Markov chain embedded at departure instants, and use it to determine performance measures.
It is possible to define vacations differently, so that they are always taken as soon as the queue becomes empty. A 'vacation' could be the interval from the moment the queue empties until either a job arrives to an idle (not recovering) server, or the server finishes recovering from a breakdown (regardless of whether a job arrived during the breakdown). The number of jobs that arrive during a vacation would then be either zero (if the next job arrives before the next breakdown), or the number of arrivals during a recovery period (if the next job arrives after the next breakdown). With such a definition, Fuhrmann and Cooper's decomposition would apply. This was pointed out to us by a reviewer. However, interpreting their result in terms of our random variables is not quite straightforward. Rather than going through that process, we have decided to keep our direct derivation.
Because the analysis of both the effective service time and the vacations is non-trivial, it is presented first in the case when the checkpointing, back-up, breakdown and repair intervals are distributed exponentially. Then the model is generalized to allow the back-up, checkpoint and repair intervals to have general distributions. Those generalizations do not involve any approximations, so the results are still exact. Finally, it is suggested that systems with non-exponential intervals between breakdowns, or with more than one server, may be solved approximately. Those two approximations are peripheral to the main study and the evaluation of their accuracy is left for future work.
Several numerical experiments exploring the behaviour of the system for different parameter settings, under both Markovian and non-Markovian assumptions, are presented. These include an evaluation of the optimal checkpoint interval and the maximum achievable benefit of checkpointing. An example of an approximate solution, where the intervals between breakdowns have an Erlang distribution, is also included because it may illustrate an unexpected effect.

MARKOVIAN CHECKPOINTING ENVIRONMENT
The server goes through alternating periods of being operative and broken (or available and unavailable). These are distributed exponentially with means $1/\xi$ and $1/\eta$, respectively. Jobs arrive in a Poisson stream with rate $\lambda$. The required service times have a Hyperexponential distribution with $K$ exponential phases, where phase $i$ is entered with probability $\alpha_i$ and has an average of $1/\mu_i$ ($i = 1, 2, \ldots, K$). After completing the chosen exponential phase, the job departs. A three-phase Hyperexponential distribution is illustrated in Figure 1.
A Hyperexponential distribution with $K$ phases can be used to model jobs of $K$ different types of the kind observed in [8,9]. Its coefficient of variation is always greater than or equal to 1, and can be arbitrarily large. For example, using just two Hyperexponential phases, with entry probability $\alpha_2$ and service rate $\mu_2$ much smaller than $\alpha_1$ and $\mu_1$, respectively, one can model patterns of demand where most of the jobs are short and a few are very long.
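The size of the coefficient of variation is easy to check numerically. The following Python sketch computes the mean, the second moment and the squared coefficient of variation of a Hyperexponential distribution from its phase probabilities and phase means. The function name `hyperexp_moments` is ours, introduced only for illustration; the two-phase values are the ones used in the numerical experiments reported later in the paper.

```python
def hyperexp_moments(probs, means):
    """Return (mean, second moment, squared coefficient of variation).

    probs[i] is the probability of entering phase i and means[i] is the
    mean of that exponential phase.
    """
    if abs(sum(probs) - 1.0) > 1e-12:
        raise ValueError("phase probabilities must sum to 1")
    m1 = sum(p * m for p, m in zip(probs, means))
    # An exponential variable with mean m has second moment 2 * m**2.
    m2 = sum(p * 2.0 * m * m for p, m in zip(probs, means))
    cv2 = (m2 - m1 * m1) / (m1 * m1)
    return m1, m2, cv2


# 80% of jobs have mean 40, 20% have mean 400.
m1, m2, cv2 = hyperexp_moments([0.8, 0.2], [40.0, 400.0])
print(m1, m2, cv2)
```

With these values the mean is 112 and the squared coefficient of variation is roughly 4.3, whereas a single exponential phase gives exactly 1.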
While being served, a job sets up periodic checkpoints at random intervals. At the start of its service phase, or after a checkpoint has been established, a timer distributed exponentially with mean $1/\beta$ is started. If that timer expires before the phase completes, a new checkpoint is attempted. The establishment of a checkpoint is not instantaneous but requires an exponentially distributed interval of time with mean $1/\gamma$. That interval will be referred to as the 'back-up' operation.

Fig. 1. A Hyperexponential distribution with three phases
Both the service intervals and the back-up operations may be interrupted by a server breakdown. Bearing in mind that the shortest of several exponentially distributed random variables is distributed exponentially with parameter equal to the sum of the parameters of the participating variables, we conclude that any service interval during phase $i$ is distributed exponentially with parameter $a_i$, given by
$$a_i = \mu_i + \beta + \xi , \qquad (1)$$
where $\mu_i$, $\beta$ and $\xi$ are the phase $i$ service rate, the checkpointing rate and the breakdown rate, respectively. The end point of such an interval is either a service completion, with probability $\mu_i/a_i$, or a checkpointing attempt, with probability $\beta/a_i$, or a server breakdown, with probability $\xi/a_i$. Similarly, any back-up operation in any phase is distributed exponentially with parameter $b$, given by
$$b = \gamma + \xi , \qquad (2)$$
where $\gamma$ is the back-up rate. Such an operation terminates with either the successful establishment of a checkpoint, with probability $\gamma/b$, or a server breakdown, with probability $\xi/b$.
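The competing-exponentials argument can be written down in a few lines. The sketch below is only an illustration: the helper `competing_exponentials` and the variable names are ours, and the rates are sample values in the range used in the numerical section rather than a prescribed configuration.

```python
def competing_exponentials(rates):
    """For independent exponential timers given as {event: rate}, return
    the rate of the first event to occur and the probability that each
    event is the one that occurs first."""
    total = sum(rates.values())
    return total, {event: rate / total for event, rate in rates.items()}


# Sample rates (assumed for illustration): service, checkpoint, breakdown, back-up.
service_rate, checkpoint_rate, breakdown_rate, backup_rate = 1 / 40, 1 / 8, 1e-4, 200.0

# A service interval ends with a completion, a checkpoint attempt or a breakdown.
a_i, service_outcomes = competing_exponentials(
    {"completion": service_rate, "checkpoint": checkpoint_rate, "breakdown": breakdown_rate}
)

# A back-up operation ends with an established checkpoint or a breakdown.
b, backup_outcomes = competing_exponentials(
    {"established": backup_rate, "breakdown": breakdown_rate}
)

print(a_i, service_outcomes)
print(b, backup_outcomes)
```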
If the server breaks down during a service interval or during a following back-up operation, the 'elapsed period', i.e. the work performed since the last checkpoint, or in the absence of a checkpoint since the beginning of the phase, must be repeated when the server is repaired.This is referred to as the 'recovery' operation.
Poisson breakdowns act as 'random observers' of the service process. When the service intervals are distributed exponentially, the random observer property implies that both the repeated elapsed period and the remaining service have the same distribution as the entire interval. In other words, a service interval which is interrupted by a breakdown is, on the average, twice as long as a random one.
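The factor of two can be checked with a small experiment. A random observer lands in an interval with probability proportional to that interval's length, so the mean length of the observed interval is the ratio of the sum of squared lengths to the sum of lengths. The sketch below estimates both means for exponential intervals; the function name, sample size and seed are arbitrary choices of ours.

```python
import random


def observed_interval_mean(rate, n=200_000, seed=1):
    """Return (mean of a typical interval, mean of the interval seen by a
    random observer) for n exponential intervals with the given rate."""
    rng = random.Random(seed)
    intervals = [rng.expovariate(rate) for _ in range(n)]
    plain_mean = sum(intervals) / n
    # Length-biased average: each interval is weighted by its own length.
    observed_mean = sum(x * x for x in intervals) / sum(intervals)
    return plain_mean, observed_mean


plain, observed = observed_interval_mean(rate=1.0)
print(plain, observed, observed / plain)
```

The ratio printed at the end should be close to 2, up to sampling error.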
In view of the above observation, the recovery operations in phase $i$ are distributed exponentially with parameter $a_i$. Of course, the server may break down again during a recovery, in which case another recovery with the same distribution is started after the repair.
We assume that at each recovery operation, the elapsed period to be repeated is resampled from the appropriate distribution. That assumption is motivated by the fact that different runs of the same task never take exactly the same time, particularly in a multi-core environment. Run times are affected by a varying hardware-software configuration. Hence, after a breakdown, the elapsed period can be reproduced only in distribution. It is worth pointing out, however, that although the resampling forgets the exact duration of the elapsed period, it does remember that a service was interrupted by a breakdown and was therefore longer than average. Note that the action taken after the server is repaired following a breakdown depends on whether the breakdown occurred while the server was idle, or whether it occurred while the server was active (i.e., serving a job, backing up or recovering). In the former case there is no need for a recovery: either the server is again idle, or a job has arrived in the meantime and a new service begins. In the latter case, a new recovery starts, whose duration depends on the phase that was in progress when the breakdown occurred.
Denote by $S$ the random variable representing the total period between the start of a job's service and its completion. That period includes service intervals and back-up operations, as well as repair times and recovery operations following any breakdowns. The interval $S$ will be referred to as the 'effective service time'. While the server is serving jobs, the system behaves as a classic M/G/1 queue with i.i.d. service times distributed as $S$. However, it deviates from the M/G/1 model during idle periods which may be interrupted by breakdowns. That behaviour will be analysed separately.
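We shall need the first and second moments of the random variable $S$. In particular, the necessary and sufficient condition for stability of the system is that the offered load generated by the effective service times of the incoming jobs should be less than 1:
$$\lambda E(S) < 1 , \qquad (3)$$
where $\lambda$ is the arrival rate. Let $T_i$ be the time a job takes to complete phase $i$. The moments of the effective service time are simply expressed in terms of the moments of $T_i$ and the phase probabilities $\alpha_i$:
$$E(S) = \sum_{i=1}^{K} \alpha_i E(T_i) \qquad (4)$$
and
$$E(S^2) = \sum_{i=1}^{K} \alpha_i E(T_i^2) . \qquad (5)$$
The notations introduced so far are summarized in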

EXACT SOLUTION
The assumptions of the previous section imply that the execution of phase $i$ is a Markov process which can be in one of the following four states: Service, Backup, Repair or Recovery. The state transition diagram for that process is illustrated in Figure 2.

Fig. 2. State transition diagram for phase $i$
To determine the moments of $T_i$, we shall introduce two sets of auxiliary random variables. Let $C_i$ be the interval between attempting to establish a checkpoint during phase $i$, and resuming phase $i$ service. That interval may include repairs and recovery operations resulting from breakdowns during the back-up operation. Also define $R_i$ as the interval between a breakdown in phase $i$ and resuming phase $i$ service. It includes the repair and the recovery operation, plus any additional repairs and recoveries caused by further breakdowns. The time it takes to complete phase $i$, $T_i$, will be expressed in terms of these random variables.

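The random variables $C_i$ and $R_i$
Denote by $C_i^*(s)$ and $R_i^*(s)$ the Laplace transforms of the probability density functions of $C_i$ (the interval between a checkpoint attempt and the resumption of service) and $R_i$ (the interval between a breakdown and the resumption of service), respectively.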
The Laplace transform of an exponential p.d.f. with some parameter, $\theta$, is equal to $\theta/(\theta + s)$. Also, the transform of a sum of independent random variables is equal to the product of their transforms. Hence, we can write an equation, (6), for $R_i^*(s)$, in which the recovery parameter $a_i$ is given by (1). The first term in the square brackets of that equation is the probability that the recovery operation completes without interruption; the second term contains the probability that another breakdown occurs and a new random $R_i$ is started.
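The first and second moments of $R_i$ are obtained from $E(R_i) = -R_i^{*\prime}(0)$ and $E(R_i^2) = R_i^{*\prime\prime}(0)$. Differentiating (6) twice at $s = 0$ yields, after some algebra, the expressions (7) and (8) for those two moments. Since the interval $C_i$ terminates either as a successful back-up, or is interrupted by a breakdown and is followed by a recovery operation $R_i$, we can express $C_i^*(s)$ in terms of $R_i^*(s)$:
$$C_i^*(s) = \frac{b}{b+s}\left[\frac{\gamma}{b} + \frac{\xi}{b} R_i^*(s)\right] , \qquad (9)$$
where $\gamma$ is the back-up rate, $\xi$ is the breakdown rate and $b$ is given by (2). The first and second moments of $C_i$ are given by
$$E(C_i) = \frac{1}{b}\left[1 + \xi E(R_i)\right] \qquad (10)$$
and
$$E(C_i^2) = \frac{2}{b^2}\left[1 + \xi E(R_i)\right] + \frac{\xi}{b} E(R_i^2) . \qquad (11)$$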
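The moment relations used here, $E(X) = -F'(0)$ and $E(X^2) = F''(0)$ for a random variable $X$ whose density has Laplace transform $F(s)$, can be checked on the exponential density itself. The following short derivation is a routine verification added for illustration; it is not one of the numbered equations of the model.

```latex
\begin{align*}
F(s) &= \frac{\theta}{\theta+s}, &
F'(s) &= -\frac{\theta}{(\theta+s)^2}, &
F''(s) &= \frac{2\theta}{(\theta+s)^3}, \\
E(X) &= -F'(0) = \frac{1}{\theta}, &
E(X^2) &= F''(0) = \frac{2}{\theta^2}, &
\mathrm{Var}(X) &= E(X^2) - E(X)^2 = \frac{1}{\theta^2}.
\end{align*}
```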

The phase completion time
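Now remember that the phase execution, $T_i$, consists of a service interval which either terminates uninterrupted, or is interrupted by a back-up operation, $C_i$, and later resumed, or is interrupted by a breakdown interval, $R_i$, and later resumed. This leads to the following equation for the Laplace transform of $T_i$, $T_i^*(s)$:
$$T_i^*(s) = \frac{a_i}{a_i+s}\left[\frac{\mu_i}{a_i} + \frac{\beta}{a_i} C_i^*(s) T_i^*(s) + \frac{\xi}{a_i} R_i^*(s) T_i^*(s)\right] , \qquad (12)$$
where $\mu_i$, $\beta$ and $\xi$ are the phase $i$ service rate, the checkpointing rate and the breakdown rate, and $a_i$ is given by (1).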
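Differentiating twice at $s = 0$ and substituting the moments of $C_i$ and $R_i$ already derived, we obtain the first and second moments of $T_i$:
$$E(T_i) = \frac{1}{\mu_i}\left[1 + \beta E(C_i) + \xi E(R_i)\right] \qquad (13)$$
and
$$E(T_i^2) = 2 E(T_i)^2 + \frac{1}{\mu_i}\left[\beta E(C_i^2) + \xi E(R_i^2)\right] . \qquad (14)$$
Using expressions (10) and (7), the average execution time of phase $i$ can be rewritten in terms of the model parameters alone, (15). The ergodicity condition (3) can then be stated explicitly, (16). Note that when $\beta = 0$ and $\xi = 0$, i.e. when there are no checkpoints and no breakdowns, this reduces to the usual stability condition for the M/G/1 queue: the offered load must be less than 1.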
The exact solution for our model cannot be obtained by treating it as a simple M/G/1 queue. This is because of the possibility that the server may break down while the queue is empty. If that happens, a job may arrive into the system, find an empty queue, yet be unable to start service immediately.

The effect of breakdowns during idle periods
For an exact analysis, we shall consider the number of jobs present at (just after) consecutive departure instants. This is a discrete time Markov chain embedded in the non-Markovian queueing process. The following notation will be used: $\pi_j$ is the steady-state probability that there are $j$ jobs left in the system after a departure ($j = 0, 1, \ldots$). This is also the probability that an incoming job would see $j$ jobs present. Hence, by the PASTA property, it is also the probability that a random observer would see $j$ jobs present.
$p_j$ is the probability that $j$ jobs arrive during an effective service time, $S$; $r_j$ is the probability that $j$ jobs arrive during a repair period; $q_{j,k}$ is the one-step transition probability that there are $k$ jobs present after the next departure instant, given that there were $j$ jobs present after the last one.
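We shall also introduce the generating functions
$$\pi(z) = \sum_{j=0}^{\infty} \pi_j z^j , \quad p(z) = \sum_{j=0}^{\infty} p_j z^j , \quad r(z) = \sum_{j=0}^{\infty} r_j z^j , \qquad (17)$$
where $\pi_j$ are the steady-state probabilities at departure instants, and $p_j$ and $r_j$ are the probabilities of $j$ arrivals during an effective service time and during a repair period, respectively. The possible one-step transitions from state $j$, for $j > 0$, are: to state $j - 1$ if no jobs arrive during the intervening effective service time; to state $j$ if one job arrives; to state $j + 1$ if two jobs arrive, and so on. Thus,
$$q_{j,k} = p_{k-j+1} , \quad j > 0 , \; k \ge j - 1 . \qquad (18)$$
A transition from state 0 to state $k$ can happen in several possible ways: (i) A job arrives while the server is still operative and $k$ further jobs arrive during the ensuing effective service time. (ii) The server breaks down before the next job, which arrives during the repair period; a total of $k$ further jobs arrive during the remaining repair period and the effective service time. (iii) The server breaks down and is repaired before the next job arrives; thereafter there is a transition from state 0 to state $k$. With arrival rate $\lambda$, breakdown rate $\xi$ and repair rate $\eta$, these considerations lead to the following equation for $q_{0,k}$:
$$q_{0,k} = \frac{\lambda}{\lambda+\xi}\, p_k + \frac{\xi}{\lambda+\xi}\left[\frac{\lambda}{\lambda+\eta} \sum_{m=0}^{k} r_m p_{k-m} + \frac{\eta}{\lambda+\eta}\, q_{0,k}\right] , \qquad (19)$$
which, after solving, becomes
$$q_{0,k} = \frac{(\lambda+\eta)\, p_k + \xi \sum_{m=0}^{k} r_m p_{k-m}}{\lambda+\xi+\eta} . \qquad (20)$$
These particular transition probabilities will be used via their generating function, $q_0(z)$, given by
$$q_0(z) = \sum_{k=0}^{\infty} q_{0,k} z^k = \frac{[\lambda + \eta + \xi r(z)]\, p(z)}{\lambda+\xi+\eta} . \qquad (21)$$
The steady-state probabilities satisfy the following set of balance equations:
$$\pi_k = \sum_{j=0}^{k+1} \pi_j q_{j,k} , \quad k = 0, 1, \ldots . \qquad (22)$$
In view of (18), these can be rewritten as
$$\pi_k = \pi_0 q_{0,k} + \sum_{j=1}^{k+1} \pi_j p_{k-j+1} . \qquad (23)$$
Multiplying the $k$'th equation in (23) by $z^k$ and summing, we transform the set of balance equations into a single equation involving generating functions:
$$\pi(z) = \pi_0 q_0(z) + \frac{p(z)}{z}\left[\pi(z) - \pi_0\right] . \qquad (24)$$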
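The three ways of leaving the empty state, (i) to (iii), can be checked numerically. The sketch below is a minimal illustration under stated assumptions: all function and variable names are ours, repair periods are exponential so that the number of arrivals during one is geometric, and the probabilities of arrivals during an effective service time are replaced by a hypothetical stand-in (arrivals during an exponential interval with mean 112), since the true distribution of the effective service time is not computed here.

```python
def arrivals_during_exponential(lam, rate, kmax):
    """P(j Poisson arrivals of rate lam occur during an exponential
    interval with the given rate), for j = 0..kmax (a geometric law)."""
    q = lam / (lam + rate)
    return [(1.0 - q) * q ** j for j in range(kmax + 1)]


def transitions_from_empty(lam, xi, eta, p):
    """Probabilities of leaving k jobs behind at the next departure when the
    previous departure left the system empty, obtained by solving the
    first-step equation built from cases (i), (ii) and (iii).
    p[k] = P(k arrivals during an effective service time)."""
    kmax = len(p) - 1
    r = arrivals_during_exponential(lam, eta, kmax)  # arrivals in a (residual) repair
    q0 = []
    for k in range(kmax + 1):
        conv = sum(r[m] * p[k - m] for m in range(k + 1))
        q0.append(((lam + eta) * p[k] + xi * conv) / (lam + xi + eta))
    return q0


# Illustrative rates: arrival, breakdown, repair.
lam, xi, eta = 0.007, 1e-3, 1 / 15
# Hypothetical stand-in for the arrivals during an effective service time.
p = arrivals_during_exponential(lam, 1 / 112, 400)
q0 = transitions_from_empty(lam, xi, eta, p)
print(sum(q0), q0[:3])
```

The probabilities returned by `transitions_from_empty` sum to 1 (up to truncation of the tail), which is a useful sanity check on the three-case argument.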

General checkpoint intervals
Suppose that the interval, $X$, until the next checkpoint, has a general distribution with p.d.f. $f(x)$, cumulative distribution function $F(x)$ and Laplace transform $f^*(s)$. Consequently, the actual distribution of any service interval during phase $i$ depends on which of the following three alternatives takes place: (i) the phase completes before the occurrence of either a breakdown or a checkpoint; (ii) the service interval is interrupted by a breakdown before completion and before the next checkpoint; (iii) the next checkpoint is reached before the occurrence of either a breakdown or a phase completion. Denote the probabilities of cases (i), (ii) and (iii) by $u_i$, $v_i$ and $w_i$, respectively. Similar arguments to the ones used in the previous subsection, applied to phase $i$ with service rate $\mu_i$ and breakdown rate $\xi$, lead to
$$u_i = \frac{\mu_i}{\mu_i+\xi}\left[1 - f^*(\mu_i+\xi)\right] , \quad v_i = \frac{\xi}{\mu_i+\xi}\left[1 - f^*(\mu_i+\xi)\right] , \quad w_i = f^*(\mu_i+\xi) .$$
The conditional Laplace transforms, $U_i^*(s)$, $V_i^*(s)$ and $W_i^*(s)$, of a phase $i$ service interval, given that alternative (i), (ii) or (iii) has taken place, are given by
$$U_i^*(s) = \frac{\mu_i+\xi}{\mu_i+\xi+s} \cdot \frac{1 - f^*(\mu_i+\xi+s)}{1 - f^*(\mu_i+\xi)} , \qquad (48)$$
$$V_i^*(s) = \frac{\mu_i+\xi}{\mu_i+\xi+s} \cdot \frac{1 - f^*(\mu_i+\xi+s)}{1 - f^*(\mu_i+\xi)} \qquad (49)$$
and
$$W_i^*(s) = \frac{f^*(\mu_i+\xi+s)}{f^*(\mu_i+\xi)} . \qquad (50)$$
Note that (49) is also the conditional Laplace transform of the recovery interval, given that a breakdown occurred in alternative (ii), whereas (50) is the conditional Laplace transform of the recovery interval, given that a breakdown occurred during a back-up operation following alternative (iii). These two versions of the recovery interval give rise to two versions of the random variable $R_i$ (the interval between a breakdown in phase $i$, and resuming service). The corresponding Laplace transforms will be denoted by $R_{i,1}^*(s)$ and $R_{i,2}^*(s)$, respectively. They satisfy corresponding versions of (36) (see also the derivation of (45)).
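We can now write the generalized form of equation (12):
$$T_i^*(s) = u_i U_i^*(s) + w_i W_i^*(s)\, C_i^*(s)\, T_i^*(s) + v_i V_i^*(s)\, R_{i,1}^*(s)\, T_i^*(s) ,$$
where the first, second and third terms correspond to a phase completion, a checkpoint and a breakdown ending the service interval, and where $C_i^*(s)$, the transform of the interval between a checkpoint attempt and the resumption of service, is obtained according to (45), with the breakdown-interval transform replaced by $R_{i,2}^*(s)$, the version that applies to breakdowns during a back-up operation.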
All the elements necessary for determining the exact solution of the model, under general assumptions for the distributions of the repair, back-up and checkpoint intervals, are now in place.

Approximation for general breakdown intervals
Relaxing the Markovian character of the breakdown process, while still computing an exact solution, appears to be considerably more difficult. We are therefore proposing a simple approximation that can be readily justified in a practical situation. The argument is based on the fact that breakdowns are rare events.
Assume that the operative periods of the server, i.e. the intervals between completing a repair and the next breakdown, are i.i.d. random variables with an $n$-phase Coxian distribution. That is sufficiently general for practical purposes. By choosing the number of phases and their parameters appropriately, one can approximate most commonly used distributions, such as Erlang, Weibull, Hyperexponential, Lognormal, Normal, etc. The approximation procedure consists of computing several moments of the desired or observed distribution from the available data and then fitting the Coxian phases to produce those moments.
It is reasonable to assume that the rates $\xi_j$ of the Coxian phases are small compared to the other parameters (i.e. the average lengths of all phases are large). In that case, the queueing process can be assumed to be stable and reach steady state during each phase. The approximate solution of the generalized model would then consist of the following steps.
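(1) Compute the steady-state probabilities, $\sigma_j$, that an operative server is in phase $j$ of its breakdown interval. Let phase $j$ of the Coxian distribution have rate $\xi_j$ and be followed by phase $j+1$ with probability $c_j$, or by a breakdown with probability $1 - c_j$ ($c_n = 0$). These probabilities satisfy the balance equations
$$\sigma_j \xi_j c_j = \sigma_{j+1} \xi_{j+1} , \quad j = 1, 2, \ldots, n - 1 .$$
They can therefore be expressed, after normalization, as
$$\sigma_j = \frac{h_j/\xi_j}{\sum_{k=1}^{n} h_k/\xi_k} , \quad h_1 = 1 , \; h_j = \prod_{m=1}^{j-1} c_m .$$
(2) Apply the existing exact solution to phase $j$, using $\xi_j (1 - c_j)$ as the breakdown rate, and compute the average number of jobs present, $L_j$.
(3) Compute the overall average number of jobs present, $L$, as a weighted average over all phases:
$$L = \sum_{j=1}^{n} \sigma_j L_j . \qquad (56)$$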
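The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the Coxian distribution is parametrized by phase rates and by the probabilities of continuing to the next phase (a parametrization we assume), all names are ours, and the exact single-rate solution of step (2) is represented by a user-supplied function `exact_mean_jobs`, which is a placeholder.

```python
def coxian_phase_probabilities(rates, cont_probs):
    """Probability that an operative server is in each Coxian phase.

    rates[j] is the rate of phase j; cont_probs[j] is the probability of
    moving on to phase j + 1 when phase j ends (the last entry must be 0).
    Each phase is weighted by (probability of reaching it) / (its rate).
    """
    reach = 1.0
    weights = []
    for rate, cont in zip(rates, cont_probs):
        weights.append(reach / rate)
        reach *= cont
    total = sum(weights)
    return [w / total for w in weights]


def approximate_mean_jobs(rates, cont_probs, exact_mean_jobs):
    """Weighted average of per-phase results. exact_mean_jobs(breakdown_rate)
    is a placeholder for the exact solution with exponential breakdowns."""
    probs = coxian_phase_probabilities(rates, cont_probs)
    return sum(
        prob * exact_mean_jobs(rate * (1.0 - cont))
        for prob, rate, cont in zip(probs, rates, cont_probs)
    )


# Erlang-3 operative periods: three identical phases, breakdown after the last.
erlang_probs = coxian_phase_probabilities([3e-4, 3e-4, 3e-4], [1.0, 1.0, 0.0])
print(erlang_probs)
```

For an Erlang distribution the phases are equally likely, and breakdowns can occur only in the last phase.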

Approximation for multiple servers
Extending the checkpointing model to systems with more than one server must involve an approximation, since there is no known exact solution for the M/G/n queueing model. An effective and easily implementable heavy-traffic approximation was proposed in Whitt [40]. It works by first evaluating the performance of the M/M/n queue with offered load $\rho = \lambda E(S)$. Let $L_{M/M/n}$ be the average number of jobs waiting (i.e., excluding the jobs in service). That number is given by
$$L_{M/M/n} = \frac{\rho}{n - \rho}\, C(n, \rho) , \qquad (57)$$
where $C(n, \rho)$ is the probability that an incoming job would have to wait. A closed-form expression for $C(n, \rho)$ is provided by Erlang's delay formula, also known as Erlang's C formula (e.g., see [30]). The proposed approximation applies to any heavily loaded G/G/n queue. It expresses the average number of jobs present, $L_{G/G/n}$, in the form
$$L_{G/G/n} \approx \rho + L_{M/M/n}\, \frac{c_a^2 + c_s^2}{2} , \qquad (58)$$
where $c_a^2$ and $c_s^2$ are the squared coefficients of variation of the interarrival interval and the service interval, respectively. When the arrival process is Poisson, we have $c_a^2 = 1$, and (58) becomes
$$L_{M/G/n} \approx \rho + L_{M/M/n}\, \frac{1 + c_s^2}{2} . \qquad (59)$$
This expression is exact when $n = 1$.
ACM Trans. Model. Perform. Eval. Comput. Syst.
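A sketch of this multi-server approximation for Poisson arrivals is given below. It uses the standard Erlang C formula for the probability of waiting and scales the M/M/n queue length by half the sum of the squared coefficients of variation; the function names are ours, and the moments passed in the example are those of the required service times from the numerical section, used here only as sample input in place of the effective service time moments.

```python
import math


def erlang_c(n, rho):
    """Probability that an arriving job has to wait in an M/M/n queue with
    offered load rho (arrival rate times mean service time), rho < n."""
    if not 0.0 < rho < n:
        raise ValueError("offered load must satisfy 0 < rho < n")
    head = sum(rho ** k / math.factorial(k) for k in range(n))
    tail = rho ** n / math.factorial(n) * n / (n - rho)
    return tail / (head + tail)


def approx_mean_jobs(n, lam, es, es2):
    """Approximate mean number of jobs in an M/G/n queue, given the first
    and second moments (es, es2) of the service time."""
    rho = lam * es
    waiting_mmn = erlang_c(n, rho) * rho / (n - rho)  # mean number waiting, M/M/n
    cs2 = es2 / es ** 2 - 1.0  # squared coefficient of variation of service
    return rho + waiting_mmn * (1.0 + cs2) / 2.0


# Sample input: moments of the two-phase Hyperexponential required service time.
print(approx_mean_jobs(1, 0.007, 112.0, 66560.0))
print(approx_mean_jobs(2, 0.014, 112.0, 66560.0))
```

For a single server the result coincides with the Pollaczek-Khinchine formula, which is the sense in which the approximation is exact in that case.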
As the approximation is intended for heavily loaded systems, where servers are rarely idle, the disruptive effect caused by breakdowns of idle servers is ignored.
Remark. The approximations described in the last two subsections have been used on a number of other occasions. A quantitative evaluation of their accuracy in the context of our system may be provided by simulations. However, as the breakdowns are rare events, that would be a non-trivial exercise, possibly requiring special techniques. Such an undertaking is outside the scope of the present work.

NUMERICAL RESULTS
The exact solutions of Sections 3 and 4 were applied to several example systems, with the aim of examining the trade-offs between the costs and benefits of checkpointing. In order to reduce the parameter space to be explored, some of the parameters are kept fixed. The required service times are assumed to have a two-phase Hyperexponential distribution with quite a large coefficient of variation: $1/\mu_1 = 40$; $1/\mu_2 = 400$; $\alpha_1 = 0.8$; $\alpha_2 = 0.2$. In other words, the average requirement of 80% of the incoming jobs is 40, and for the other 20% it is 400. These parameters are chosen to conform, as far as it was possible to extract average values from the reported statistics, with the data collected in [9]. The average repair period is assumed to be relatively short, $1/\eta = 15$. The arrival rate, $\lambda$, the checkpointing rate, $\beta$, the average back-up interval, $1/\gamma$, and the breakdown rate, $\xi$, are varied. The performance measure in all cases is the average number of jobs present, $L$. If we wished to evaluate the average response time, $W$, we would use Little's result to compute $W = L/\lambda$.

Markovian assumptions
In the first example, the arrival rate is set to $\lambda = 0.007$, and the average number of jobs present, $L$, is plotted against the average checkpoint interval, $\tau = 1/\beta$, for three different values of the breakdown rate: $\xi = 10^{-4.5}$, $\xi = 10^{-4}$ and $\xi = 10^{-3.5}$. These rates are comparable to the ones reported in Garraghan et al. [22]. The back-up rate is set to $\gamma = 200$.
With these parameters, the system load is on the order of 80% or higher.
The results are shown in Figure 3. All three plots exhibit a steep initial decline in occupancy, followed by a slow increase. When the checkpoint interval is small, the high occupancy is due to the cost of the back-up operations. As the interval increases, that cost decreases but the cost of the recovery operations increases. Since the breakdown rate is quite small, the trade-off is resolved quite slowly, leading to flat portions of the plots. The higher the breakdown rate, the earlier the occupancy starts to increase. In this example, the optimal checkpoint interval is about 14 when the breakdown rate is $10^{-4.5}$, 8 when it is $10^{-4}$ and close to 6 when it is $10^{-3.5}$. However, the flatness of the two lower plots means that a policy which optimizes for the high breakdown rate (choosing an interval of 6) would perform well for the lower rates too.
In the second experiment, the breakdown rate is kept fixed at $10^{-4}$, while the arrival rate takes three different values: 0.0065, 0.007 and 0.0075. Again, the average number of jobs present is plotted against the average checkpoint interval. The results are shown in Figure 4.
These plots display behaviour similar to that of the previous figure: a steeply decreasing portion is followed by an almost flat one. For all three arrival rates, the optimal checkpointing interval is about 8, but any value between 4 and 20 would perform quite well. If the interval is increased much further, towards infinity (i.e. a checkpointing rate of zero), the average number of jobs present would increase again, particularly when the load is high.
It is intuitively clear that the more reliable the server, the less frequent the checkpoints need to be. On the other hand, the less time it takes to perform a back-up operation, the more checkpoints can be afforded. In order to quantify these observations, we have evaluated the optimal checkpointing rate as a function of the average interval between breakdowns. This was done for three different values of the back-up rate: 100, 200 and 400. The results are presented in Figure 5, where the average interval between breakdowns is scaled exponentially. At each point, the optimal checkpointing rate was found by carrying out a search. As expected, the optimal checkpointing rate eventually becomes zero when the server becomes sufficiently reliable. However, that point is not reached quickly: when the back-up rate is 100 or 200, the mean time between failures needs to be about $10^{7}$ before checkpoints become counterproductive. That interval increases to $10^{8}$ for a back-up rate of 400. At the other end of the range, when breakdowns are relatively frequent, the optimal checkpointing rate is higher and increases with the speed of the back-up operations.
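The text does not say which search procedure was used to locate the optimum. One plausible choice for a curve with a single minimum is a golden-section search over the checkpoint interval, sketched below. The objective `toy_occupancy` is a hypothetical stand-in (a cost that falls with the interval plus a cost that grows with it), not the model's solution; in practice it would be replaced by the exact expression for the average number of jobs present.

```python
import math


def golden_section_minimize(f, lo, hi, tol=1e-5):
    """Minimize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:
            # The minimum lies in [a, d]; the old c becomes the new d.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:
            # The minimum lies in [c, b]; the old d becomes the new c.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return (a + b) / 2.0


def toy_occupancy(interval):
    # Hypothetical objective, NOT the paper's formula: back-up overhead
    # shrinks with the interval while recovery overhead grows with it.
    return 1.0 / interval + interval / 64.0


best_interval = golden_section_minimize(toy_occupancy, 1.0, 100.0)
print(f"best interval for the toy objective: {best_interval:.3f}")
```

Because the optimal checkpointing rate can be zero, the value found by such a search would still have to be compared with the no-checkpoint case before being accepted.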
In order to exercise the model under more extreme conditions, we have evaluated the optimal checkpoint frequency for increasing back-up intervals and for higher breakdown rates. A lower arrival rate of 0.005 is chosen, in order to avoid the system becoming unstable. The results are shown in Figure 6.
Predictably, we observe that the higher the cost of establishing a checkpoint, the lower the checkpointing frequency should be. On the other hand, the higher the rate of breakdowns, the more frequent the checkpoints should be.

Constant checkpoint, back-up and repair intervals
In order to illustrate the effect that a change of distributions has on performance, we have repeated the experiment of Figure 4, under the assumption that the repair, back-up and checkpoint intervals are constant, keeping their means as before. Clearly, such a change would reduce the variance of the effective service time and hence would reduce the average number of jobs present. However, it is not obvious whether the optimal value of the checkpoint frequency (the reciprocal of the mean checkpoint interval) is affected and if so, to what extent. In Figure 7 the optimal checkpoint frequency is plotted against the average interval between breakdowns on an exponential scale, for three values of the back-up interval. We observe behaviour similar to that exhibited in Figure 5: as the server becomes more reliable, the optimal checkpoint frequency decreases and eventually becomes 0 (i.e. checkpoints are no longer needed).
The notable difference in this example is that the actual values of the optimal checkpoint frequency are in all cases significantly lower than before. That outcome can be explained intuitively by arguing that a deterministic checkpointing policy is more effective than a random one with the same averages. Hence, one can achieve the desired result with fewer checkpoints. It may be conjectured that, of all distributions of the checkpoint interval with the same mean, the constant one is optimal. In fact, such a result was proved for a different model in [23].
The last experiment involving exact results aims to quantify the gains achievable by a checkpointing policy. This is done by comparing the average number of jobs present when no checkpoints are used (i.e. the checkpointing rate is zero, or equivalently the checkpoint interval is infinite) with the number present when the policy uses the optimal checkpoint frequency. The latter value is determined by a search. The other parameters are as before, with = 10 In Figure 8, the unoptimized and optimized values of the average number of jobs present are plotted against the arrival rate, which varies from 0.005 to 0.008. On that range, the offered load varies from about 55% to about 90%. As might have been expected, when the system is lightly loaded, the gains of checkpointing are slight. Jobs affected by breakdowns do not tend to delay other jobs at light loads, because the queue is short. Hence, any savings in their effective service times show limited benefits. As the load increases, the queue gets longer and the savings are noticed by more waiting jobs. At the 90% load, the difference between no checkpointing and optimal checkpointing is about 25%.
We carried out the same comparison under the assumption that the repair, back-up and checkpoint intervals are distributed exponentially. The results were very similar to the ones presented in Figure 8 and are therefore omitted.
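The quoted load range can be reproduced with a short calculation, under the assumption that the offered load is the arrival rate multiplied by the mean required service time, ignoring the overheads of back-ups, recoveries and repairs. The function name `offered_load` is illustrative.

```python
# Offered load = arrival rate * mean required service time (an assumption:
# back-up, recovery and repair overheads are not included here).
mean_service = 0.8 * 40.0 + 0.2 * 400.0   # mean of the hyperexponential requirement


def offered_load(arrival_rate, mean_requirement=mean_service):
    return arrival_rate * mean_requirement


for rate in (0.005, 0.007, 0.008):
    print(f"arrival rate {rate}: offered load {offered_load(rate):.1%}")
```

Under this assumption the two ends of the range come out at 56% and just under 90%, in line with the figures quoted in the text.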

Approximation for non-exponential breakdown intervals
To gauge the effect of a different distribution of the intervals between breakdowns, we have applied the approximation described in section 4.4. A less variable breakdown pattern was modelled by assuming that the intervals between breakdowns have an Erlang distribution with four phases. Otherwise, all parameters are the same as in Figure 3, with a breakdown rate fixed at $10^{-3.5}$. To achieve that rate, each of the four Erlang phases has an average length of $10^{3.5}/4$.
In Figure 9, the average number of jobs present is plotted against the average interval between checkpoints, for exponential and Erlang distributions of the intervals between breakdowns. The former plot corresponds to the top curve in Figure 3 and the latter is computed according to equation (56). Usually, when the coefficient of variation of a relevant variable in a queueing system is reduced, the system performance improves. Here we observe the opposite phenomenon. Replacing the exponentially distributed breakdown intervals (squared coefficient of variation 1) by four-phase Erlang ones (squared coefficient of variation 1/4) leads to a slight but consistent increase in the average number of jobs present. Assuming that this is a genuine effect, rather than a consequence of the approximation, it could be explained as follows.
The Erlang distribution implies that a breakdown occurs when phase 4 expires. In other words, during phases 1, 2 and 3 there are no breakdowns, while during phase 4 a breakdown occurs at rate $4 \times 10^{-3.5}$. On the other hand, checkpoints are always established at the same rate, regardless of the phase. Three quarters of those checkpoints are not really necessary; they incur the back-up penalties even though there would not be a breakdown during their phase. When the breakdown intervals are distributed exponentially, a breakdown may occur at any time with rate $10^{-3.5}$, making all checkpoints potentially advantageous.
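The Erlang parameters used in this argument follow from elementary arithmetic, sketched below with illustrative variable names. For an Erlang distribution with k phases the squared coefficient of variation is 1/k, so four phases give 1/4 (a coefficient of variation of 1/2).

```python
import math

phases = 4
mean_interval = 10 ** 3.5               # mean time between breakdowns
phase_mean = mean_interval / phases     # mean length of each Erlang phase
phase_rate = 1.0 / phase_mean           # rate at which a phase ends: 4 * 10**-3.5

scv = 1.0 / phases                      # squared coefficient of variation of Erlang-k
cv = math.sqrt(scv)

print(f"mean phase length: {phase_mean:.2f}")
print(f"rate within a phase: {phase_rate:.6f}")
print(f"squared coefficient of variation: {scv}, coefficient of variation: {cv}")
```

Only the last of the four phases ends in a breakdown, which is why the breakdown rate during that phase is four times the overall rate.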

CONCLUSIONS
We have examined the effects of checkpointing on performance by analysing a rather general queueing model involving breakdowns, repairs and back-up operations. A major objective of the study was to handle a job population with a large variability of required service times. Exact solutions were obtained under both Markovian and non-Markovian assumptions. These were used in order to determine the optimal checkpoint frequency for different parameter settings, and to quantify the benefits of checkpointing.
Two of the proposed generalizations, to non-exponential intervals between breakdowns and to multiple parallel servers, involve approximate solutions. As indicated in the Remark at the end of subsection 4.5, assessing the accuracy of these approximations is outside the scope of the present paper but would be a worthy topic for future research.
Another interesting generalization that would be worth studying is to divide the mixed job population into several classes, with a separate queue for each class. A non-preemptive priority scheduling policy could be in operation among the classes. For example, short jobs might be given higher priority than long ones. It may be possible to generalize existing results on multi-class M/G/1 queues to such a system. To obtain an exact solution, it would be necessary to analyse the effect that breakdowns during idle periods have on the behaviour of queues. That too would be a worthy topic for future research.
The reviewers have suggested that it would be interesting to investigate a model where the recovery periods consist of the exact elapsed times since the last checkpoint, rather than being resampled from the appropriate distribution. That could also be a topic for future work.

Table 1. Summary of notations