Heavy-Traffic Optimal Size- and State-Aware Dispatching

Dispatching systems, where arriving jobs are immediately assigned to one of multiple queues, are ubiquitous in computer systems and service systems. A natural and practically relevant model is one in which each queue serves jobs in FCFS (First-Come First-Served) order. We consider the case where the dispatcher is size-aware, meaning it learns the size (i.e. service time) of each job as it arrives; and state-aware, meaning it always knows the amount of work (i.e. total remaining service time) at each queue. While size- and state-aware dispatching to FCFS queues has been extensively studied, little is known about optimal dispatching for the objective of minimizing mean delay. A major obstacle is that no nontrivial lower bound on mean delay is known, even in heavy traffic (i.e. the limit as load approaches capacity). This makes it difficult to prove that any given policy is optimal, or even heavy-traffic optimal. In this work, we propose the first size- and state-aware dispatching policy that provably minimizes mean delay in heavy traffic. Our policy, called CARD (Controlled Asymmetry Reduces Delay), keeps all but one of the queues short, then routes as few jobs as possible to the one long queue. We prove an upper bound on CARD's mean delay, and we prove the first nontrivial lower bound on the mean delay of any size- and state-aware dispatching policy. Both results apply to any number of servers. Our bounds match in heavy traffic, implying CARD's heavy-traffic optimality. In particular, CARD's heavy-traffic performance improves upon that of LWL (Least Work Left), SITA (Size Interval Task Assignment), and other policies from the literature whose heavy-traffic performance is known.


INTRODUCTION
Dispatching, or load balancing, is at the heart of many computer systems, service systems, transportation systems, and systems in other domains. In such systems, jobs arrive over time, and each job must be irrevocably sent to one of multiple queues as soon as it arrives. It is common for each queue to be served in First-Come First-Served (FCFS) order.
Motivated by the ubiquity of dispatching, we study a classical problem in dispatching theory: How should one dispatch to FCFS queues to minimize jobs' mean response time? We specifically consider size- and state-aware dispatching. This means that the dispatcher learns a job's size, or service time, when the job arrives; and the dispatcher always knows how much work, or total remaining service time, there is at each queue. We make typical stochastic assumptions about the job arrival process, working with M/G arrivals (see Section 2).
Despite the extensive literature on dispatching in queueing theory (see Section 1.2), optimal size- and state-aware dispatching is an open problem, as highlighted by Hyytiä et al. [23]. The problem is a Markov decision process (MDP), so it can in principle be approximately solved numerically [27]. But the numerical approach has two drawbacks. First, the curse of dimensionality makes computation impractical for large numbers of queues. Second, the solution is specific to a particular instance (meaning a given number of queues, job size distribution, and load), and one has to solve the MDP again for a different instance.

Our contributions
In this work, we take the first steps towards developing a theoretical understanding of optimal size- and state-aware dispatching, making two main contributions.
• We give the first lower bound on the minimum mean response time achievable under any dispatching policy (Theorem 3.1).
• We propose a new dispatching policy, called CARD (Controlled Asymmetry Reduces Delay), and prove an asymptotically tight upper bound on its mean response time (Theorem 3.3). We illustrate CARD in Figure 1.1.
Our upper and lower bounds match in the heavy-traffic limit as the load ρ approaches 1, the maximum load capacity. Specifically, we find an explicit constant C such that the dominant term of both bounds is C/(1 − ρ). This makes CARD the first policy to be proven heavy-traffic optimal, aside from the implicitly specified optimal policy. Characterizing the optimal constant C, which was previously unknown, is another contribution of our work.
How CARD outperforms previous policies. Below, we describe the intuition behind CARD's design in a two-server system. See Figure 1.1 for an illustration.
To minimize mean response time, one generally wants to avoid situations where small jobs need to wait behind large jobs. One way to do this is to dedicate one server to small jobs and the other server to large jobs, where the size cutoff between "small" and "large" is defined such that half the load is due to each size class. This is the approach taken by the SITA (Size Interval Task Assignment) policy [17, 18]. Under SITA, due to Poisson splitting, the dispatching system reduces to two independent M/G/1 systems. As shown by Harchol-Balter et al. [18], SITA can sometimes perform very well, but it can sometimes be much worse than simple LWL (Least Work Left) dispatching, under which the system behaves like a central-queue M/G/2.
As each of LWL and SITA can sometimes be worse than the other in heavy traffic, one might expect that they can be strictly improved upon. Indeed, in Appendix B.1, we show that in the two-server case, both LWL and SITA are strictly suboptimal in heavy traffic. But the question remains: where in LWL or SITA's design is there a specific opportunity for improvement?
Our key observation is that the main reason SITA performs poorly is that its "short server", namely the queue to which it sends small jobs, can accumulate lots of work. CARD avoids this issue by actively regulating the amount of work at the short server. To do so, CARD creates a third class of "medium" jobs, which are on the border between small and large, and sets a threshold which serves as a target amount of work at the short server. Whenever a medium job arrives, CARD dispatches it to the short server if and only if the short server has less work than the threshold. This prevents too much work from accumulating at the short server, and it also prevents the short server from unduly idling.

[Figure 1.1: Small and large jobs are always dispatched to the short or long server, respectively. Medium jobs are dispatched based on whether w_s, the amount of work at the short server, exceeds a threshold c. The size cutoffs s− and s+ are chosen so that small and large jobs each constitute slightly less than half the load.]
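In code, the two-server rule reduces to a couple of comparisons. The following sketch is ours, not the paper's; the names s_minus, s_plus, and c mirror the size cutoffs and work threshold described above:

```python
def dispatch_two_server(size, w_short, s_minus, s_plus, c):
    """Return 'short' or 'long' per the two-server CARD rule.

    Small jobs (size < s_minus) always go to the short server; large jobs
    (size >= s_plus) always go to the long server; medium jobs go to the
    short server iff its work w_short is at most the threshold c.
    """
    if size < s_minus:
        return "short"
    if size >= s_plus:
        return "long"
    return "short" if w_short <= c else "long"
```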
CARD's performance beyond heavy traffic. Of course, practical systems rarely operate at loads very near capacity, but our theoretical bounds on CARD's performance are admittedly not tight outside the heavy-traffic regime. As such, we also study CARD in simulation across a wider range of loads. We find empirically that CARD has good performance outside of heavy traffic, but slightly modifying CARD can significantly improve performance. Both the original and modified versions of CARD improve upon traditional heuristics like LWL and SITA, sometimes by an order of magnitude. The modified version is competitive with the Dice policy of Hyytiä and Righter [26], the best known heuristic for the size- and state-aware setting. See Figure 1.2 for an example where, at high load, CARD achieves reductions of over 75% relative to LWL and over 50% relative to SITA.
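The kind of simulation involved can be illustrated with a crude workload-level sketch. This is our own simplification, with Exp(1) job sizes and illustrative parameter values rather than the paper's tuned prescriptions; it compares a CARD-style rule against LWL:

```python
import random

def simulate(dispatch, lam=0.9, k=2, n_jobs=100_000, seed=1):
    """Workload-level simulation with Exp(1) job sizes: each server works
    at rate 1/k, and a job of size s sent to a server with w work has
    response time k*(w + s). Returns the empirical mean response time."""
    rng = random.Random(seed)
    works = [0.0] * k
    total = 0.0
    for _ in range(n_jobs):
        dt = rng.expovariate(lam)
        works = [max(0.0, w - dt / k) for w in works]  # drain at rate 1/k
        s = rng.expovariate(1.0)
        i = dispatch(s, works, rng)
        total += k * (works[i] + s)
        works[i] += s
    return total / n_jobs

def lwl(s, works, rng):
    # Least Work Left: join the queue with the least work.
    return min(range(len(works)), key=works.__getitem__)

def make_card(s_minus, s_plus, c):
    # CARD-style rule: last server is the long server.
    def card(s, works, rng):
        long_server = len(works) - 1
        if s >= s_plus:
            return long_server
        i = rng.randrange(len(works) - 1)  # uniformly random short server
        if s < s_minus or works[i] <= c:
            return i
        return long_server
    return card
```

Tracking only the work at each server suffices here because a job's response time is determined by the work it sees on arrival plus its own size.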
Outline.The remainder of the paper is organized as follows.
• Section 1.2 reviews related work.
• Section 2 presents our model and defines the CARD policy.
• Section 3 states our main results and gives some intuition for why they hold.
• Sections 4-6 prove our results: a lower bound on the performance, namely mean response time, of any policy (Section 4); stability of CARD (Section 5); and an upper bound on CARD's performance, which implies its heavy-traffic optimality (k = 2 servers in Section 6, general case in Appendix B.4).
• Section 7 studies CARD outside of heavy traffic via simulation.
We note that a preliminary version of this work appeared as a three-page workshop abstract [57], but it was extremely limited compared to the current version: it treated only the case of two servers and two job sizes, it did not provide any lower bound, and it omitted all proofs.

Related Work
FCFS dispatching with incomplete information. Whether a dispatching policy is optimal depends critically on the information available to the dispatcher. When the size of the arriving job is unknown but server states (e.g. number of jobs at each server, work at each server, etc.) are known (state-aware), then, depending on the server-state information, Round-Robin (RR) [7, 34, 35], Join-Shortest-Queue (JSQ) [51, 55], and LWL [1, 5, 11, 31] are shown to be optimal. The common key idea of these policies is to join the queue with the least (or least expected) amount of work.
When only the sizes and the distribution of the arriving jobs are known, SITA is known to be optimal [9]. But this result assumes that the dispatching policy must be entirely static. Recently, it was shown that combining SITA with RR can improve performance [2, 25]; this augments SITA with just a little bit of memory, namely which servers most recently received a job.
Perhaps the closest the SITA line of work gets to size- and state-aware dispatching is the SITA-JSQ policy proposed by Wang and Down [50], in which the dispatcher uses the size of the arriving job and the number of jobs at each server to make dispatching decisions. CARD is in some ways similar to SITA-JSQ, particularly the "multi-band" variant of CARD introduced in our simulation study (Section 7). But SITA-JSQ does not actively control the amount of work in each queue, and in particular does not maintain a large imbalance between queues. Our lower bound (Section 4) shows this imbalance is necessary for heavy-traffic optimality.
FCFS size- and state-aware dispatching. For size- and state-aware FCFS dispatching, various heuristics have been proposed and studied in simulations. Many of them are based on approximate dynamic programming, e.g. [22, 24, 27]. Another class of policies, called sequential dispatching policies, is introduced in [23]. Among the sequential dispatching policies, Dice [23] shows superior performance in simulations and is among the best heuristics that have been developed. In our simulations (Section 7), Dice often slightly outperforms CARD. However, there is so far no theoretical analysis of the performance of Dice, even in heavy traffic.
Heavy-traffic Optimality Results. The aforementioned optimality results are strong in the sense that they either show stochastic ordering optimality on sample paths, or show optimality at any load. For more complicated policies and systems, characterizing the mean response time at an arbitrary load is a difficult task. Therefore, a large number of works focus on analyzing the heavy-traffic regime and establishing optimality therein. One approach is to prove optimality via process limits, e.g. [30, 48]. Such approaches focus on the transient regime, and an interchange of limits is usually not established for analysis in steady state. Another approach is to work directly in the stationary regime and establish heavy-traffic optimality results on mean response times in steady state, e.g. [8, 59, 61]. However, these optimality results focus on settings where job sizes are unknown, so they do not address our goal of optimal size-aware dispatching.
Tools and Methodology. Recently, Eryilmaz and Srikant [8] introduced and popularized a Lyapunov drift-based approach for studying the steady-state performance of queueing systems in heavy traffic. The approach has been adopted in studying various switches (e.g. [21, 28, 29, 32, 36, 37, 49]), load-balancing algorithms (e.g. [20, 33, 52, 58, 59, 61]), and other stochastic models (e.g. wireless scheduling, Stein's method, mean-field models). In some sense, our paper applies the drift method to continuous-time, continuous-state Markov processes. Our use of the Rate Conservation Law [42] parallels the use of the "zero drift" condition in drift analysis. An important step in drift analysis is establishing state-space collapse; we prove a result of this type in Lemma 5.2.
Other Relevant Work. When scheduling is allowed at the servers, optimal dispatching policies can be very different. When there are multiple parallel SRPT servers, Down and Wu [6] propose a dispatching policy and show its optimality using a diffusion limit argument. Grosof et al. [14] develop a dispatching policy, called Guardrails, that achieves optimal mean response time in heavy traffic. Both of these prior dispatching policies involve, roughly speaking, balancing work evenly across the multiple SRPT servers. This is in contrast to CARD, which maintains a large imbalance between the multiple FCFS servers. One interpretation is that while SRPT prioritizes jobs at each individual server, CARD prioritizes jobs at the dispatching stage, namely by sending shorter jobs to servers with less work.
In recent years, learning-based dispatching policies have also been studied in the literature [12, 44]. CARD involves tuning some parameters that depend on the job size distribution, and we thus assume knowledge of the job size distribution. An interesting question for future work is whether CARD's parameters could be learned online in settings where the job size distribution is unknown.
In the context of scheduling jobs on a single server, where SRPT (Shortest Remaining Processing Time) is known to be optimal [45], Chen and Dong [4] show that having two priority classes is sufficient for good performance in heavy traffic. The heavy-traffic performance of CARD ends up roughly equivalent to the performance of a single-server system with two priority classes. However, we cannot match the performance demonstrated by Chen and Dong [4]: they decrease the fraction of load in the lower-priority class to zero in heavy traffic, whereas CARD's "lower-priority jobs", namely those sent to the long server, must constitute a roughly 1/k fraction of the load.

Model Description
We consider a system of k ⩾ 2 identical FCFS (First-Come, First-Served) servers, each of which has its own queue. The system has one central dispatcher, which immediately dispatches jobs to a server when they arrive. We consider M/G job arrivals with (Poisson) arrival rate λ and job size distribution S. We assume E[S²] < ∞. The system load, namely the average rate at which work arrives, is ρ = λE[S]. We assume a server never idles unless there are no jobs present in its queue. We use the convention that each server completes work at rate 1/k, so a job of size s requires ks time in service. This convention means the largest possible stability region is ρ ∈ [0, 1), regardless of the number of servers k. The convention is also convenient when comparing our system's performance to that of a "resource-pooled" M/G/1 with the same arrival process and server speed 1.
We write E[W_M/G/1] for the mean amount of work in such a resource-pooled M/G/1.
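For concreteness, E[W_M/G/1] is given by the Pollaczek-Khinchine formula for the mean workload of a stable M/G/1 with a unit-speed server (a standard fact, not specific to this paper):

```python
def mg1_mean_work(lam, ES, ES2):
    """Pollaczek-Khinchine mean workload E[W_M/G/1] in a stable M/G/1
    with a unit-speed server: lam * E[S^2] / (2 * (1 - rho))."""
    rho = lam * ES
    if rho >= 1:
        raise ValueError("requires rho < 1")
    return lam * ES2 / (2.0 * (1.0 - rho))
```

For example, with Exp(1) job sizes (E[S] = 1, E[S²] = 2) and λ = 0.9, the mean work is 9, illustrating the 1/(1 − ρ) growth in heavy traffic.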
We consider size- and state-aware dispatching policies. That is, when a job arrives, the dispatcher may use both the job's size and the system state to decide where to dispatch it. For our purposes, the most important aspect of the system state is the amount of work remaining at each server. We write w_i for the amount of work at server i (but see also Section 2.2), W = (w_1, . . ., w_k) for the vector of work amounts, and w_all = Σ_{i=1}^{k} w_i for the total work. We write w_i(t) or W(t) when discussing work at a specific time t.
The main metric we consider is mean response time. A job's response time is the amount of time between its arrival and completion. Due to our 1/k service rate convention, if a job of size s is dispatched to a server with w work, the job's response time is k(w + s). We write E[T_π] for the mean response time over all jobs (in the usual limiting long-run average sense) under policy π.
Purely for simplicity of notation, we assume the job size distribution S has no atoms. This is to ensure that expressions like E[S I(S < x)] are continuous functions of x. One can generalize all of our definitions and results to distributions with atoms using a lexicographic ordering trick.³
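The lexicographic trick can be made concrete as follows (a minimal sketch with names of our choosing; U is the uniform tiebreaker and theta the interpolation parameter from the footnote):

```python
import random

def is_small(size, cutoff, theta, rng=random):
    """Compare (size, U) to (cutoff, theta) lexicographically, where U is an
    i.i.d. Uniform[0, 1) tiebreaker attached to the job. For size != cutoff
    this agrees with size < cutoff; at an atom size == cutoff, a theta
    fraction of jobs count as small, interpolating between the two sides."""
    u = rng.random()
    return (size, u) < (cutoff, theta)
```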

Defining the CARD Policy
We now introduce our policy, CARD, which stands for Controlled Asymmetry Reduces Delay. We first present it in the context of k = 2 servers, then generalize to k ⩾ 2 servers.
CARD for two servers. In the k = 2 case, CARD designates server 1 as the short server and server 2 as the long server. To emphasize this, when discussing CARD, we write w_s = w_1 and w_ℓ = w_2 for the work at the short and long servers, respectively.
CARD has three threshold parameters to set:
• The two size thresholds s− and s+, with 0 ⩽ s− ⩽ s+, divide jobs into small, medium, and large (see below).
• The work threshold c, with c ⩾ s+, is, roughly speaking, a target work level for the short server.
Based on these parameters, CARD dispatches jobs as follows (see also Figure 1.1):
• A small job, namely one with size in [0, s−), is always dispatched to the short server.
• A medium job, namely one with size in [s−, s+), is dispatched to the short server if w_s ⩽ c, and to the long server otherwise.
• A large job, namely one with size in [s+, ∞), is always dispatched to the long server.

Setting CARD's parameters. There are a range of ways to set s−, s+, and c that yield stability and heavy-traffic optimality. We specify these formally in the statements of Theorems 3.2 and 3.3, but we highlight the key points here (see also Section 2.3).
The size thresholds s− and s+ should be chosen such that small jobs and large jobs each constitute less than half the load. Formally, we require E[S I(S < s−)] < (1/2) E[S] and E[S I(S ⩾ s+)] < (1/2) E[S]. In particular, we have s− < x* < s+, where x* is the solution to E[S I(S < x*)] = (1/2) E[S]. As we show in our lower bound (Theorem 3.1), this value x* is in some sense the ideal cutoff between small and large jobs. As such, it is important that in heavy traffic, either s− → x* or s+ → x* (or both). We do the former in our upper bound (Theorem 3.3).
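As a concrete example of computing the ideal cutoff (our own illustration, assuming Exp(1) job sizes, for which E[S I(S < x)] = 1 − e^{−x}(1 + x)), one can solve E[S I(S < x*)] = (1/2) E[S] by bisection:

```python
import math

def small_load(x):
    # E[S I(S < x)] for S ~ Exp(1): integral of s*e^{-s} from 0 to x.
    return 1.0 - math.exp(-x) * (1.0 + x)

def solve_cutoff(target, lo=0.0, hi=50.0, iters=100):
    """Bisection for the x with E[S I(S < x)] = target.

    small_load is continuous and increasing, so bisection converges."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if small_load(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

x_star = solve_cutoff(0.5)  # half of E[S] = 1, i.e. the k = 2 cutoff
```

For Exp(1) sizes this gives x* ≈ 1.68, noticeably above the mean size of 1.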
The work threshold c must balance a tradeoff between two concerns. On one hand, we want there to be little work at the short server so that small jobs have low response times. On the other hand, we do not want the short server to run out of work, as excessive idling could increase response times or even cause instability. Roughly speaking, this means setting c = Θ(1/(1 − ρ)^γ) for a suitable choice of γ ∈ (0, 1).
It is convenient in our proofs to ensure c ⩾ s+, so we assume this throughout. It also makes intuitive sense that a single medium job should not bring the short server from empty to above the work threshold. However, this assumption can be easily relaxed at the cost of a little more computation in the proofs.
Generalizing CARD to any number of servers. We now generalize the above policy to k ⩾ 2 servers. Here we focus on an extension that prioritizes simplicity of analysis while still achieving optimal heavy-traffic performance. In our simulation study (Section 7), we consider a more complex variant which has better performance at practical loads.

³ Have the system assign each job an i.i.d. uniform U ∈ [0, 1] independent of its size S, and replace comparisons S < x with comparisons (S, U) ≺ (x, θ) for some θ ∈ [0, 1], where ≺ is the lexicographic order. If E[S I((S, U) ≺ (x, θ))] has a jump discontinuity at x, varying θ interpolates continuously between the left and right limits.

The basic idea of k-server CARD is to reduce to the two-server case. We use the same three parameters s−, s+, and c, and we define small, medium, and large jobs in the same way. The only difference is that instead of one short and one long server, we use k − 1 short servers 1, . . ., k − 1 and a single long server k. We thus write w_s,i = w_i and w_ℓ = w_k when discussing k-server CARD. Abusing notation slightly, we write simply w_s when discussing a generic short server whose index is not important. Jobs are dispatched as follows:
• A small job is always dispatched to a uniformly random short server.
• A medium job is dispatched as follows. The dispatcher selects a uniformly random short server i ∈ {1, . . ., k − 1} and inspects its amount of work w_s,i. If w_s,i ⩽ c, the job is dispatched to the chosen short server i, and if w_s,i > c, it is dispatched to the long server.
• A large job is always dispatched to the long server.
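The k-server rule above can be sketched as follows (our own illustration; the zero-based server indexing and parameter names are ours):

```python
import random

def dispatch_card(size, works, s_minus, s_plus, c, rng=random):
    """Return the index of the server a job joins under k-server CARD.

    Servers 0, ..., k-2 are short; server k-1 is long. A medium job
    inspects only one uniformly sampled short server."""
    k = len(works)
    if size >= s_plus:            # large job: always to the long server
        return k - 1
    i = rng.randrange(k - 1)      # uniformly random short server
    if size < s_minus:            # small job: always to the sampled server
        return i
    return i if works[i] <= c else k - 1   # medium job: threshold rule
```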
Another way to view k-server CARD is in the following distributed manner. Suppose that instead of one dispatcher, we have k − 1 independent "subdispatchers", each associated with a short server, and suppose that each job arrives at a uniformly random subdispatcher. Then k-server CARD is the result of each subdispatcher using two-server CARD, except that they all share the same long server.
The way we set the parameters of k-server CARD is essentially the same as for two-server CARD. The only difference is that instead of wanting small and large jobs to each have less than half the load, we want small jobs to constitute less than a 1 − 1/k fraction of the load, and large jobs less than a 1/k fraction. We therefore require E[S I(S < s−)] < (1 − 1/k) E[S] and E[S I(S ⩾ s+)] < (1/k) E[S]. This means s− < x* < s+, where now x* is the solution to E[S I(S < x*)] = (1 − 1/k) E[S].

Key Definitions for Main Results and Analysis
We state our main results and perform our analysis in terms of the following quantities.
Drift-related quantities.The following quantities are related to characterizing drifts, which are the average rates at which work increases or decreases in various situations.
• Let ρ_s, ρ_m, and ρ_ℓ be the loads due to small, medium, and large jobs, respectively: ρ_s = λE[S I(S < s−)], ρ_m = λE[S I(s− ⩽ S < s+)], and ρ_ℓ = λE[S I(S ⩾ s+)].
• Let μ and ν be the following quantities related to the drift of w_s: if w_s > c, then w_s has drift −μ, and if 0 < w_s ⩽ c, then w_s has drift +ν.
• Let β ∈ (0, ν] be a bound on the probability that the short server is idle, i.e. P[w_s = 0] ⩽ β. We show how to set CARD's parameters to achieve this bound in Theorem 3.2(a).
To specify CARD's s− and s+ parameters, it suffices to specify μ and ν: these determine ρ_s and ρ_m, which in turn determine s− and s+. Moreover, for any given β, we show in Theorem 3.2 how to set CARD's c parameter to achieve P[w_s = 0] ⩽ β. As such, instead of specifying s−, s+, and c directly, we specify μ, ν, and β.
In particular, Theorem 3.3 specifies how μ, ν, and β should scale as functions of ε.
Heavy traffic. Our main results consider the ε ↓ 0 limit, where ε = 1 − ρ, which we call the heavy-traffic regime. This is equivalent to λ ↑ 1/E[S]. In particular, we leave the number of servers k fixed.
Underlying our results are explicit bounds that hold even outside the limiting regime (see e.g. Theorem 6.11). Because of our focus on heavy traffic, we assume for convenience that ε < 1/k. In particular, this ensures we can set ν > 0, which ensures that w_s always drifts towards c. The case where ε > 1/k and ν < 0 is less interesting, as then both w_s and w_ℓ always drift towards 0.

Performance-related quantities. The following quantities are used in our response time bounds (Theorems 3.1 and 3.3). Define C_CARD and x* such that (2.1) holds. This characterization of x* is equivalent to the aforementioned one in terms of E[W_M/G/1], the mean work in a resource-pooled M/G/1 (Section 2.1).

MAIN RESULTS AND KEY IDEAS
We now present our main results, followed by some intuition for why they hold.See Sections 4-6 for the proofs, with some details deferred to Appendix B.
Our first result is a lower bound on the mean response time for any dispatching policy.
The rest of our results are about CARD: stability for all ε > 0, and heavy-traffic optimality as ε ↓ 0. Both results are stated as sufficient conditions on CARD's parameters under which CARD achieves the corresponding property. See Sections 2.2 and 2.3 for descriptions of and notation for CARD's parameters. If the parameters are set appropriately in the ε ↓ 0 limit, then CARD achieves mean response time matching our lower bound to leading order. In particular, CARD is heavy-traffic optimal: lim sup_{ε ↓ 0} E[T_CARD]/E[T_π] ⩽ 1 for any dispatching policy π.
Proof. See Section 6 for the case of k = 2 servers and Appendix B.4 for the general case.

Intuition for Lower Bound on All Policies
We now give some intuition for Theorem 3.1. We focus on the heavy-traffic regime. By Little's law, lower bounding the mean response time amounts to lower bounding E[N], the mean number of jobs in the system. The key idea is to relate E[N] to the mean amount of work E[w_all]. This is helpful because one can easily show E[w_all] ⩾ E[W_M/G/1] (see e.g. Theorem 6.1).
How can we relate E[N] to E[w_all]? In heavy traffic, most jobs in the system are waiting in a queue and have yet to enter service. We thus approximate E[N] ≈ E[w_all]/E[S_queue], where E[S_queue] is the mean size of jobs waiting in a queue. This means minimizing E[N] amounts to maximizing the mean size of jobs waiting in a queue. This makes sense in light of the fact that, when studying scheduling policies beyond FCFS, serving small jobs ahead of large jobs reduces mean response time [45].
What is the largest that E[S_queue] can be? Because we are restricted to FCFS service, the only mechanism by which we can affect the sizes of jobs in the system is dispatching. In particular, we can dispatch jobs of different sizes to different servers. Suppose, for example, that servers 1, . . ., k − 1 have a negligible amount of work, meaning nearly all of the work is at server k. Then E[S_queue] would be the average size of jobs dispatched to server k, which could be much greater than E[S]. The best we could hope for is E[S_queue] = E[S | S ⩾ x] for as high a threshold x as possible. But in heavy traffic, we need server k to handle a 1/k fraction of the load, so the largest possible x solves E[S I(S ⩾ x)] = (1/k) E[S]. This is equivalent to the characterization of x* from (2.1), so it leads to E[S_queue] = E[S | S ⩾ x*]. To make this reasoning rigorous, it turns out that reasoning directly in terms of E[S_queue] is difficult. We instead prove Theorem 3.1 using a potential-function approach. However, the potential function and the manipulations we perform on it were directly inspired by this intuition: the best-case scenario is to dedicate one server to the jobs of size at least x*, and to ensure that all other servers have a negligible amount of work.
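Continuing the Exp(1) example (an illustrative assumption of ours): with k = 2, the cutoff x* ≈ 1.6783 satisfies E[S I(S ⩾ x*)] = (1/2) E[S], and by memorylessness the mean size of jobs at the long server is E[S | S ⩾ x*] = x* + 1:

```python
import math

def tail_load(x):
    # E[S I(S >= x)] for S ~ Exp(1): integral of s*e^{-s} from x to infinity.
    return (x + 1.0) * math.exp(-x)

def mean_size_at_long_server(x):
    # E[S | S >= x] = E[S I(S >= x)] / P(S >= x); equals x + 1 for Exp(1)
    # by memorylessness.
    return tail_load(x) / math.exp(-x)

x_star = 1.6783  # approximately solves E[S I(S >= x)] = 0.5 for Exp(1)
```

So queued jobs at the long server average size about 2.68, versus an overall mean of 1: for a fixed amount of work, keeping only such large jobs waiting means roughly 2.68 times fewer jobs in queue.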

Intuition for Upper Bound on CARD
We now give some intuition for Theorem 3.3. By the lower bound intuition above, CARD is already well on its way to achieving the best-case scenario: it attempts to keep the amount of work at the k − 1 short servers near c, and the long server only serves medium and large jobs. To show CARD matches the lower bound in heavy traffic, it would suffice to show the following.
• CARD does not have much more work than a resource-pooled M/G/1: E[w_all] ≈ E[W_M/G/1].
- Roughly speaking, this amounts to showing that we avoid situations where one server is idle while another server has lots of work (see Theorem 6.1).
• CARD's short servers do not exceed c work by too much: E[w_s] ≈ c.
- We also need to set c such that it is negligible in heavy traffic.
• CARD rarely dispatches medium jobs to the long server.

Our main tool for showing these and related properties is examining what we call below-above cycles. Consider a particular short server. It alternates between below periods, during which w_s ⩽ c, and above periods, during which w_s > c. It turns out that much of our analysis rests on below-above cycles not being too long. One reason for this is that when enough short servers are in above periods, the long server is temporarily overloaded. Long periods of transient overload could cause E[w_all] to be significantly greater than E[W_M/G/1]. Short below-above cycles prevent this possibility. See Section 6.1 for more details about how we use below-above cycles.
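Below-above cycles are easy to observe empirically. The following sketch (our own, with Exp(1) sizes and illustrative parameters rather than the paper's) simulates a single short server's work w_s in isolation and counts down-crossings of c, i.e. completed below-above cycles:

```python
import random

def simulate_short_server(lam=0.95, s_minus=1.4, s_plus=1.9, c=5.0,
                          k=2, n_arrivals=200_000, seed=0):
    """Track w_s for one short server in isolation (k = 2, Exp(1) sizes).

    Small jobs always join; medium jobs join iff w_s <= c; everything
    else goes to the long server, which is not modeled here. Returns the
    fraction of time the short server is idle and the number of
    down-crossings of c (completed below-above cycles)."""
    rng = random.Random(seed)
    w = 0.0
    t = idle = 0.0
    crossings = 0
    for _ in range(n_arrivals):
        dt = rng.expovariate(lam)
        t += dt
        above_before = w > c
        if dt / k >= w:               # server speed is 1/k
            idle += dt - k * w        # drained everything, then sat idle
            w = 0.0
        else:
            w -= dt / k
        if above_before and w <= c:
            crossings += 1            # an above period just ended
        s = rng.expovariate(1.0)
        if s < s_minus or (s < s_plus and w <= c):
            w += s
    return idle / t, crossings
```

With these parameters, w_s drifts up toward c during below periods (small plus medium work exceeds the service rate 1/2) and drifts down during above periods (small work alone falls short of it), so cycles repeat indefinitely.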

UNIVERSAL LOWER BOUND
Theorem 3.1. Under any dispatching policy π and for any ρ ∈ (0, 1), mean response time is lower bounded as stated in Section 3.

Before diving into the proof, we give the high-level idea for k = 2 servers. Suppose an arrival occurs while w_1 < w_2. Because each server processes work at rate 1/2, the arrival's response time if it were sent to queue i would be 2(w_i + s), so the "benefit" of sending it to queue 1 instead of queue 2 is 2(w_2 − w_1). Reasoning symmetrically when w_2 < w_1, we conclude that the benefit of dispatching jobs to the shorter queue is proportional to |w_2 − w_1|.

The main challenge is therefore to show that no dispatching policy can both frequently dispatch to the shorter queue and also maintain a large difference |w_2 − w_1| between the queues. The key observation is that if we dispatch a job to the shorter queue, then |w_2 − w_1| decreases, so the next arrival would see less benefit. That is, we can view |w_2 − w_1| as a type of resource: dispatching jobs to the shorter queue depletes it, while dispatching jobs to the longer queue replenishes it. It is thus best to dispatch shorter jobs to the shorter queue, which slowly depletes |w_2 − w_1|, and longer jobs to the longer queue, which quickly replenishes |w_2 − w_1|. To formalize the idea of viewing |w_2 − w_1| as a resource, we use the potential function (1/2)(w_2 − w_1)². The proof below handles any number of servers k. The idea is essentially the same as in the k = 2 case, except we look at the work differences |w_i − w_j| for every pair of servers i ≠ j.
Proof of Theorem 3.1. Consider an arbitrary stationary dispatching policy π. We first introduce notation for π's dispatching decisions. Suppose a job of random size S arrives and observes work vector W = (w_1, . . ., w_k). We denote by w_choice the work at the queue the arrival is dispatched to. Note that while S is independent of W, it is not independent of w_choice. We also write w_all = Σ_{i=1}^{k} w_i for the total work at all queues. Because each server does work at rate 1/k, we can write E[T_π] as

E[T_π] = k E[S] + E[w_all] + E[k w_choice − w_all]. (4.1)

The main task is to give a lower bound on E[k w_choice − w_all]. To do so, we apply the rate conservation law of Miyazawa [42] to f(W), where f(W) = (1/2) Σ_{i<j} (w_i − w_j)². The value of f(W) can change in two ways.
• Work is done continuously at each nonempty queue. We denote the resulting average rate of change of f(W) by E[Δf(W)].
• Arrivals add work to whichever queue the dispatcher chooses. By PASTA (Poisson Arrivals See Time Averages) [56], this yields average rate of change λ E[f(W + S e_choice) − f(W)], where e_choice is the standard basis vector with a 1 indicating the queue the job is dispatched to.

The rate conservation law [42] states that the average rate of change of f(W) is zero, so

E[Δf(W)] + λ E[f(W + S e_choice) − f(W)] = 0. (4.2)

We now investigate each of the two terms in (4.2). We first observe that E[Δf(W)] ⩽ 0, because in the absence of arrivals, for any two queues i and j, the absolute difference |w_i − w_j| either decreases (if exactly one of the two servers is idle) or stays constant (otherwise). Therefore, E[f(W + S e_choice) − f(W)] ⩾ 0. Expanding the definition of f(W) and writing Σ_{j ≠ choice} for sums over all queues other than the one the job is dispatched to, we obtain

0 ⩽ E[Σ_{j ≠ choice} (S(w_choice − w_j) + S²/2)] = E[S(k w_choice − w_all)] + (k − 1) E[S²]/2.

Rearranging, and using the fact that −w_all ⩽ k w_choice − w_all ⩽ (k − 1) w_all, we obtain a lower bound (4.3) on E[k w_choice − w_all], where the key step (a) follows from the fact that an arriving job's size S is independent of the work vector W it observes upon arrival. We now substitute the bound from (4.3) into (4.1). The theorem then follows from E[w_all] ⩾ E[W_M/G/1] (see e.g. Theorem 6.1) and (2.1).

CARD STABILITY ANALYSIS
Proving CARD's stability is more than a straightforward application of the Foster-Lyapunov theorem, which is widely used to establish stability of queueing systems. The main obstacle here is that the long server alternates between being underloaded and overloaded. It is thus difficult to find a Lyapunov function whose drift is negative outside a compact set.
To overcome this obstacle, we use a result of Foss et al. [10, Theorem 1]. Notice that, under CARD, w_s is itself a Markov process, because the decision of where to dispatch a job depends only on the work at the short server. Roughly, [10, Theorem 1] says that since w_s is a Markov process of its own, if it is ergodic, then it suffices to do a drift analysis of w_ℓ, averaged over the stationary distribution of w_s. Of course, we first need to show that w_s is ergodic. Our proof of CARD's stability therefore proceeds in three steps.
• We show that the short server's work w_s(t), as a Markov process of its own, is Harris ergodic (Lemma 5.1).
• With the stability of w_s(t) in hand, we bound the idleness probability of the short server in steady state (Lemmas 5.2 and 5.3).
• We apply the result of Foss et al. [10, Theorem 1] to show stability whenever the long server is on average not overloaded (Theorem 3.2). Our bound on the short server's idleness probability from the previous step thus gives a sufficient condition for stability.
Armed with these key ideas, the proofs themselves are relatively straightforward, with the bulk of the work being computation. As such, we defer most of the computational details to Appendix B.2.
Proof sketch. The proof uses a Foster-Lyapunov theorem for continuous-time Markov processes [41, Theorem 4.2]. The key step is to verify that the Lyapunov function V(W_s) = W_s has bounded drift when W_s ⩽ t and negative drift when W_s > t. The latter holds because when W_s > t, we send only small jobs to the short server. We defer the details to Appendix B.2.
We establish our short server idleness bound by first proving a general bound on the probability that W_s is below the threshold t by a given amount y. The idleness bound follows by plugging in y = t. Proof sketch. This result is a Chernoff-type bound on (t − W_s)+, so the main task is to bound E[exp(θ(t − W_s)+)]. We do this by applying the rate conservation law [42] to exp(θ(t − W_s)+). We defer the details to Appendix B.2. Lemma 5.3. We have the following bound on the idleness of the short server. Because all small and medium jobs have size at most c_+, we can apply Lemma 5.2, from which the bound follows by the computation below, setting y = t. We defer the proof of Theorem 3.2 to the appendix.

CARD MEAN RESPONSE TIME ANALYSIS
With the lower bound from Theorem 3.1 in mind, our next step is to establish an upper bound on the mean response time under CARD. We focus here on the two-server case. The general case uses the same ideas but involves more complicated computations, so we defer its proof to Appendix B.4.
Let E[T_CARD,s], E[T_CARD,m], and E[T_CARD,ℓ] be the mean response times of small, medium, and large jobs under CARD, respectively. We have (6.1), where the inequality follows from how CARD dispatches jobs, the PASTA property [56], and the fact that each server completes work at rate 1/2. The main difficulty of analyzing (6.1) lies in bounding the work terms that appear on its right-hand side. We now give a high-level overview of the obstacles and our approach.
where E[W_M/G/1] is the steady-state work in an M/G/1 with arrival rate λ and job size distribution S.
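For concreteness, E[W_M/G/1] has the classical closed form given by the Pollaczek-Khinchine formula, E[W_M/G/1] = λE[S²]/(2(1 − ρ)) with ρ = λE[S]. A minimal numeric sketch (the function name is ours, not from the paper):

```python
# Mean steady-state work in an M/G/1 queue via the Pollaczek-Khinchine
# formula: E[W] = lambda * E[S^2] / (2 * (1 - rho)), where rho = lambda * E[S].
def mg1_mean_work(lam, es, es2):
    rho = lam * es
    assert rho < 1, "queue must be stable"
    return lam * es2 / (2 * (1 - rho))

# Exponential job sizes with rate mu: E[S] = 1/mu, E[S^2] = 2/mu^2,
# so E[W] reduces to rho / (mu * (1 - rho)).
mu, lam = 1.0, 0.9
w = mg1_mean_work(lam, 1 / mu, 2 / mu**2)
print(w)  # 9.0, matching rho / (mu * (1 - rho)) = 0.9 / 0.1
```

Note how the bound blows up like 1/(1 − ρ) as load approaches capacity, which is the scaling all the heavy-traffic comparisons in this paper are about.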
The key component we need to bound from Theorem 6.1 is E[W_all]. The main difficulty here is to bound E[W_ℓ I(W_s = 0)]. Since CARD dispatches differently to the long server based on the state of the short server, W_ℓ depends on the state of W_s. Such a dependency also poses challenges in analyzing E[W_ℓ]. Under CARD, W_s alternates between being above and below the threshold t. This behavior naturally leads to renewal intervals consisting of "above" periods and "below" periods. Definition 6.2. We partition time into alternating intervals, called below periods and above periods, according to whether W_s is below or above the threshold t. A below-above cycle is then a complete below period followed by a complete above period. Below-above cycles start at times τ for which W_s(τ) = t, so we can partition time into below-above cycles. We introduce the following notation for working with below periods, above periods, and below-above cycles: • We write E^0_B[•] for the Palm expectation [3] taken at the start of a below-above cycle. Roughly speaking, E^0_B[•] = "E[• | a below period starts at time 0]", but the formal definition avoids conditioning on a measure-zero event.
• In the context of a below-above cycle starting at time 0, meaning W_s(0) = t, we denote the lengths of the below and above periods by B and A, respectively. Abusing notation slightly, we also use B and A to denote the lengths of the below and above periods in a generic below-above cycle, not necessarily one that starts at time 0.
Why are above and below periods helpful for analyzing CARD? Within an above or below period, CARD does not change how it dispatches jobs, making it easier to analyze W_ℓ within one below-above cycle. The Palm inversion formula [3], which is a generalization of the celebrated renewal-reward theorem, allows us to connect the average behavior of W_ℓ within one below-above cycle to a steady-state average. Our high-level idea is to relate both of these quantities to E^0_B[W_ℓ(0)], the mean work at the long server at the start of a below-above cycle. We show in Lemmas 6.6 and 6.7 that, roughly speaking, (6.2) holds. The rest of this section is organized as follows.
• Section 6.2 analyzes the behavior of the short server. In particular, we show that below-above cycles are not too long.
• Section 6.3 analyzes the behavior of the long server. Using the fact that below-above cycles are not too long, we show (6.2). As part of this, we bound E[W_all].
• Section 6.4 assembles the pieces to prove Theorem 3.3.

Analyzing the Short Server and Below-Above Cycles
In this section, we bound various quantities relating to the work at the short server and the below-above cycles. Of particular importance are the mean excesses of the below and above periods, E[B_e] and E[A_e], as they are used to better understand the relationship between E[W_ℓ] and E^0_B[W_ℓ(0)]. The techniques we use to obtain bounds on E[B_e] and E[A_e] also immediately yield bounds on E[B] and E[A]. Despite not using these bounds, we state them too, as they help complete the picture of how the system behaves.
As a reminder, the excess or equilibrium distribution of a random variable X is the distribution X_e whose probability density function is f_{X_e}(x) = P[X > x]/E[X]. The excess arises naturally in renewal theory [3, 16, 43]. Most important for our purposes is the fact that E[X_e] = E[X^2]/(2 E[X]). Proof. Suppose that at time 0, the short server has W_s(0) = w ⩽ t work, so time 0 is in a below period. Let D(w) be the time until the end of the below period. We will show (6.4), where the last step follows because t ⩾ c_+ (Section 2.2). This implies all three of the bounds.
• A below period starts with t work at the short server, so the E[B] bound follows.
• Each of the excess quantities we need can be interpreted as the distribution of the amount of time until the below period ends, starting from some random amount of work at the short server, so its mean can be written as E[D(V)] for an appropriate random variable V.
It remains only to show (6.4), which we do using a supermartingale argument. Suppose W_s(0) = w as above, and let M(τ) be W_s(τ) compensated by its mean drift during the below period. We now show that M(τ) is a supermartingale with respect to the Markov process W_s(·). Let
• Δ_s(τ1, τ2) be the amount of work completed by the short server during (τ1, τ2], and
• Σ_s(τ1, τ2) be the amount of work that arrives to the short server during (τ1, τ2].
For any 0 ⩽ τ1 ⩽ τ2, one checks that the conditional expected increment of M is at most zero, so M(τ) is indeed a supermartingale. Applying the optional stopping theorem to M(τ) at time D(w), which we justify below, yields (6.4). Above, (a) uses the fact that all medium jobs have size at most c_+, so at the moment the below period ends, the short server's work can jump to at most t + c_+.
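The identity E[X_e] = E[X^2]/(2E[X]) is easy to check numerically. A hedged sketch (helper names are ours), using the standard fact that X_e is distributed as U · X_sb, where X_sb is a size-biased draw from X and U is an independent Uniform(0, 1):

```python
import random

def excess_mean_formula(ex, ex2):
    # E[X_e] = E[X^2] / (2 E[X]): mean of the equilibrium (excess) distribution
    return ex2 / (2 * ex)

# Monte Carlo check: X_e is distributed as U * X_sb, where X_sb is a
# size-biased draw (chosen with probability proportional to x) and U is
# an independent Uniform(0, 1).
rng = random.Random(0)
xs = [rng.random() for _ in range(100_000)]        # X ~ Uniform(0, 1)
sb = rng.choices(xs, weights=xs, k=100_000)        # size-biased samples
mc = sum(rng.random() * x for x in sb) / len(sb)   # mean of X_e draws

# For X ~ Uniform(0, 1): E[X] = 1/2, E[X^2] = 1/3, so E[X_e] = 1/3.
print(round(excess_mean_formula(0.5, 1 / 3), 4))   # 0.3333
# mc should be close to 1/3 as well
```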
All that remains is to verify that we can indeed apply the optional stopping theorem.
• Until the end of the above period, W_s − t evolves like the amount of work in an M/G/1 queue with server speed 1/2, job size distribution (S | S < c_-), and work arrival rate less than 1/2. This means (W_s − t | W_s > t) has the same distribution as the work in an M/G/1 with vacations, where the vacation length distribution is that of W_s − t at the start of an above period. The desired bounds follow from the work decomposition formula for the M/G/1 with vacations [13] and the observation that both job sizes and vacation lengths are bounded by c_+. We defer the details to Appendix B.3.
As in the proof sketch of Lemma 6.4, we view the short server during an above period as an M/G/1 with server speed 1/2 and work arrival rate ρ_s, so the mean drift of W_s is −(1/2 − ρ_s) = −δ. By standard results for M/G/1 busy periods [16], starting from W_s − t = x, it takes x/δ time in expectation for the above period to end.
• The E[A] bound follows from the fact that at the start of an above period, W_s − t ⩽ c_+, implying E[A] ⩽ c_+/δ.
• The E[A_e] bound follows from the fact that the residual time of an above period is distributed as A_e. But the residual time of an above period is the same as the amount of time until the above period ends, starting from the stationary distribution of W_s − t conditional on being in an above period. The expected time to end is thus the conditional mean of W_s − t divided by δ, so the result follows from Lemma 6.4. □

Analyzing the Long Server
In this section, we bound the differences between E[W_ℓ], E^0_B[W_ℓ(0)], and E[W_ℓ I(W_s = 0)] separately. These bounds will help us upper bound E[W_all], thereby obtaining a bound on E[W_ℓ].
Let p_a and p_b be the probabilities of being in an above or below period, respectively. That is, p_a = E[A]/(E[A] + E[B]) and p_b = E[B]/(E[A] + E[B]), where the expressions in terms of expectations of A and B follow from the renewal-reward theorem.
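The renewal-reward identities above are easy to sanity-check by simulation. A small sketch with arbitrary (exponential) period-length distributions, which are our choice for illustration only:

```python
import random

# Fraction of time spent in "below" periods over many alternating
# below/above cycles, versus the renewal-reward prediction
# p_b = E[B] / (E[A] + E[B]).
rng = random.Random(1)
total_b = total_a = 0.0
for _ in range(200_000):
    total_b += rng.expovariate(1.0)   # below period length, mean E[B] = 1
    total_a += rng.expovariate(2.0)   # above period length, mean E[A] = 0.5
p_b_sim = total_b / (total_b + total_a)
p_b_theory = 1.0 / (1.0 + 0.5)        # = 2/3
print(abs(p_b_sim - p_b_theory) < 0.01)  # True
```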

Lemma 6.6. We have the following bounds on E[W_ℓ] − E^0_B[W_ℓ(0)].
Proof. The long server workload process can be described in terms of the following quantities, where
• Δ_ℓ(0, τ) is the total work processed by the long server in (0, τ],
• Σ_m,ℓ(0, τ) is the total work added to the long server by medium job arrivals in (0, τ], and
• Σ_ℓ,ℓ(0, τ) is the total work added to the long server by large job arrivals in (0, τ].
Applying the Palm inversion formula [3] to W_ℓ gives an expression in which (a) holds since W_ℓ(0), the amount of long server work at time 0, is independent of B + A, the length of the below-above cycle starting at time 0. We now bound E[W_ℓ] − E^0_B[W_ℓ(0)] separately from above and below. To obtain a lower bound, we bound the integrand below by −Δ_ℓ(0, τ), where the resulting step (b) holds because the server completes work at rate 1/2 while it is busy. To obtain an upper bound, we bound the integrand above by Σ_m,ℓ(0, τ) + Σ_ℓ,ℓ(0, τ). We first bound its conditional expectation given B and A. Notice that Σ_m,ℓ(0, τ) + Σ_ℓ,ℓ(0, τ) consists of arrivals of large jobs during (0, τ] and medium jobs during (B, τ]. Neither of these types of arrivals impacts the lengths of the above and below periods, so we obtain (6.7). From (6.7) and a computation similar to the lower bound, we obtain the upper bound.
Combining this with the lower bound, the result follows from (6.3), (6.5), and Cauchy-Schwarz. To complete the proof, we use the AM-GM inequality (arithmetic mean ⩾ geometric mean), then apply our bounds on E[B_e] and E[A_e] from Lemmas 6.3 and 6.5. □ Lemma 6.7.
Proof. Similar to that of Lemma 6.6. See Appendix B.3.
Proof. Applying the Palm inversion formula [3] to W_ℓ I(W_s = 0) yields an integral that we can end at B, because W_s(τ) = 0 only during below periods, which correspond to τ ∈ [0, B). We further expand the right-hand side using (6.6). No medium jobs are dispatched to the long server during below periods, so we obtain an expression in which (a) follows from the independence of W_ℓ(0) and the integral of I(W_s(τ) = 0) over [0, B). To analyze the first term, we observe that it can be bounded via the Palm inversion formula [3] and Theorem 3.2. To analyze the second term, we apply (6.7). The right-hand side is difficult to compute directly due to the dependency of B and W_s. To resolve this, we apply the Palm inversion formula [3] to A_g I(W_s = 0), where A_g(τ) is the age process of the below-above cycle, namely the amount of time since the current cycle began. Thus, to bound T_3, it suffices to bound E[A_g I(W_s = 0)]. By Cauchy-Schwarz, we obtain a bound in which (b) follows because A_g has the equilibrium distribution of the cycle length, and (c) follows from Theorem 3.2. The result then follows from bounding the second moment of the cycle length using (6.3) and Lemma 6.3. □ Lemma 6.9.
Proof. We use Theorem 6.1 to bound E[W_all], which amounts to analyzing E[W_ℓ I(W_s = 0)] and E[W_s I(W_ℓ = 0)]. Combining Lemmas 6.6 and 6.8 and noting E[W_ℓ] ⩽ E[W_all] yields a bound on E[W_ℓ I(W_s = 0)], where (a) follows from W_s ⩽ t + (W_s − t)+, (b) follows from Cauchy-Schwarz, and (c) follows from Lemma 6.4. Combining the bounds on E[W_ℓ I(W_s = 0)] and E[W_s I(W_ℓ = 0)] with Theorem 6.1, we obtain the claimed bound. The result follows after rearranging and simplifying. We use the fact that we have defined the parameters such that t ⩾ c_+ (Sections 2.2 and 2.3), which bounds the leading factor by 2, and, using the fact that δ, ε ⩽ 1/2, we loosely bound the terms carrying a square-root factor.

Bounding Mean Response Time
We now prove Theorem 3.3.
We use these with Lemmas 6.9 and 6.10 to express the right-hand side in terms of λ, δ, t, and c_+, then simplify. We defer the details to Appendix B.3. □ Proof of Theorem 3.3 for k = 2 servers. The bound follows directly from plugging the parameter choices into Theorem 6.11, and comparing with the lower bound in Theorem 3.1 implies heavy-traffic optimality. But the main question is why these are the right ways to set the parameters.
If we set t = Θ(γ c_+) for fixed γ, the only expression in Theorem 6.11 that is increasing as a function of γ is a term of order Θ(log(1/δ)). We thus ignore such logarithmic factors when determining c_+ and δ. One can check at the end that γ ⩾ 3 suffices.
Observe that we want δ ↓ 0 to ensure the multiplier of E[W_M/G/1] approaches c_CARD. If we substitute t = γ c_+ into Theorem 6.11, then for any fixed δ, the resulting expression is a decreasing function of c_+, so we set c_+ = Θ(1). With this choice, the largest terms from the maximum in Theorem 6.11 are Θ(√(δ/ε)) and Θ((1/δ) log(1/δ)), which are balanced by δ = Θ(ε^{1/3} (log(1/ε))^{2/3}). □

SIMULATIONS
We have established the optimality of CARD as load approaches capacity.In this section, we investigate the performance of CARD in moderate traffic via simulations.We aim to provide insights into the following questions with our simulations.
• How good is CARD's performance compared with other dispatching policies in the literature?
• Are there simple modifications of CARD that exhibit better performance in practice?
• CARD has three tunable parameters: c_-, c_+, and t. The recipe provided in Theorem 3.3 is optimal in heavy traffic, but are there rules of thumb that work well beyond heavy traffic? How sensitive is CARD's performance to these parameters?
In all of our simulations, we consider three benchmark policies: LWL, SITA-E [17], and Dice [26]. Roughly, Dice lets the server with the least work pick small jobs from the arrival stream, leaving the large jobs for servers with more work. We refer interested readers to Appendix A for details. Of course, there are many more dispatching policies. We pick LWL and SITA-E because they are extensively studied, and we pick Dice because, among all heuristics for size- and state-aware dispatching, it has the best empirical performance at high load [26].
Our simulations include job size distributions with exponential and heavier tails. Heavy-tailed distributions are common in computer systems and networks (e.g. [38]), and the high mean response times they incur make a good dispatching policy essential. Throughout this section, we consider three Weibull distributions with mean 1 and coefficients of variation (cv) 1, 10, and 100. We simulate 40 trials for each data point, with 10^7 arrivals per trial for cv = 1 and cv = 10, and 3 × 10^7 arrivals per trial for cv = 100. We show 95% confidence intervals when wider than the marker size.
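A Weibull distribution with a given mean and cv can be pinned down by solving for the shape parameter numerically, since cv² = Γ(1 + 2/k)/Γ(1 + 1/k)² − 1 is decreasing in the shape k. A sketch of how one might set up such distributions (the helper is ours, not from the paper):

```python
import math

def weibull_params(mean, cv):
    # Find Weibull shape k and scale by bisection on
    # cv^2 = Gamma(1 + 2/k) / Gamma(1 + 1/k)^2 - 1, which decreases in k.
    def cv2(k):
        return math.gamma(1 + 2 / k) / math.gamma(1 + 1 / k) ** 2 - 1
    lo, hi = 0.05, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if cv2(mid) > cv * cv:
            lo = mid       # cv too large: increase shape
        else:
            hi = mid
    k = (lo + hi) / 2
    scale = mean / math.gamma(1 + 1 / k)   # fix the mean
    return k, scale

k, scale = weibull_params(1.0, 1.0)
print(round(k, 3), round(scale, 3))  # cv = 1 recovers the exponential: 1.0 1.0
```

Samples can then be drawn with `random.weibullvariate(scale, k)`; larger cv corresponds to smaller shape, i.e. a heavier tail.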

Performance of CARD with Two Servers
Although CARD as introduced in Section 2.2 is heavy-traffic optimal, we can improve its performance under moderate traffic with one small modification: instead of statically deciding which server is short and which is long, dynamically treat whichever server has less work as the short server.We call this variant Flexible CARD, and call the original version Rigid CARD to disambiguate.
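To make the rigid/flexible distinction concrete, here is a hedged sketch of the two-server dispatch rule (function and parameter names are ours; c_minus, c_plus, and t play the roles of the cutoffs c_-, c_+ and the threshold t from Section 2.2):

```python
def card_dispatch(size, works, c_minus, c_plus, t, flexible=True):
    """Return the index of the queue to receive the job (two servers).

    With flexible=True, whichever server currently has less work plays
    the short role; with flexible=False, server 0 is permanently short.
    """
    short = min((0, 1), key=lambda i: works[i]) if flexible else 0
    long_ = 1 - short
    if size < c_minus:           # small job: always to the short server
        return short
    if size >= c_plus:           # large job: always to the long server
        return long_
    # medium job: short server iff the short server's work is at most t
    return short if works[short] <= t else long_

# Example: a medium job arriving while the short server exceeds t
print(card_dispatch(1.0, [3.0, 5.0], 0.5, 2.0, 2.5))  # 1
```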
Figure 7.1 shows that both CARD versions significantly outperform LWL and SITA-E, especially at high loads and with large coefficients of variation. For instance, with cv = 100 and ρ = 0.98, CARD gives a 93% reduction in mean response time compared to LWL, and a 61% reduction compared to SITA-E. Flexible CARD is also almost tied with Dice at all loads simulated.

Calibrating the Parameters of Two-Server CARD
We now discuss how to calibrate the parameters c_-, c_+, and t. In practice, c_- and c_+ as prescribed in Theorem 3.3 are difficult to calibrate directly, because their appropriate ranges change as the load ρ increases. Therefore, we consider instead normalized parameters c_-′ and c_+′. Adjusting c_-′ can therefore be understood as adjusting the fraction of small jobs, and adjusting c_+′ can be understood as adjusting the fraction of large jobs.
After trying a few strategies for scaling t as a function of ρ, we found that thresholds of the form t = c (1/√ε) log(1/ε), where ε = 1 − ρ and c depends on the distribution, yield decent performance. In general, for the three job size distributions we consider, mean response time under flexible CARD is not very sensitive to these parameters near the optima (see Figure 7.2). Any choice of parameters not too far from the optima yields decent performance. We found that c_-′ = c_+′ = 0.15 for all three distributions and c = 0.3, 0.6, 2.5 for cv = 1, 10, and 100, respectively, lead to decent performance. These are also the parameters we used in Figure 7.1.

Improving CARD's Performance for More than Two Servers
As the number of servers increases, flexible CARD with three parameters (t, c_-′, and c_+′) no longer performs well for distributions with large coefficients of variation (Figure C.1). Therefore, we propose another variant of CARD for k servers called multi-band CARD. We first present the general dispatching rules, then explain how multi-band rigid and flexible CARD are defined.
• We divide the job size distribution into k + 1 intervals such that each interval amounts to 1/k of the total load, except for the first and last intervals, each of which amounts to 1/(2k) of the total load. Denote the endpoints of these intervals by 0, x_1, . . ., x_k, ∞.
• Each server i except the last one has a threshold t_i, which can be different for different servers.
When a job of size x arrives, it is dispatched according to the following general rules. Multi-band rigid CARD numbers the servers 1 to k and dispatches according to the rules outlined above; server numbers do not change under rigid CARD. Multi-band flexible CARD, on the other hand, sorts the servers in increasing order of work when a job arrives, so that server 1 has the least work, and then dispatches according to the general rules. Since all the x_i's are fixed for each distribution, the tunable parameters are the t_i's. Our experiments show that we achieve good performance by setting t_i = t_1/√i. As we can see in Figure 7.3, the multi-band CARD variants significantly outperform LWL and SITA-E at high loads and for job sizes with large coefficients of variation. When the job size distribution has cv = 10, at ρ = 0.98, mean response time under multi-band flexible CARD is ∼22% and ∼19% of the mean response times under LWL and SITA-E, respectively. When the job size distribution has cv = 100, at ρ = 0.98, it is ∼4% and ∼21% of the mean response times under LWL and SITA-E, respectively. Moreover, multi-band flexible CARD is almost tied with Dice at all loads simulated.
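The load-based banding above can be computed empirically from size samples. A hedged sketch (our helper, not from the paper) that returns cutoffs x_1, …, x_k such that the first and last bands each carry 1/(2k) of the load and the middle bands 1/k each:

```python
def band_endpoints(sizes, k):
    # Size cutoffs x_1 < ... < x_k splitting total load (sum of sizes)
    # into k+1 bands: the first and last carry 1/(2k) of the load each,
    # the middle k-1 bands carry 1/k each. Empirical version from samples.
    xs = sorted(sizes)
    total = sum(xs)
    targets = [total * (0.5 + i) / k for i in range(k)]  # cumulative loads
    cuts, acc, j = [], 0.0, 0
    for x in xs:
        acc += x
        while j < len(targets) and acc >= targets[j]:
            cuts.append(x)
            j += 1
    return cuts  # bands are [0, x_1), [x_1, x_2), ..., [x_k, infinity)

sizes = [1, 1, 1, 1, 2, 2, 4, 4]  # total load 16
print(band_endpoints(sizes, 2))   # cumulative targets 4 and 12 -> [1, 4]
```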

Tail Simulations
Although our paper focuses exclusively on mean response time analysis, metrics based on the tail of response time are often of interest in practice.As such, in this section, we conduct some simulations comparing the response time tails of CARD against the benchmark policies and provide some insights into the results.
We focus on a two-server system. The parameters of rigid CARD, flexible CARD, and Dice are the same as those in Section 7.2. The results are presented in Figure 7.4. We can see that for light-tailed job size distributions, the tails of LWL and M/G/1/FCFS are better than those of rigid and flexible CARD and Dice. On the other hand, for heavy-tailed job size distributions, the tails of rigid and flexible CARD and Dice are far better than those of LWL and M/G/1/FCFS up to the 99th percentile of flexible CARD's response time.
This result is not surprising. CARD and Dice both starve large jobs by making them wait in a long queue. For light-tailed job size distributions, although giving a little priority to small jobs improves tail performance [15], starving large jobs in general only worsens tail performance [54]. For heavy-tailed job size distributions, however, starving large jobs improves tail performance tremendously [54]. As shown in Figure 7.4, CARD and Dice significantly outperform LWL and M/G/1/FCFS up to the 99th percentile of flexible CARD's response time for heavy-tailed job size distributions.

Comparing CARD to Dice
Given the excellent performance of Dice, we feel Dice warrants a more in-depth discussion. We refer interested readers to Appendix A, where we discuss the following questions:
• How complicated is Dice compared with CARD?
• Why does Dice perform so well in simulations?
• Is Dice heavy-traffic optimal?
• Why is it hard to analyze Dice?
The main takeaway is that Dice may not be heavy-traffic optimal as is, but it may be possible to modify Dice to make it so. Dice remains a compelling option in practice that certainly deserves further study.

CONCLUSION
In this paper, we prove the first mean response time lower bound for size- and state-aware dispatching to FCFS servers. We design a new dispatching policy, called CARD (Controlled Asymmetry Reduces Delay), and show that it is heavy-traffic optimal, making CARD the first provably heavy-traffic optimal size- and state-aware dispatching policy. CARD can thus serve as a new benchmark policy for future work on dispatching and load balancing for FCFS servers. Methodologically, our analysis of CARD via below-above cycles could be of independent interest, as it can be adapted to study other threshold-based policies. Underlying our results is the insight that in the size-aware dispatching setting, it is helpful to maintain a significant imbalance between the amounts of work at the servers. This insight has been made multiple times throughout the size-aware dispatching literature (e.g. Harchol-Balter et al. [17], Hyytiä et al. [24]). It is in contrast to the natural idea of always balancing the queues, which is helpful in size-oblivious dispatching [60].
In addition to minimizing mean response time, researchers today are also interested in tail performance. We conjecture that for job sizes with exponential tails and mean 1/μ, work under CARD asymptotically decays exponentially at a rate that is worse (i.e. smaller) than the decay rate of an M/M/1 under FCFS. An interesting follow-up question is how to balance the tradeoff between mean response time and response time decay rate, perhaps starting with the heavy-traffic regime. We leave this to future work.

Let c_π denote the heavy-traffic constant of policy π. In Theorem B.1 below, we show that c_LWL and c_SITA-E are strictly greater than c_CARD as defined in (2.1).
However, this does not finish the story for SITA, because while SITA-E is the version of SITA that splits the load equally, it is possible to improve SITA's performance by using an unequal load split. The version of SITA that uses the optimal load split is known as SITA-O. Surprisingly, SITA-O can have a significantly better heavy-traffic constant than SITA-E, even though there is very little flexibility in the amount of load each server can receive. In Theorem B.2, we sketch a computation of c_SITA-O, showing that it, too, is strictly greater than c_CARD.
For simplicity of computation, we focus on k = 2 servers, but the results should generalize to more servers. Similarly, while we continue to assume a continuous job size distribution S (Section 2.1) for simplicity of defining SITA, the results should hold for any job size distribution for which the relevant cutoffs are well-defined. Theorem B.1. In a system with k = 2 servers and a continuous job size distribution S, we have c_LWL > c_CARD and c_SITA-E > c_CARD. Proof. It is known that c_LWL = 1 (see e.g. [53]). To compute c_SITA-E, we first note that under SITA-E, the system decouples into two independent M/G/1 queues. For any ε ∈ (0, 1), we have (B.1), where (a) follows from the fact that P[S < c] ⩾ P[S ⩾ c]. Taking the ε ↓ 0 limit, we have c_SITA-E ⩾ 2 c_CARD. □ Theorem B.2. In a system with k = 2 servers and a continuous job size distribution S, we have c_SITA-O > c_CARD.
Proof sketch. SITA-O works like SITA-E, except instead of using the size threshold c to split jobs between the servers, it uses a different size threshold c′. The key insight is that in the ε ↓ 0 limit, we must have c′ − c ⩽ O(ε), because otherwise we would overload one of the servers. This means, roughly speaking, that SITA-O can affect the denominators in (B.1), but it cannot significantly affect the numerators. Specifically, there exists γ ∈ (−1, 1) parameterizing the load split. Optimizing over the value of γ yields the claimed bound, where (a) follows from the key insight above. □

Stability
As outlined in Section 5, we begin by showing that the short server is stable for any threshold t ⩾ 0.
Our main tool is a continuous-time Foster-Lyapunov theorem developed in Meyn and Tweedie [41]. A key component of the theorem is the infinitesimal generator of a Markov process. Let X(τ) be a Markov process; its infinitesimal generator G is the operator defined by the usual limit of the conditional expected rate of change, and the domain of G is the set of all functions f for which this limit exists at every point of the state space. Since the work at the short server, W_s(τ), is a Markov process, for a function f with a left derivative we may explicitly derive the infinitesimal generator of W_s(τ) under CARD. We now present the continuous-time Foster-Lyapunov theorem [41, Theorem 4.4] below for easy reference. Theorem B.3. Suppose that a Markov process Φ is a non-explosive right process, and that there exist constants b, c > 0, a function V ⩾ 1, a closed petite set C, and a function g ⩾ 0 that is bounded on C satisfying the stated drift condition on each truncation. Here, {O_n} is a family of precompact sets that increases to the entire state space as n → ∞, and G_n is the generator of the truncated process restricted to O_n. This restriction is in place mainly to handle possibly explosive processes. Our process W(τ) is not explosive. More importantly, the Lyapunov function V we consider in Lemma 5.1 is increasing and differentiable, so G_n V(w) ⩽ G V(w) for all n. Proof. Define f(W_s) = exp(θ(t − W_s)+) and fix some θ > 0. Since W_s has a stationary distribution, we can apply the rate conservation law [42] to f(W_s). We first analyze the second term on the left-hand side by conditioning on the state W_s; choosing θ ⩾ 0 appropriately yields the claimed bound. (b) If the average rate at which work arrives to the long server is less than its service rate 1/k, then the system is stable. Specifically, the state (0, . . ., 0) is positive recurrent for the process W(τ) = (W_1(τ), . . ., W_k(τ)).
Proof. Part (a) is a corollary of Lemma 5.3. For part (b), we first establish the result for k = 2 servers, then show how it generalizes to k > 2 servers. For k = 2, we denote the state by W(τ) = (W_s(τ), W_ℓ(τ)).
To establish (b), we first apply Condition B2. Let π_s be the stationary distribution of W_s(τ). By PASTA, π_s is also the stationary distribution of the embedded chain {W_s(τ_n−)} observed just before arrivals. We then have a drift bound in which (a) comes from Lemma 6.10 and PASTA. Since {W(τ_n−)} is positive Harris recurrent and easily seen to be {(0, 0)}-irreducible, the expected number of steps until returning to (0, 0) is finite from any starting state. The time between steps is exponentially distributed with mean 1/λ, so we conclude from Wald's equation that the expected return time of the original process W(τ) to the state (0, 0) is also finite. Positive Harris recurrence of W(τ) immediately follows. We now generalize the above proof to k > 2 servers. To begin with, we define the vector-valued process W_short(τ) = (W_1(τ), . . ., W_{k−1}(τ)) of work at the short servers. Under multiserver CARD, W_short(τ) has the following properties:
• W_short(τ) is a Markov process of its own and is Harris ergodic.
• The short servers are i.i.d. in steady state, so the stationary distribution of W_short is the product of the stationary distributions of the short servers in isolation.
With these two properties in hand, the argument for k = 2 servers as presented above works for k > 2 servers with the same functions h and f_2 and a suitably modified Lyapunov function f. Proof. As stated in the proof sketch, (W_s − t | W_s > t) has the same distribution as the work in an M/G/1 with vacations.
• The job size distribution is (S | S < c_-), the distribution of small job sizes. In particular, using the fact that this distribution is stochastically dominated by the constant c_+, one can show that its excess is stochastically dominated by a uniform distribution on [0, c_+].
• The load is 1 − 2δ, and so the slackness is 2δ.
- The reason we use 2δ instead of δ is that the server operates at speed 1/2. By "doubling the clock speed," the server speed becomes 1 while the distribution of (W_s − t | W_s > t) is unaffected. This makes it easy to apply standard results about the M/G/1 with vacations.
• Let V denote the vacation length distribution. It is hard to characterize V exactly, but because W_s − t ⩽ c_+ at the start of an above period, the excess V_e is stochastically dominated by a uniform distribution on [0, c_+].
The desired bounds follow from the work decomposition formula for the M/G/1 with vacations [13]. Specifically, for an M/G/1 with vacations, the steady-state work W_M/G/1/vac can be written as an independent sum of random variables with distributions W_M/G/1 and V_e. Applying the Pollaczek-Khinchine (PK) formula with the relevant parameters then gives the bound.
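The decomposition says E[W_M/G/1/vac] = E[W_M/G/1] + E[V_e], and combining it with E[V_e] = E[V²]/(2E[V]) gives a one-line mean-work computation. A small sketch (function name is ours):

```python
def mg1_vacation_mean_work(lam, es, es2, ev, ev2):
    # Work decomposition for the M/G/1 with vacations:
    # E[W_vac] = E[W_M/G/1] + E[V_e].
    rho = lam * es
    pk = lam * es2 / (2 * (1 - rho))   # E[W_M/G/1], Pollaczek-Khinchine
    excess = ev2 / (2 * ev)            # E[V_e], mean vacation excess
    return pk + excess

# When vacations are bounded by c_plus, the excess term is at most
# c_plus / 2, the mean of the dominating Uniform[0, c_plus] noted above.
print(mg1_vacation_mean_work(0.5, 1.0, 2.0, 1.0, 2.0))  # 1.0 + 1.0 = 2.0
```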
The result then follows. Proof. The proof is very similar to that of Lemma 6.6, so we give only the key steps. Applying the Palm inversion formula [3] to W_ℓ I(W_s > t) yields an integral that we can start at B, removing the indicator, because W_s(τ) > t exactly during above periods, which correspond to τ ∈ [B, B + A). Expanding this using (6.6) and noting the independence of W_ℓ(0) from the below-above cycle, we obtain an identity. Applying (6.5) to its left-hand side, we see it suffices to bound the right-hand side. The same reasoning as in the proof of Lemma 6.6 yields the bound.
The result then follows from a computation similar to the end of the proof of Lemma 6.6. □ Theorem 6.11. In a system with k = 2 servers, if δ ⩽ ε < 1/2 and γ ⩾ 2, then by setting the remaining parameters according to Theorem 3.2, CARD achieves the stated mean response time bound. Proof. Consider a tagged job arriving to the system, and recall (6.1). We now bound the work expectations and probabilities in the last line, using Lemma 6.10 to express as much as possible of the right-hand side in terms of λ, δ, t, and c_+. After some simplification, including using the preconditions of the theorem, we obtain the bound. The proof for more servers is similar to that of Lemma 6.10. Note that p_a and p_b are the same for all short servers because the short servers are i.i.d. in steady state. Lemmas 6.3 and 6.5 follow from the same arguments as in the two-server case. We would like to obtain a counterpart of Lemma 6.9. To this end, we use a multi-server version of Theorem 6.1.

C ADDITIONAL SIMULATIONS
Our additional simulations apply flexible CARD with three parameters to k = 10 servers. We simulate 40 trials for each data point, with 10^7 arrivals per trial for cv = 1 and cv = 10, and 3 × 10^7 arrivals per trial for cv = 100. We show 95% confidence intervals when wider than the marker size. Figure C.1 shows that, for k = 10 servers, flexible CARD has decent performance when the coefficient of variation is small. However, for large coefficients of variation, flexible CARD does not perform well, even if we use LWL to dispatch small and medium jobs among the short servers. Specifically, when cv = 10, flexible CARD deviates from Dice, although it is still better than LWL and SITA-E. When cv = 100, flexible CARD performs worse than SITA-E at high loads. The unsatisfactory performance of flexible CARD for k = 10 servers motivates us to design multi-band CARD.

Fig. 1.1. Sketch of the CARD policy for two servers. Small and large jobs are always dispatched to the short or long server, respectively. Medium jobs are dispatched based on whether W_s, the amount of work at the short server, exceeds a threshold t. The size cutoffs c_- and c_+ are chosen so that small and large jobs each constitute slightly less than half the load.

Fig. 1.2. Mean response time as a function of load for several policies, including two versions of CARD. Rigid CARD is the version we analyze theoretically, while Flexible CARD is modified slightly to improve empirical performance. The job size distribution has coefficient of variation cv = 10. See Section 7 and Figure 7.1(b) for further details.

Fig. 7.2. For each of the three plots, we fix two parameters and vary one parameter across a range of values. The size distribution simulated has cv = 10, and the load is fixed at ρ = 0.8.
It now follows from Foss et al. [10, Theorem 1] that the embedded pre-jump chain {W(τ_n−)} is positive Harris recurrent.

At this point, we note that the upper bound for E[T_CARD] − E[S] is, after setting k = 2, the same as that in Theorem 6.11, except for the term T. Thus, setting c_+ = Θ(1) and δ as before, so that T → 0 as ε ↓ 0, we conclude that the bound yields the same heavy-traffic scaling as that in Theorem 6.11. Finally, we note that c_CARD emerges because, in the ε ↓ 0 limit, the fraction of load sent to the long server approaches P[S > c] = c_CARD. □

Fig. C.1. Plots of the mean response times under the aforementioned policies for k = 10 servers. The top row shows the mean response times of the policies. The bottom row shows mean response times normalized by the mean response time of a resource-pooled M/G/1 queue. We use LWL, instead of random, dispatching to short servers when a small or medium job arrives.
• A medium job, namely one with size in [c_-, c_+), is dispatched depending on W_s at the time of arrival. If W_s ⩽ t, it is sent to the short server, and if W_s > t, it is sent to the long server.
• A large job, namely one with size in [c_+, ∞), is always dispatched to the long server.