Server-Side Stepsizes and Sampling Without Replacement Provably Help in Federated Optimization

We present a theoretical study of server-side optimization in federated learning. Our results are the first to show that the widely popular heuristic of scaling the client updates with an extra parameter is very useful in the context of Federated Averaging (FedAvg) with local passes over the client data. Each local pass is performed without replacement using Random Reshuffling, which is a key reason we can show improved complexities. In particular, we prove that whenever the local stepsizes are small, and the update direction is given by FedAvg in conjunction with Random Reshuffling over all clients, one can take a big leap in the obtained direction and improve rates for convex, strongly convex, and non-convex objectives. For instance, in the non-convex regime we improve the rate of convergence from O(ε^{-3}) to O(ε^{-2}). This result is new even for Random Reshuffling performed on a single node. In contrast, if the local stepsizes are large, we prove that the noise of client sampling can be controlled by using a small server-side stepsize. To the best of our knowledge, this is the first time that local steps provably help to overcome the communication bottleneck. Together, our results on the advantages of large and small server-side stepsizes give a formal justification for the practice of adaptive server-side optimization in federated learning. Moreover, we consider a variant of our algorithm that supports partial client participation, which makes the method more practical.


Introduction
The unprecedented industrial success of modern machine learning techniques, tools and models can to a large degree be attributed to the abundance of data available for training. Indeed, the most popular and best performing deep learning models rely on a very large number of parameters, and in order to generalize well, need to be trained using optimization algorithms over very large training datasets. Other things equal, the more data we have, the better. A key driving force behind the proliferation of such data is the massive digitization of society over the last few decades. People have access to increasingly more elaborate personal and home smart devices capable of generating, capturing and processing data such as text, images and videos. Similarly, in the sphere of governments and corporations, much of what used to be done through a physical exchange (e.g., via paper/fax/letter) is now performed in a digital form, generating treasure troves of potentially useful data. For example, hospitals collect, store and make use of a variety of patient data, ranging from routine bodily functions to PET scans and genome sequencing.

Federated learning
The traditional way of learning from this data is to collect it in a single (and often proprietary) data center, where it is subsequently processed using modern machine learning algorithms. However, due to several considerations which keep gaining in importance, such as energy efficiency and privacy, it is often desirable to avoid centralized training altogether, and instead perform the training without the data ever leaving the clients' secure sites. Introduced in 2016 by Konečný et al. (2016a;b) and McMahan et al. (2017), this is precisely the promise and subject of study of federated learning (FL). In other words, federated learning means efficient machine learning over data stored in a distributed fashion across a network of heterogeneous clients (e.g., mobile phones, smart devices, companies) that captured and own the data, using these clients' machines/devices not only as data sources, but also as computers that contribute to the training.

Problem formulation
We consider the standard optimization formulation of federated learning:

min_{x ∈ R^d} f(x) := (1/M) Σ_{m=1}^M f_m(x),   (1)

where M is the total number of clients, x ∈ R^d represents the parameters of the model we wish to train, and f_m is the loss of the model on the data stored on client m. Since the training dataset on each client is necessarily finite, we assume that f_m has the finite-sum structure

f_m(x) := (1/n) Σ_{i=1}^n f_m^i(x),   (2)

where f_m^i : R^d → R is the loss of model x on training example i ∈ [n] := {1, 2, ..., n} stored on client m. We assume that the functions f_m^i are differentiable, and consider the strongly convex, convex and non-convex regimes.

Ingredients of successful federated learning methods
Practical considerations of federated learning systems and vast experimental evidence accrued over the last few years point to several design constraints and algorithmic ingredients which have proved useful in the context of federated learning methods for solving (1)-(2). We now very briefly outline some of them. More details can be found in the appendix, where we review related work.
Partial participation. In federated learning, training is performed through several communication rounds, in each of which an orchestrating server chooses a cohort of clients that will be participating in the training process in that round. This practice is known as partial participation, and is necessary due to practical considerations and limitations, such as limited server capacity and limited client availability (Kairouz et al., 2021). However, partial participation can be useful also due to the diminishing returns one gets as the number of participating clients grows (Charles et al., 2021). Partial participation is a necessity in the cross-device regime where the training is performed over a very large number of clients (i.e., M is very large), most of which will only participate in the entire training procedure at most once. Sampling of clients to form a cohort can be done adaptively so as to choose the most informative clients (Chen et al., 2020).
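Concretely, choosing a cohort uniformly at random from all subsets of a given size can be sketched in a few lines (a minimal illustration; `sample_cohort` and all parameter names are ours, not from the paper):

```python
import random

def sample_cohort(num_clients: int, cohort_size: int, rng: random.Random) -> list:
    """Sample a cohort uniformly at random from all subsets of the given size.

    Plain uniform sampling without replacement over client indices; adaptive
    client-selection schemes would replace this routine.
    """
    return sorted(rng.sample(range(num_clients), cohort_size))

rng = random.Random(0)
cohort = sample_cohort(num_clients=100, cohort_size=10, rng=rng)
```

Each call returns a fresh cohort, mirroring the per-round sampling described above.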
Local training. At the beginning of each communication round, each client in the cohort is provided with the latest model by the orchestrating server, which is used as a starting point for local training. Local training refers to the common practice in FL of each client performing several steps of a suitably chosen local optimization procedure, such as one of the many variants of SGD, using its own local training data. Perhaps the simplest approach is to perform a single local GD iteration.
If the model updates are simply aggregated by the server, then the resulting method can be seen as Minibatch SGD, where the minibatches correspond to the cohorts. However, it is typically more efficient to perform multiple local steps (McMahan et al., 2017), and to use local optimizers that rely on incremental data processing, such as SGD.
Data shuffling. Typically, the local training dataset is processed once or several times in an incremental fashion; that is, one data point (or one small minibatch) at a time. However, experimental evidence shows that processing the local data without replacement can lead to substantially better results than processing the data with replacement. In particular, processing the local training data in an order dictated by a random permutation—a technique known as Random Reshuffling (RR)—is often set as default in modern deep learning and federated learning software (Bottou, 2009; Bengio, 2012; Sun, 2020). This is in sharp contrast with the with-replacement sampling of data employed by SGD. With-replacement sampling ensures that the gradient updates are unbiased, and this simplifies the analysis. For this reason, SGD is significantly better understood in theory than its better performing but much more poorly understood cousin RR. However, recent results of Mishchenko et al. (2020), and extensions due to Mishchenko et al. (2021) and Yun et al. (2021) to distributed training, show that RR can have clear theoretical advantages over SGD.
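The distinction between the two sampling modes can be made concrete (a small sketch; the function names are ours): with-replacement sampling may visit some data points several times and miss others within an epoch, while Random Reshuffling visits every point exactly once.

```python
import random

def with_replacement_epoch(n: int, rng: random.Random) -> list:
    # SGD-style sampling: each index is drawn independently; duplicates are likely.
    return [rng.randrange(n) for _ in range(n)]

def random_reshuffling_epoch(n: int, rng: random.Random) -> list:
    # RR: a fresh random permutation; every index appears exactly once.
    perm = list(range(n))
    rng.shuffle(perm)
    return perm

rng = random.Random(42)
rr = random_reshuffling_epoch(8, rng)
assert sorted(rr) == list(range(8))  # each point visited exactly once
```

The Shuffle-Once variant mentioned later simply reuses one such permutation across all epochs instead of resampling it.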
Server stepsizes. Once local training is finished, the clients in the cohort send their models or model updates to the orchestrating server, which typically aggregates them via averaging. This information is then used to perform server-side optimization. The simplest approach is to do nothing; that is, to treat the aggregated models as the next global model that is broadcast to the new cohort in the next communication round. However, empirical evidence suggests that it is better to aggregate model updates, and treat them as gradient-type information which can be injected into a suitably chosen server-side optimization routine (Karimireddy et al., 2020). For example, the server may run one step of GD using the aggregated model update as a proxy for the gradient, which is not available, with its own server-side stepsize.
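The server-side GD step described above can be sketched as follows (a minimal illustration with names of our choosing; the aggregated update directions stand in for the unavailable gradient):

```python
import numpy as np

def server_step(x: np.ndarray, client_updates: list, eta: float) -> np.ndarray:
    """Aggregate client update directions and take one server-side GD step.

    client_updates: per-client update directions (pseudo-gradients).
    eta: server stepsize applied to the averaged direction.
    """
    g = np.mean(client_updates, axis=0)  # treat the average as a pseudo-gradient
    return x - eta * g

x = np.zeros(2)
updates = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
x_new = server_step(x, updates, eta=1.0)  # moves against the averaged direction
```

Setting eta so that the step exactly equals the averaged model update recovers plain FedAvg; any other value implements a genuine server stepsize.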

Summary of Contributions
Despite the fact that partial participation, local training, data shuffling and server stepsizes have all been empirically found to be very useful building blocks of FL methods, most of these techniques are not very well understood in theory, even in isolation. Informally speaking, and at the risk of oversimplifying the current state of affairs, we know virtually nothing about server stepsizes, very little about data shuffling, relatively much more about local training, and quite a bit, but still "not enough", about partial participation.
The key focus of this paper is to make a substantial advance in the current theoretical understanding of server stepsizes in the context of realistic federated learning.
In order to theoretically understand the server stepsize phenomenon in a realistic context of techniques commonly used in FL, we study this phenomenon together with data shuffling, local training and partial participation. While this makes the analysis substantially harder and different from all existing analyses of FedAvg, we believe it is important to do so, as this will highlight the interplay between these algorithmic techniques and their combined impact on training.
A brief visual summary of this in the context of selected existing methods is provided in Table 2. We summarize our contributions as follows: • New algorithm. We design a new algorithm, for which we coin the name Nastya (Algorithm 1; see Section 4), which combines all of the aforementioned practical tricks and techniques in a single method: partial participation, local training, data shuffling and, most importantly, server stepsizes. In our method, in each communication round t, the cohort is chosen as a random subset S_t of the set {1, 2, ..., M} of clients of cardinality 1 ≤ C ≤ M, chosen uniformly from all subsets of cardinality C. Each device performs local training via a single pass of incremental GD with client stepsize γ > 0 over the local training data points, in an order dictated by a random permutation. We allow for two options: i) either the random permutation for all clients is sampled just once and used in all communication rounds (Shuffle-Once option), or ii) the random permutation is sampled afresh at the start of each communication round (Random-Reshuffling option). At the end of local training, the updated models are communicated back to the server, which uses these updates to form a gradient estimator, and applies one step of GD using a server stepsize η > 0 with this estimator in lieu of the true gradient. The new model is then broadcast to a new cohort in the next communication round, and the process is repeated.
• Complexity analysis. We provide a strong complexity analysis of our new algorithm for strongly convex (Theorem 1), convex (Theorem 2) and non-convex (Theorem 3) functions; see Table 3. This is the first theory for a variant of FedAvg that combines the benefits of partial participation, data shuffling, local training and, most importantly, also server stepsizes. Indeed, with a couple of exceptions only (Karimireddy et al., 2020; Woodworth et al., 2020), there are no prior theoretical works analyzing the effect of server stepsizes in FL. The methods in the aforementioned works use local training and partial participation, but do not use data shuffling, and are significantly different from ours.
• Small client stepsizes, large server stepsizes, and no need for drift reduction. In particular, Theorems 1, 2 and 3, covering the strongly convex, convex and non-convex regimes, respectively, suggest that the server can use the large O(1/L) stepsize, where L is the Lipschitz constant of the gradient of f. In the strongly convex and convex regimes, based on our theory, it is optimal for the client stepsize γ to be small, which completely eliminates the second of the three terms in the complexity bounds (see the third column of Table 3), which controls the price one pays due to data heterogeneity. Indeed, our theory allows for the client stepsize γ to be small while the server stepsize η can be large (see the second column of Table 3).
Note that in all three regimes, and thanks to the fact that we employ a data shuffling strategy, this second term depends on the square γ² of the client stepsize, which means that we can make this term small without making the client stepsizes infinitesimal. So, thanks to Nastya's use of data shuffling strategies, it does not require any explicit drift reduction technique such as SCAFFOLD to handle data heterogeneity (Karimireddy et al., 2020). (Notes to Table 2: (1) the analysis is done under the bounded variance assumption: g_i(x) := ∇f_i(x; ζ_i) is an unbiased stochastic gradient of f_i with bounded variance E_{ζ_i}‖g_i(x) − ∇f_i(x)‖² ≤ σ² for any i, x; (2) the Õ notation omits log(1/ε) factors.)
• Small server stepsizes can be beneficial. To the best of our knowledge, no prior theoretical work suggests that it might be beneficial to use small server stepsizes. Our results (see Theorem 5) suggest that this can be the case when each f_m^i is strongly convex and smooth, and when the strong convexity parameter is very small.
• Experimental validation of our theoretical predictions. We provide an experimental examination of Nastya and compare it with selected benchmarks. Our goal is not to perform large-scale experiments and claim empirical superiority, because the algorithmic ingredients embedded in Nastya are already being used in practical FL methods precisely because they have already been empirically found to be useful. This allows us to focus on simple experiments which test the predictions of our theory.
Our experimental results confirm our theory, and illustrate the behavior of the methods we test in various settings. Moreover, we go beyond the theory and conduct additional experiments with the adaptive stepsize strategy introduced by Malitsky and Mishchenko (2020). Inspired by Reddi et al. (2020), we additionally utilize several server-side optimization subroutines on top of the local updates.

Preliminaries
In this section we introduce several key concepts that will help us to formulate our theoretical results.

Convexity and smoothness
In all our theoretical results we rely on smoothness, and in some we require convexity or strong convexity.
Definition 2 (Convexity and strong convexity). A function φ : R^d → R is convex if φ(y) ≥ φ(x) + ⟨∇φ(x), y − x⟩ for all x, y ∈ R^d, and µ-strongly convex for µ > 0 if φ(y) ≥ φ(x) + ⟨∇φ(x), y − x⟩ + (µ/2)‖y − x‖² for all x, y ∈ R^d.
Figure caption: (a) In the case of small client stepsizes γ, the average of the local steps is not large, but at the same time the variance is small and the direction is close to the direction of the full gradient, which allows us to go further in this direction by employing a large server stepsize η. (b) In the case of large client stepsizes γ, each client step contributes to the global step, but the variance grows as well, so it is useful to use a smaller server stepsize η to reduce this variance. These intuitions are confirmed by our theory.
In our analysis we use the following assumption.

Measures of data heterogeneity
While our theory does not require any assumptions on data homogeneity, our results will reflect the degree to which the data are heterogeneous, and are better for data that are "more" homogeneous. In particular, in the strongly convex and convex regimes we rely on the following notions.
Definition 3 (Variance at the optimum). The variance of the gradients {∇f_m}_{m=1}^M at x* is defined as

σ²_* := (1/M) Σ_{m=1}^M ‖∇f_m(x*)‖²,

where x* is a minimizer of f. The variance of the gradients {∇f_m^i}_{i=1}^n of client m at x* is defined as

σ²_{*,m} := (1/n) Σ_{i=1}^n ‖∇f_m^i(x*) − ∇f_m(x*)‖².

An important lemma that allows us to obtain a strong upper bound for the variance in the case of sampling without replacement, which our data shuffling methods rely on, was formulated by Mishchenko et al. (2020). We include it here for completeness.
Lemma 1 (Sampling without replacement). Let X_1, ..., X_n ∈ R^d be fixed vectors, X̄ := (1/n) Σ_{i=1}^n X_i be their average, and σ² := (1/n) Σ_{i=1}^n ‖X_i − X̄‖² be the population variance. Fix any k ∈ {1, ..., n}, let X_{π_1}, ..., X_{π_k} be sampled uniformly without replacement from {X_1, ..., X_n} and X̄_π be their average. Then, it holds that

E[X̄_π] = X̄,   E‖X̄_π − X̄‖² = ((n − k)/(k(n − 1))) σ².

For non-convex functions, we use a different notion of data heterogeneity.
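The without-replacement variance formula E‖X̄_π − X̄‖² = ((n − k)/(k(n − 1)))σ² from Lemma 1 (Mishchenko et al., 2020) is easy to sanity-check numerically. Below is a small Monte Carlo sketch (our own illustration; the particular n, k, d are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, d = 10, 4, 3
X = rng.normal(size=(n, d))
X_bar = X.mean(axis=0)
sigma2 = np.mean(np.sum((X - X_bar) ** 2, axis=1))  # population variance

# Monte Carlo estimate of E‖X̄_π − X̄‖² under sampling without replacement
trials = 50_000
acc = 0.0
for _ in range(trials):
    idx = rng.choice(n, size=k, replace=False)       # a random k-subset
    acc += np.sum((X[idx].mean(axis=0) - X_bar) ** 2)
estimate = acc / trials

predicted = (n - k) / (k * (n - 1)) * sigma2
assert abs(estimate - predicted) < 0.05 * predicted  # agrees within MC noise
```

Note the (n − k) factor: with k = n the subset average is deterministic and the variance vanishes, unlike with-replacement sampling.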
Definition 4 (Functional dissimilarity). The variance at the optimum in the non-convex regime is defined as

Δ_* := f_* − (1/M) Σ_{m=1}^M f_{m,*},   where f_* := inf_x f(x) and f_{m,*} := inf_x f_m(x).

For each device m, the variance at the optimum is defined as

Δ_{*,m} := f_{m,*} − (1/n) Σ_{i=1}^n f_{m,*}^i,   where f_{m,*}^i := inf_x f_m^i(x).

Again, the above is a definition and not an assumption.
The concepts are well defined as long as Assumption 1 is satisfied.

The Nastya Algorithm
We now formally describe our Nastya algorithm (see Algorithm 1). Nastya combines several techniques that were empirically found to be useful in FL: partial participation, local training, data shuffling and server stepsizes.

Algorithm 1: Nastya
1: Input: initial model x_0 ∈ R^d, client stepsize γ > 0, server stepsize η > 0, cohort size 1 ≤ C ≤ M
2: for communication rounds t = 0, 1, ... do
3:   Sample a cohort S_t ⊆ {1, ..., M} of cardinality C uniformly at random
4:   Broadcast x_t to all clients m ∈ S_t
5:   for clients m ∈ S_t in parallel do
6:     Sample a permutation π_m = (π_m^0, ..., π_m^{n−1}) of {1, ..., n} (Random-Reshuffling option), or reuse a permutation sampled once at the start (Shuffle-Once option)
7:     x_{t,m}^0 = x_t
8:     for all local training data points i = 0, 1, ..., n − 1 do
9:       x_{t,m}^{i+1} = x_{t,m}^i − γ∇f_m^{π_m^i}(x_{t,m}^i)   (client m makes one pass over its local training data in the order dictated by π_m)
10:    g_{t,m} = (x_t − x_{t,m}^n)/(γn)   (client m sends its scaled model difference to the server)
11:  g_t = (1/C) Σ_{m∈S_t} g_{t,m}   (server aggregates the local update directions g_{t,m} discovered by the cohort S_t of clients)
12:  x_{t+1} = x_t − ηg_t   (server updates the model using the aggregated direction g_t and applying server stepsize η)
In each communication round t ≥ 0 of Nastya, the cohort S t is chosen as a random subset of the set {1, 2, . . ., M } of all clients.In particular, we choose a random subset of cardinality C (the cohort size), where 1 ≤ C ≤ M , uniformly at random.The server then sends the global model x t to all clients in the cohort.
Setting C = M models the full participation regime.
Each participating client m ∈ S t then performs local training using a single pass of incremental GD with client stepsize γ > 0 over the local training data points in an order dictated by a random permutation of the indices of the local training dataset {1, 2, . . ., n}.
In particular, the following update is iterated for i = 0, ..., n − 1:

x_{t,m}^{i+1} = x_{t,m}^i − γ∇f_m^{π_m^i}(x_{t,m}^i),

where x_{t,m}^0 is initialized to x_t, and γ > 0 is the client stepsize. That is, we run one pass over the local data using the RR method (Mishchenko et al., 2020). This differs from one pass over the data via SGD in that each data point is sampled exactly once.
Note that we allow for two options for how the permutation is formed: i) either the random permutation is sampled just once for all clients, and used in all communication rounds (Shuffle-Once option), or ii) the random permutation is sampled afresh at the start of each communication round (Random-Reshuffling option).Both have the same theoretical properties in our analysis.
At the end of local training, the updated models x_{t,m}^n are communicated back to the server, which uses these updates to form a gradient-type estimator g_t, and applies one step of GD using a server stepsize η > 0 with this estimator in lieu of the true gradient. Equivalently, and this is how we decided to formally state the method, each client m ∈ S_t sends the following scaled model difference to the server:

g_{t,m} = (x_t − x_{t,m}^n)/(γn),

where x_{t,m}^n is the model found by the client after one pass over the data via RR. The server then aggregates these vectors from all clients in the cohort to form g_t = (1/C) Σ_{m∈S_t} g_{t,m}, and then takes a gradient-type step using this quantity in lieu of the gradient, using server stepsize η > 0:

x_{t+1} = x_t − ηg_t.

The new model is then broadcast to a new cohort in the next communication round, and the process is repeated.
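One communication round of the method just described can be sketched as follows (a minimal toy implementation under our own naming; the gradient oracles and the quadratic test problem are illustrative assumptions, not part of the paper):

```python
import numpy as np

def local_rr_pass(x, grad_i, gamma, perm):
    """One pass of incremental GD over the local data in permuted order."""
    for i in perm:
        x = x - gamma * grad_i(i, x)
    return x

def nastya_round(x_t, clients, cohort, gamma, eta, rng):
    """One communication round: local RR passes, aggregation, server step.

    clients[m] = (grad_i, n): per-example gradient oracle and dataset size.
    """
    g_list = []
    for m in cohort:
        grad_i, n = clients[m]
        perm = rng.permutation(n)                    # Random-Reshuffling option
        x_end = local_rr_pass(x_t.copy(), grad_i, gamma, perm)
        g_list.append((x_t - x_end) / (gamma * n))   # scaled model difference
    g_t = np.mean(g_list, axis=0)                    # server aggregates
    return x_t - eta * g_t                           # server step with stepsize η

# Toy problem: every local loss is ½‖x − a‖², so ∇f_m^i(x) = x − a.
a = np.ones(3)
clients = {m: ((lambda i, x: x - a), 8) for m in range(5)}
rng = np.random.default_rng(0)
x0 = np.zeros(3)
x1 = nastya_round(x0, clients, cohort=[0, 2, 4], gamma=0.02, eta=0.5, rng=rng)
```

On this toy problem a single round moves the global model strictly closer to the common minimizer a, as expected.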

Warm-up: How to Improve Random Reshuffling
In this section, we provide the intuition behind our complexity improvements through the lens of single-node Random Reshuffling (RR). In particular, when M = 1, objective (1) recovers the standard empirical-risk minimization (ERM) problem:

min_{x ∈ R^d} f(x) := (1/n) Σ_{i=1}^n f_i(x).

The update of RR for this problem has the form

x_t^{i+1} = x_t^i − γ∇f_{π_i}(x_t^i),   i = 0, 1, ..., n − 1,   x_t^0 = x_t,

where we use a permutation π = (π_0, ..., π_{n−1}) that is randomly sampled at the beginning of epoch t. Unrolling this recursion, we get

x_t^n = x_t − γ Σ_{i=0}^{n−1} ∇f_{π_i}(x_t^i).

The key insight is that the gradients evaluated at the points x_t^i can be viewed as approximations of the gradients at the point x_t. If we denote, for simplicity,

g_t := (x_t − x_t^n)/(γn) = (1/n) Σ_{i=0}^{n−1} ∇f_{π_i}(x_t^i),

then one can show that g_t ≈ ∇f(x_t) whenever γ is small. The update of Algorithm 1 becomes much simpler and reduces to

x_{t+1} = x_t − ηg_t,

where η = (1 + β)γn. If we imagine for a moment that g_t is indeed a very good approximation of ∇f(x_t), then the theory of gradient descent suggests that one should use η ∼ 1/L, regardless of the value of γ.
Complexity improvements. By following this intuition, we can establish, as special cases of our general theory, several complexity improvements. In the strongly convex case, we obtain an O(κn log(1/ε)) complexity in terms of individual gradient computations, and in the non-convex case we improve the rate from O(ε^{-3}) to O(ε^{-2}), where A and B are defined following Mishchenko et al. (2020) as the constants from the following assumption: there exist A, B ≥ 0 such that

(1/n) Σ_{i=1}^n ‖∇f_i(x)‖² ≤ 2A(f(x) − f_*) + B²   for all x ∈ R^d.
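The claim that g_t ≈ ∇f(x_t) for small γ can be verified numerically. The sketch below (our own illustration on a random least-squares problem; all names are ours) runs one RR epoch with a tiny stepsize and compares the epoch estimator with the true gradient:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 5
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

def grad_i(i, x):
    # ∇f_i(x) for the least-squares loss f_i(x) = ½(aᵢᵀx − bᵢ)²
    return A[i] * (A[i] @ x - b[i])

def full_grad(x):
    return np.mean([grad_i(i, x) for i in range(n)], axis=0)

x_t = rng.normal(size=d)
gamma = 1e-6
perm = rng.permutation(n)
x = x_t.copy()
for i in perm:                       # one epoch of Random Reshuffling
    x = x - gamma * grad_i(i, x)
g_t = (x_t - x) / (gamma * n)        # epoch gradient estimator

# For tiny γ the estimator is very close to the full gradient at x_t.
rel_err = np.linalg.norm(g_t - full_grad(x_t)) / np.linalg.norm(full_grad(x_t))
assert rel_err < 1e-2
```

Increasing gamma degrades the approximation, which is exactly the small-client-stepsize / large-server-stepsize intuition of this section.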

Extending the Intuition to Multiple Nodes
Motivated by the example of Random Reshuffling, we can extend the complexity improvements to the case of multiple nodes. To achieve this, we utilize a large server stepsize and a small client stepsize. The main idea of this approach is again to approximate the full gradient using local passes over the clients' datasets. During each round, a worker node m computes n steps of a permutation-based algorithm and obtains local model parameters x_{t,m}^n:

x_{t,m}^{i+1} = x_{t,m}^i − γ∇f_m^{π_m^i}(x_{t,m}^i),   i = 0, 1, ..., n − 1,   x_{t,m}^0 = x_t.

If γ is not large, the sum of local steps serves as a good approximation of the full client gradient, so we define

g_{t,m} := (x_t − x_{t,m}^n)/(γn).

When the epoch ends, the server aggregates the local approximations and then computes a step with the larger stepsize, which is equivalent to averaging the final local iterates and then extrapolating in the obtained direction:

x_{t+1} = x_t − ηg_t = x̄_t^n + β(x̄_t^n − x_t),   where x̄_t^n := (1/C) Σ_{m∈S_t} x_{t,m}^n.

Above, β = η/(γn) − 1 is the extrapolation coefficient. Small client stepsizes allow us to get a better approximation of the full gradient, hence we obtain a significantly smaller variance of the stochastic steps. In the extreme case when the client stepsize goes to zero, γ → 0, the gradient estimator converges to the exact gradient, g_{t,m} → ∇f_m(x_t), and we recover the distributed gradient descent method.
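The equivalence between the server step and the averaging-plus-extrapolation view (with β = η/(γn) − 1) is a short algebraic identity, and can be checked numerically (a minimal sketch; all variable names and the random placeholders for the clients' final iterates are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
d, C = 4, 3
gamma, n, eta = 0.01, 10, 0.5
x_t = rng.normal(size=d)
x_end = rng.normal(size=(C, d))        # stand-ins for the final local iterates

# Server-step form: x_{t+1} = x_t − η·(1/C)Σ (x_t − x_end_m)/(γn)
g_t = np.mean((x_t - x_end) / (gamma * n), axis=0)
step_form = x_t - eta * g_t

# Extrapolation form: average the final iterates, then extrapolate with β
beta = eta / (gamma * n) - 1
x_avg = x_end.mean(axis=0)
extrap_form = x_avg + beta * (x_avg - x_t)

assert np.allclose(step_form, extrap_form)
```

Both forms equal (1 − α)x_t + αx̄, with α = η/(γn), so η > γn (i.e., β > 0) is literally an extrapolation past the averaged local iterates.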
Large client stepsizes, on the other hand, combine better with a small server stepsize. In that case, each local step has a big impact, and the full-gradient approximation breaks down. Since g_t no longer stays close to ∇f(x_t), we use a different analysis for this case, which shows a benefit whenever the client sampling noise is significant. This is particularly relevant to cross-device federated learning, where only a tiny percentage of clients can participate in each round.

Theory
We now formulate our three main results.
Theorem 1 (Strongly convex regime). Let Assumption 1 hold, each f_m^i be convex and f be µ-strongly convex. Let γn ≤ η ≤ 1/(16L). Then the iterates x_t generated by Algorithm 1 converge linearly up to a neighborhood of the solution (see Table 3). In the full participation regime, the server stepsize restriction can be relaxed to η ≤ 1/(8L).

Convex regime
Next, we cover the convex regime.
Table 3: The main convergence results obtained in this paper (also see Theorem 5).
Theorem 2. Let Assumption 1 hold and each f_m^i be convex. As can be seen, we get an additional source of variance which is proportional to η and σ²_*. This term reflects the variance of client sampling. Since the sampling of clients has an SGD-type structure, this variance is proportional to the first order of the server-side stepsize.

Non-convex regime
Finally, we provide guarantees in the non-convex case.
Theorem 3. Let the smoothness assumption (Assumption 1) hold. Similarly to the analysis in the full participation case, we use Δ_{*,m} and Δ_* instead of σ²_{*,m} and σ²_*, since a minimizer need not be well defined in the non-convex regime.
Client and server stepsizes.Theorems 1, 2 and 3 suggest that the server can use the large O(1/L) stepsize, where L is the Lipschitz constant of the gradient of f .In all regimes, it is optimal for the client stepsize γ to be small, which completely eliminates the second of the three terms in the complexity bounds, which controls the price one pays due to data heterogeneity.
Partial participation. Notice that if the cohort size is equal to M, then (M − C)/(C max{1, M − 1}) is equal to 0, and this means that the last (third) term in all our complexity results disappears. The last term can thus be interpreted as the price we pay for partial participation. While we can reduce the variance of RR and the client drift by decreasing γ, we cannot make the variance due to client sampling arbitrarily small, since it depends on η.
Comparison with existing rates.In Table 2 we compare our results in the strongly convex and nonconvex regimes with selected existing results.

Benefits of Small Server Stepsize
Our analysis shows that small client stepsizes can control the variance. However, using small client stepsizes means that we do not get any benefit from local steps. In some cases, our analysis shows that using a small server stepsize and large client stepsizes can be beneficial, which means that we do gain from using local steps. The advantage of local steps is obtained in the case of data reshuffling (Mishchenko et al., 2021). Moreover, the goal of learning is not to obtain the best value of the loss function, but the best performance of the model. Recent papers have shown that large stepsizes can be the better option in terms of generalization (Smith et al., 2020).
Next, we present the analysis for the case when each f_m^i is strongly convex. Assume that all losses f_m^i are L-smooth and µ-strongly convex, define α := η/(γn), and let γ ≤ 1/L and 0 ≤ α < 1. Then, for the iterates x_t generated by Algorithm 1, we obtain an upper bound (Theorem 5) involving σ²_rad, which was introduced by Mishchenko et al. (2021) and corresponds to the variance of the Random Reshuffling method. The upper bound depends on α in a nonlinear way, so the optimal value of α would often lie somewhere in the interval (0, 1). Furthermore, the last term does not change with α, so the optimal value α* of α is completely determined by the first two terms.
Let us derive the optimal α* under some approximations. In particular, for ill-conditioned problems where µ is sufficiently small, it holds that (1 − γµ)^n ≈ 1 − γµn. Ignoring the last term in the upper bound of Theorem 5, which does not affect the value of α*, and using 1/(1 − α) ≤ 2 for α ≤ 1/2, we simplify the upper bound. To make this upper bound smaller than some ε ≥ 0, we need to use α = O(Cε/(γnσ²_*)) and T = O((1/(αγµn)) log(1/ε)), where we ignore constants unrelated to α, γ, ε, µ and n. Thus, the server stepsize η = αγn should ideally be η = O(Cε/σ²_*). In other words, it is better to decrease η if only a small subset of clients is used and the variance of client sampling σ²_* is large.
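The back-of-the-envelope rule above can be turned into a tiny helper (our own sketch, under the stated approximations and with all constants ignored; the cap at 1/2 mirrors the α ≤ 1/2 simplification, not a result of the paper):

```python
def suggested_server_params(gamma, n, C, eps, sigma2_star):
    """Heuristic α and η = αγn following the discussion above.

    Uses α ≈ Cε/(γnσ²_*), capped at 1/2; a sketch for intuition,
    not a tuned rule (problem-dependent constants are dropped).
    """
    alpha = min(0.5, C * eps / (gamma * n * sigma2_star))
    return alpha, alpha * gamma * n

alpha, eta = suggested_server_params(gamma=0.1, n=100, C=10, eps=1e-4, sigma2_star=1.0)
```

Note the qualitative behavior it encodes: fewer participating clients (smaller C) or larger client-sampling variance σ²_* yields a smaller suggested server stepsize η.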

Experiments
To showcase the speed-up that can be obtained from the server-side stepsizes, we run a toy experiment in the single-node setup, i.e., we consider standard minimization of a finite sum. We combine the local passes over the data with the adaptive estimation of smoothness proposed by Malitsky and Mishchenko (2020). We run our experiment on ℓ2-regularized logistic regression with the 'mushrooms' dataset from LibSVM (Chang and Lin, 2011). The results are reported in Figure 2.
We use the standard LeNet architecture, which is a 5-layer convolutional neural network, implemented in PyTorch (Paszke et al., 2017), and train it to classify images from the CIFAR-10 dataset (Krizhevsky et al., 2009) with the cross-entropy loss. At each iteration, we use a minibatch of size 1024. For the tuned SGD, we start with stepsize 0.2 and divide it by 10 at epochs 150 and 200. For the other version, we take SGD with stepsize 0.2 and decrease it as O(1/t), where t is the epoch number. For our method, we treat the full sum of gradients over an epoch as an approximation of the full gradient and use Adam with stepsize 0.01 to improve this update. We can see from Figure 2 that by applying Adam, we can improve the performance of SGD with decreasing stepsize. At the same time, applying it to the tuned stepsize schedule only made the results much worse, so we do not report that line. This highlights that adaptive outer stepsizes are helpful when the base stepsize γ is not chosen well, which is in line with our theory.

A Basic Facts and Notation
A consequence of (7) is that for any a, b ∈ R^d, we have

‖a + b‖² ≤ 2‖a‖² + 2‖b‖².

A function h : R^d → R is called µ-convex if for some µ ≥ 0 and for all x, y ∈ R^d, we have

h(y) ≥ h(x) + ⟨∇h(x), y − x⟩ + (µ/2)‖y − x‖².

A function h : R^d → R is called L-smooth if for some L ≥ 0 and for all x, y ∈ R^d, we have

‖∇h(x) − ∇h(y)‖ ≤ L‖x − y‖.

A useful consequence of L-smoothness is the inequality

h(y) ≤ h(x) + ⟨∇h(x), y − x⟩ + (L/2)‖y − x‖²,   (12)

holding for all x, y ∈ R^d. If h is L-smooth and lower bounded by h*, then

‖∇h(x)‖² ≤ 2L(h(x) − h*).

For any convex and L-smooth function h it holds that

‖∇h(x) − ∇h(y)‖² ≤ 2L(h(x) − h(y) − ⟨∇h(y), x − y⟩).

For a convex function h : R^d → R and any vectors y_1, ..., y_n ∈ R^d, Jensen's inequality states that

h((1/n) Σ_{i=1}^n y_i) ≤ (1/n) Σ_{i=1}^n h(y_i).   (15)

Applying this to the squared norm, h(y) = ‖y‖², we get

‖(1/n) Σ_{i=1}^n y_i‖² ≤ (1/n) Σ_{i=1}^n ‖y_i‖².   (16)

Simple multiplication on both sides of (16) also yields

‖Σ_{i=1}^n y_i‖² ≤ n Σ_{i=1}^n ‖y_i‖².

We use the following decomposition that holds for any random variable X with E‖X‖² < +∞:

E‖X‖² = ‖E[X]‖² + E‖X − E[X]‖².   (18)

We will make use of the particularization of (18) to the discrete case: let y_1, ..., y_n ∈ R^d be given vectors and ȳ be their average. Then

(1/n) Σ_{i=1}^n ‖y_i‖² = ‖ȳ‖² + (1/n) Σ_{i=1}^n ‖y_i − ȳ‖².

A.2 Notation
We define the variance of the local gradients from their average at a point x_t as

σ_t² := (1/M) Σ_{m=1}^M ‖∇f_m(x_t) − ∇f(x_t)‖².

A summary of the notation used is given in Table 4.
Table 4: Summary of notation used.
Symbol      Description
x_t         The iterate used at the start of epoch t.
γ           The stepsize used when taking descent steps in an epoch.
x_{t,m}^i   The current iterate after i steps in epoch t, for 0 ≤ i ≤ n.
g_t         The aggregated direction over epoch t, such that x_{t+1} = x_t − ηg_t.
β           The epoch jumping parameter.
η           The effective epoch stepsize, defined as η = (1 + β)γn.
σ_t²        The variance of the individual loss gradients from the average loss at point x_t.

L           The smoothness constant of f and each f_m^i.

A.3 Sampling without replacement
We provide the full proof of Lemma 1.
Lemma. Let X_1, ..., X_n ∈ R^d be fixed vectors, X̄ := (1/n) Σ_{i=1}^n X_i be their average and σ² := (1/n) Σ_{i=1}^n ‖X_i − X̄‖² the population variance. Fix any k ∈ {1, ..., n}, let X_{π_1}, ..., X_{π_k} be sampled uniformly without replacement from {X_1, ..., X_n} and X̄_π be their average. Then, it holds that

E[X̄_π] = X̄,   E‖X̄_π − X̄‖² = ((n − k)/(k(n − 1))) σ².

Proof. The first claim follows by linearity of the expectation and uniformity of the sampling:

E[X̄_π] = (1/k) Σ_{j=1}^k E[X_{π_j}] = (1/k) Σ_{j=1}^k X̄ = X̄.

To show the second claim, let us first establish that for any i ≠ j,

E⟨X_{π_i} − X̄, X_{π_j} − X̄⟩ = −σ²/(n − 1),

which follows from 0 = ‖Σ_{l=1}^n (X_l − X̄)‖² = Σ_{l=1}^n ‖X_l − X̄‖² + Σ_{l≠l'} ⟨X_l − X̄, X_{l'} − X̄⟩ = nσ² + n(n − 1) E⟨X_{π_i} − X̄, X_{π_j} − X̄⟩. Therefore,

E‖X̄_π − X̄‖² = (1/k²) Σ_{i,j=1}^k E⟨X_{π_i} − X̄, X_{π_j} − X̄⟩ = (1/k²)(kσ² − k(k − 1)σ²/(n − 1)) = ((n − k)/(k(n − 1))) σ².

B Large Server Stepsize
B.1 Strongly convex and general convex case

Lemma 2. Let Assumption 1 hold, and further assume that f is µ-strongly convex and each f_m^i is convex.

Proof. We start with the inner product and decompose it using the three-point identity. Using the representation (21), L-smoothness and µ-strong convexity, we obtain the claimed bound.

Lemma 3. Assume that Assumption 1 holds; then the stated bound follows.

Proof. We start with Young's inequality, applied to f_m. We then use Young's inequality and L-smoothness again.

Lemma 4. Suppose that Algorithm 1 is used and Assumption 1 holds. If γ ≤ 1/(2Ln), then the stated bound follows.

Proof. We start from the definition of x_{t,m}^i and examine the last term, to which we apply Lemma 1. Summing the terms and choosing γ ≤ 1/(2Ln), we verify the intermediate inequality. Using Young's inequality and then L-smoothness, we obtain the result.

B.1.1 Proof of Theorem 1
Theorem. Assume that Assumption 1 holds and f is a µ-strongly convex function. Let γn ≤ η ≤ 1/(16L). Then the iterates x_t generated by Algorithm 1 satisfy the bound of Theorem 1.

Proof. We start from the definition of x_{t+1} and apply Lemma 3. Taking the conditional expectation over the sampling of S_t and using Lemma 2, we obtain a recursion. Rearranging the terms, using the tower property of conditional expectation together with Lemma 4, and taking γ ≤ 1/(16nL) and η ≤ 1/(16L), we derive a contraction. Taking the full expectation and unrolling the recursion completes the proof.

Proof (of Theorem 2). We start from equation (23) with µ = 0. Taking the full expectation and rearranging the terms, then averaging from 0 to T − 1 and applying Jensen's inequality (15), we obtain the claim.

B.3 General non-convex case
Finally, we provide guarantees in the non-convex case.
Lemma 5. Assume that Assumption 1 holds. For uniform sampling of the cohort S_t we have the following variance bound.

Proof. We start from Young's inequality and then use Jensen's inequality. Taking expectations and using Lemma 1, we get an intermediate bound. Next, we follow the steps of Proposition 2 from Mishchenko et al. (2020). Using the definition of σ, we finally get the claim.

Lemma 6. Suppose that Algorithm 1 is used and Assumption 1 holds. If γ ≤ 1/(2Ln), then the following bound holds.

Lemma 7. Suppose that there exist constants a, b, c ≥ 0 and nonnegative sequences (s_t)_{t=0}^T, (q_t)_{t=0}^T such that for any t ∈ {0, 1, …, T},

s_{t+1} ≤ (1 + a) s_t − b q_t + c.    (24)

Then if a > 0, we have a bound on min_{t=0,…,T−1} q_t, and if a = 0, we have the corresponding bound on the average of the q_t.

Proof. The first part of the proof (for a > 0) is a distillation of the recursion solution in Lemma 2 of (?), and we closely follow their proof. Let w_{−1} > 0 be arbitrary and define w_t = w_{t−1}/(1 + a), so that w_t (1 + a) = w_{t−1}. Multiplying both sides of (24) by w_t,

w_t s_{t+1} ≤ (1 + a) w_t s_t − b w_t q_t + c w_t = w_{t−1} s_t − b w_t q_t + c w_t.

Rearranging,

b w_t q_t ≤ w_{t−1} s_t − w_t s_{t+1} + c w_t.

Summing up as t varies from 0 to T − 1 and noting that the sum telescopes,

b Σ_{t=0}^{T−1} w_t q_t ≤ w_{−1} s_0 − w_{T−1} s_T + c Σ_{t=0}^{T−1} w_t ≤ w_{−1} s_0 + c Σ_{t=0}^{T−1} w_t.

Let W_T = Σ_{t=0}^{T−1} w_t. Dividing both sides by W_T, we have

(b/W_T) Σ_{t=0}^{T−1} w_t q_t ≤ w_{−1} s_0 / W_T + c.    (27)

We now separate the proof into two cases. If a > 0, the left-hand side of (27) satisfies b min_{t=0,…,T−1} q_t ≤ (b/W_T) Σ_{t=0}^{T−1} w_t q_t, and the right-hand side of (27) is bounded using the closed form of W_T.

Proof (non-convex convergence). We start from L-smoothness (12):

f(x_{t+1}) ≤ f(x_t) + ⟨∇f(x_t), x_{t+1} − x_t⟩ + (L/2) ‖x_{t+1} − x_t‖².

Utilizing Lemma 5 and taking the conditional expectation, then applying Lemma 6 together with the stepsize condition on η, we obtain the claim.
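The bookkeeping behind the telescoping step is the weight choice w_t = w_{t−1}/(1 + a). It can be checked numerically; the constants below are arbitrary illustrative values, not from the analysis, and the sequences are built to satisfy the recursion with equality so the telescoped bound becomes an identity.

```python
# Numerical check (illustrative constants) of the telescoping step in Lemma 7:
# with weights w_t = w_{t-1} / (1 + a), multiplying the recursion
#   s_{t+1} <= (1 + a) s_t - b q_t + c
# by w_t and summing over t = 0, ..., T-1 telescopes to
#   b * sum_t w_t q_t <= w_{-1} s_0 - w_{T-1} s_T + c * sum_t w_t.
a, b, c, T = 0.1, 2.0, 0.5, 50

w = [1.0]                        # w[0] stores w_{-1}; w[t+1] stores w_t
for _ in range(T):
    w.append(w[-1] / (1 + a))    # w_t = w_{t-1} / (1 + a)

# Sequences satisfying the recursion with equality make the bound an identity.
q = [0.05 * (t + 1) for t in range(T)]
s = [10.0]
for t in range(T):
    s.append((1 + a) * s[t] - b * q[t] + c)

lhs = b * sum(w[t + 1] * q[t] for t in range(T))
rhs = w[0] * s[0] - w[T] * s[T] + c * sum(w[t + 1] for t in range(T))
print(abs(lhs - rhs) < 1e-6)  # True: the telescoping identity holds exactly
```

The same weights satisfy w_t (1 + a) = w_{t−1} by construction, which is exactly the property used when multiplying the recursion by w_t.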

C Small Server Stepsize
In this section, we present a result showing when it is useful to pull back the last iterates of the local passes. In particular, we show that one can reduce the variance of FedAvg with uniform partial participation.
Theorem 5. Assume that all losses f_{m,i} are L-smooth and µ-strongly convex. Define α = η/(γn). Let γ ≤ 1/L and 0 ≤ α < 1. Then, for the iterates x_t generated by Algorithm 1, we have the following contraction.

Proof. Let us denote f_{S_t} = (1/C) Σ_{m∈S_t} f_m. We start by rewriting the distance to the optimum and then, by convexity of the squared norm, bound the two terms on the right-hand side separately. For the first term, it suffices to take the expectation over the sampling of the client cohort S_t. For the second term, we use the results of prior work on the convergence of RR, where, as shown by Mishchenko et al. (2021), σ_rad ≥ 0 is some constant bounded in terms of n² ‖∇f_m(x_*)‖² and n⁴ σ²_{*,m}.
Notice that the upper bound depends on α in a nonlinear way, so the optimal value of α often lies somewhere in the interval (0, 1). The recurrence a_{t+1} ≤ (1 − ρ) a_t + c implies by induction a_t ≤ (1 − ρ)^t a_0 + c/ρ, so by propagating the bound above back to x_0, we obtain the final rate. Notice that the last term does not change with α, so the optimal value of α is completely determined by the first two terms.
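The induction step a_t ≤ (1 − ρ)^t a_0 + c/ρ can be checked numerically; the values of ρ, c, and a_0 below are arbitrary illustrative choices.

```python
# Check (illustrative constants) that a_{t+1} = (1 - rho) * a_t + c implies
# a_t <= (1 - rho)**t * a_0 + c / rho.  Unrolling the recursion gives
#   a_t = (1 - rho)**t a_0 + c * sum_{j < t} (1 - rho)**j
#       <= (1 - rho)**t a_0 + c / rho,
# using the geometric series bound sum_{j >= 0} (1 - rho)**j = 1 / rho.
rho, c, a0, T = 0.05, 0.2, 5.0, 100

a = a0
ok = True
for t in range(T + 1):
    bound = (1 - rho) ** t * a0 + c / rho
    ok = ok and (a <= bound + 1e-12)
    a = (1 - rho) * a + c      # recursion with equality (worst case)
print(ok)  # True
```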
Here f_m(x) is the loss of model x on the training data owned by client m ∈ [M] def= {1, 2, …, M}. Typically, M is very large.


Figure 1: Illustration of the dependence between server and client stepsizes on a simple example with M = 2 clients. x_{*,1} and x_{*,2} are the minimizers of the local functions f_1 and f_2, respectively, and x_* is the minimizer of the global function f = (1/2) f_1 + (1/2) f_2. (a) In the case of small client stepsizes γ, the average of the local steps is not large, but at the same time the variance is small and the direction is close to that of the full gradient, which allows us to go further in this direction by employing a large server stepsize η. (b) In the case of large client stepsizes γ, each client step contributes to the global step, but the variance grows as well, so it is useful to use a smaller server stepsize η to reduce this variance. These intuitions are confirmed by our theory.
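The intuition of panel (a) can be reproduced on a toy problem. The sketch below is purely illustrative and not the paper's experiment: two scalar quadratic clients, a small client stepsize γ, and a larger server stepsize η applied to the averaged update direction; all constants are made up.

```python
# Toy illustration (not the paper's experiment) of the small client stepsize /
# large server stepsize regime from Figure 1: two scalar quadratic clients
# f_m(x) = (x - x_{*,m})^2 / 2 with minimizers at -1 and +1, so the global
# minimizer is x_* = 0.  Each round, every client runs n_local gradient steps
# with a small stepsize gamma; the server then moves along the averaged update
# direction with a larger stepsize eta.
minimizers = [-1.0, 1.0]
gamma, eta, n_local, rounds = 0.01, 0.5, 10, 200

x = 5.0                                   # server model
for _ in range(rounds):
    finals = []
    for xm_star in minimizers:
        xm = x
        for _ in range(n_local):          # local GD pass with stepsize gamma
            xm -= gamma * (xm - xm_star)  # gradient of (x - x_*)^2 / 2
        finals.append(xm)
    # Averaged displacement, rescaled by 1 / (gamma * n_local) as a
    # gradient estimate; the server takes a big step along it.
    g = (x - sum(finals) / len(finals)) / (gamma * n_local)
    x -= eta * g
print(abs(x))  # close to the global minimizer x_* = 0
```

Because γ is small, each local pass is nearly a scaled full-gradient step, so the large server stepsize η does not introduce instability and the iterates contract toward x_*.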
We further assume the existence of the minimizers x_* = arg min_{x∈R^d} f(x) and x_{*,m}^i = arg min_{x∈R^d} f_m^i(x).

Figure 2: Left and middle: we compare running standard Random Reshuffling (RR), adaptive gradient descent (Adaptive GD), and the combination of RR with an outer adaptive stepsize, Nastya (labeled "RR (two stepsizes)"), on logistic regression. As one can see, the variant with two stepsizes outperforms both of them and does not require more hyper-parameters than RR; the middle plot shows the exact values of γ and η. Right: the right plot shows the training curves of LeNet on CIFAR-10 with minibatch size 1024, where we compare carefully tuned SGD (blue) to poorly tuned SGD (orange) and show that using the Adam optimizer with stepsize 10^{-2} after each data pass can significantly improve the poorly tuned version.
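For intuition, the "RR (two stepsizes)" idea can be sketched on a single node. The sketch below is a simplified stand-in, not the paper's exact Algorithm 1 or experimental setup: quadratic losses, a fixed (non-adaptive) outer stepsize, and made-up constants.

```python
import random

# Minimal single-node sketch (simplified, not the paper's exact Algorithm 1)
# of "RR with two stepsizes": run one Random Reshuffling epoch with a small
# inner stepsize gamma, then treat the rescaled epoch displacement as a
# gradient estimate and take a larger outer (server-style) step eta along it.
random.seed(0)
targets = [random.uniform(-2.0, 2.0) for _ in range(20)]  # f_i(x) = (x - t_i)^2 / 2
n = len(targets)
opt = sum(targets) / n                                    # global minimizer

gamma, eta, epochs = 0.005, 0.3, 300
x = 10.0
for _ in range(epochs):
    perm = random.sample(range(n), n)     # fresh permutation each epoch (RR)
    z = x
    for i in perm:
        z -= gamma * (z - targets[i])     # inner step on f_{pi_i}
    g = (x - z) / (gamma * n)             # epoch displacement as gradient estimate
    x -= eta * g                          # outer step with the larger stepsize
print(abs(x - opt))  # small
```

Because γ is small, the end of each RR epoch lies close to the direction of the full gradient, so the outer stepsize η can be chosen much larger than γ without diverging.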

A.1 Basic facts

For any two vectors a, b ∈ R^d and any ζ > 0, Young's inequality states that

‖a + b‖² ≤ (1 + ζ) ‖a‖² + (1 + ζ^{−1}) ‖b‖².
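This basic fact is easy to spot-check numerically; the sketch below is an illustrative test over random vectors and random ζ.

```python
import random

# Numerical spot-check of Young's inequality: for any vectors a, b and any
# zeta > 0,
#   ||a + b||^2 <= (1 + zeta) ||a||^2 + (1 + 1/zeta) ||b||^2,
# which follows from 2 <a, b> <= zeta ||a||^2 + (1/zeta) ||b||^2.
random.seed(1)

def sq_norm(v):
    return sum(x * x for x in v)

holds = True
for _ in range(10_000):
    d = random.randint(1, 5)
    a = [random.gauss(0, 1) for _ in range(d)]
    b = [random.gauss(0, 1) for _ in range(d)]
    zeta = random.uniform(0.01, 10.0)
    lhs = sq_norm([ai + bi for ai, bi in zip(a, b)])
    rhs = (1 + zeta) * sq_norm(a) + (1 + 1 / zeta) * sq_norm(b)
    holds = holds and lhs <= rhs + 1e-9
print(holds)  # True
```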

Table 1: Conceptual comparison of results for FedAvg from prior work with our results.

Table 2: Comparison of convergence results for FedAvg from prior work with our results.
A random permutation of {1, . . ., n}, which is resampled every epoch for Random Reshuffling.