LoSAC: An Efficient Local Stochastic Average Control Method for Federated Optimization

Federated optimization (FedOpt), which aims to collaboratively train a learning model across a large number of distributed clients, is vital for federated learning. The primary concerns in FedOpt are model divergence and communication efficiency, which significantly affect performance. In this paper, we propose a new method, LoSAC, to learn from heterogeneous distributed data more efficiently. Its key algorithmic insight is to locally update the estimate of the global full gradient after {each} regular local model update, so LoSAC keeps clients' information refreshed in a more compact way. In particular, we establish a convergence result for LoSAC. As a bonus, LoSAC can defend against the information leakage exploited by the recent Deep Leakage from Gradients (DLG) technique. Finally, experiments verify the superiority of LoSAC over state-of-the-art FedOpt algorithms. Specifically, LoSAC significantly improves communication efficiency by more than $100\%$ on average, mitigates the model divergence problem, and is equipped with a defense against DLG.


INTRODUCTION
Federated optimization (FedOpt) is essentially distributed optimization in machine learning under the specific setting that data is unevenly distributed over a large number of clients [11,16,17,25,40]. In Section III, we first present FedSaga, a naive extension of SAGA, which only considers the local gradient estimate. Then, we incorporate the global gradient estimate and formulate LoSAC. We further extend our proposed method to a proximal version for solving a wide class of nonsmooth problems. In Section IV, we study the theoretical properties of LoSAC, which achieves global variance reduction on the search direction, is able to defend against DLG, ensures convergence, and mitigates the model divergence problem. We conduct extensive experiments to verify the effectiveness of LoSAC in Section V.

Mathematical notations: $[N]$ denotes the integer set $\{1,\dots,N\}$. $\Delta x := x^+ - x$ denotes the increment of $x$ when it takes the updated value $x^+$. The gradient operator for a smooth function is denoted by $\nabla$ and the statistical expectation by $\mathbb{E}$. We use the $\ell_2$-norm, which for simplicity is denoted by $\|\cdot\|$, and $\langle \cdot,\cdot \rangle$ is the inner product. Moreover, $f$ is called $L$-smooth if $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$, where $L>0$ is the Lipschitz constant, and $f$ is strongly convex with $\mu>0$ if $f(x) \ge f(y) + \langle \nabla f(y), x - y\rangle + \frac{\mu}{2}\|x - y\|^2$. $\mathrm{prox}_{\gamma g}(x)$ is the proximal operator defined as $\mathrm{prox}_{\gamma g}(x) = \operatorname*{argmin}_u g(u) + \frac{1}{2\gamma}\|u - x\|^2$.
2 RELATED WORK

Stochastic Optimization
Considering there are $n$ data samples ($n$ is large), stochastic optimization aims to solve
$$\min_{x} f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x),$$
where $x \in \mathbb{R}^d$ is the model and $f_i : \mathbb{R}^d \to \mathbb{R}$ is the loss function with respect to the $i$th sample. One of the most popular methods, stochastic gradient descent (SGD) [31], utilizes a small mini-batch of samples $\mathcal{I}_t$ to calculate $g(x^t) = \frac{1}{|\mathcal{I}_t|}\sum_{i\in\mathcal{I}_t}\nabla f_i(x^t)$ and updates the model via $x^{t+1} = x^t - \eta\, g(x^t)$, where $\eta$ is the step-size. Although $g(x^t)$ is an unbiased estimator of the full gradient $\nabla f(x^t)$, it may have large variance, leading to slow convergence [4]. Thus, how to control and reduce stochastic variance during mini-batch optimization is a central issue.
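As a concrete illustration of the mini-batch SGD update above, here is a minimal sketch on a toy quadratic problem (the quadratic losses and all parameter values are our own illustrative choices, not from the paper):

```python
import numpy as np

def sgd_step(x, grad_fns, batch_idx, eta):
    """One mini-batch SGD step: x <- x - eta * g(x),
    where g averages the per-sample gradients over the mini-batch."""
    g = np.mean([grad_fns[i](x) for i in batch_idx], axis=0)
    return x - eta * g

# Toy problem: f_i(x) = 0.5*(x - a_i)^2, so grad f_i(x) = x - a_i
# and the minimizer of the average loss is mean(a).
a = np.array([1.0, 2.0, 3.0, 4.0])
grad_fns = [lambda x, ai=ai: x - ai for ai in a]

x = np.zeros(1)
rng = np.random.default_rng(0)
for t in range(2000):
    batch = rng.choice(len(a), size=2, replace=False)
    x = sgd_step(x, grad_fns, batch, eta=0.1)
# x hovers near the minimizer 2.5, but with residual stochastic noise --
# the variance issue that motivates the techniques below
```

With a constant step-size, the iterates never settle exactly at the optimum; the residual fluctuation is exactly the mini-batch variance the next subsection addresses.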
Variance reduction techniques [8] have been developed to address this issue and greatly accelerate the convergence of SGD. Exemplar algorithms are the stochastic variance reduced gradient method (SVRG) [10], the stochastic average gradient method (SAG) [33] and its extension SAGA [7]. In particular, SAGA utilizes the full gradient without computing it directly; the stored gradients are instead updated with the newest partial information at each iteration:
$$x^{t+1} = x^t - \eta\Big(\nabla f_j(x^t) - \nabla f_j(\phi_j^t) + \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\phi_i^t)\Big),$$
where $\phi_i^t$ is the delayed model, updated via $\phi_j^{t+1} = x^t$ for the sampled index $j$ and $\phi_i^{t+1} = \phi_i^t$ otherwise. Let $\hat{g}^t := \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\phi_i^t)$ and $\alpha_j = \nabla f_j(\phi_j^{t-1})$. Mimicking the efficient implementation in SAG [33], SAGA can be equivalently carried out in real applications as
$$x^{t+1} = x^t - \eta\big(\nabla f_j(x^t) - \alpha_j + \hat{g}^t\big),$$
where $\hat{g}$ is updated via $\hat{g}^{t+1} = \hat{g}^t + \frac{1}{n}\big(\nabla f_j(x^t) - \alpha_j\big)$. SAGA has been shown to converge fast. Moreover, it has the same low per-iteration computation as SGD, since only a single gradient is calculated during each update step. Unfortunately, directly adapting these methods to FedOpt may not be effective, since the fast convergence may cause each client to move quickly towards its individual optimum instead of the global one [21].
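The SAGA update and its gradient-table bookkeeping can be sketched as follows, on the same illustrative toy quadratic as above (initializing the table at $x^0$ is our own choice):

```python
import numpy as np

def saga_step(x, grads_table, grad_fns, j, eta):
    """One SAGA step on sample j with stored gradients (the 'gradient table'):
    x <- x - eta * (grad_j(x) - grads_table[j] + mean(grads_table)),
    then refresh the table entry for j with the freshest gradient."""
    g_new = grad_fns[j](x)
    v = g_new - grads_table[j] + grads_table.mean(axis=0)
    x_next = x - eta * v
    grads_table[j] = g_new   # keep only the newest gradient per sample
    return x_next

# Toy problem: f_i(x) = 0.5*(x - a_i)^2, minimizer of the average is 2.5.
a = np.array([1.0, 2.0, 3.0, 4.0])
grad_fns = [lambda x, ai=ai: x - ai for ai in a]

x = np.zeros(1)
table = np.stack([g(x) for g in grad_fns])   # phi_i initialized at x^0
rng = np.random.default_rng(0)
for t in range(1000):
    j = int(rng.integers(len(a)))
    x = saga_step(x, table, grad_fns, j, eta=0.2)
# unlike plain SGD, the variance vanishes at the optimum, so x converges
# tightly to 2.5 even with a constant step-size
```

Note that only one fresh gradient is computed per step; the stored table supplies the rest, which is the "low computation level as SGD" property mentioned above.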

Federated Optimization (FedOpt)
In [30], a quantized version of FedAvg, known as FedPaq, was proposed to reduce the message overhead. Mimicking the adaptation of SGD to FedOpt, the momentum gradient descent method, a variant of SGD, has been modified for federated learning (MFL) [24]. Similarly, FedAdam [29] accommodated Adam [15] to FedOpt. However, these methods still suffer from the model divergence problem [21,38,41,43]. A possible solution is to incorporate a quadratic restriction on the model divergence, known as FedProx [20]. Another solution is the control variate for variance reduction, based on which VRL-SGD has shown faster convergence even with non-IID data [23]. However, VRL-SGD does not support client sampling, which is more practical in FL. Furthermore, with control variates, SCAFFOLD has shown significant performance improvement on the data heterogeneity problem in FedOpt [13]. Its core notion is to estimate the full gradient for the local search direction. Most recently, a framework called Mime was proposed [12], which adapts popular centralized algorithms (e.g., SGD, Adam, etc.) to FedOpt.
As mentioned in Section 2.1, naive adaptation of the variance reduction strategy to the federated setting may worsen convergence. This is because variance reduction is applied to the local gradient estimate, which is biased from the global gradient. Both SCAFFOLD and the Mime framework are motivated by the idea of global variance reduction. However, their global full-gradient estimates are kept fixed over the whole run of local iterations, so they may use outdated information, and their capability to correct the model divergence is therefore limited. In LoSAC, both the local and global information are compactly utilized to keep an accurate estimate of the global gradient, so LoSAC achieves high-quality global variance reduction. We compare the recent FedOpt methods in detail in Table 1.

LOSAC ALGORITHM
In this section, we describe LoSAC to handle the major challenges in FedOpt. We first formulate the federated optimization problem, which is to be solved collaboratively by many clients. Suppose there are $N$ clients, and each client $i \in [N]$ has the local loss function $f_i(x)$ with its own dataset $\mathcal{D}_i$ containing $n_i$ samples, i.e., $f_i(x) = \frac{1}{n_i}\sum_{j=1}^{n_i} f_{i,j}(x)$, where $f_{i,j}(x)$ is a single loss function calculated using the $j$th datum in $\mathcal{D}_i$. Moreover, the total dataset over all clients is denoted by $\mathcal{D}$, i.e., $\mathcal{D} = \cup_{i\in[N]}\mathcal{D}_i$. As in the literature [11-13, 16, 25], FedOpt aims to collaboratively solve the following empirical risk minimization problem over the $N$ clients:
$$\min_{x} f(x) = \frac{1}{N}\sum_{i=1}^{N} f_i(x),$$
where $f$ is the averaged loss function. The model $x$ as the optimization variable satisfies $x \in \mathbb{R}^d$. Moreover, the above functions satisfy $f: \mathbb{R}^d \to \mathbb{R}$, $f_i: \mathbb{R}^d \to \mathbb{R}$ and $f_{i,j}: \mathbb{R}^d \to \mathbb{R}$.

Naive Federated SAGA (FedSaga)
For better illustration of LoSAC, we start with the simple extension of SAGA to FedOpt (FedSaga). Recall that the local update step in FedAvg consists of multiple local SGD iterations; FedSaga simply replaces the local SGD with local SAGA. Specifically, given the local stochastic gradient $\nabla f_{i,j}$, FedSaga updates the local model at the $k$th local iteration via
$$x_i^{k+1} = x_i^k - \eta\big(\nabla f_{i,j}(x_i^k) - \nabla f_{i,j}(\phi_{i,j}^k) + \nabla f_i(\{\phi_i^k\})\big), \qquad (7)$$
where $\nabla f_i(\{\phi_i^k\}) := \frac{1}{n_i}\sum_{j=1}^{n_i}\nabla f_{i,j}(\phi_{i,j}^k)$, we have denoted $\{\phi_i^k\} = \{\phi_{i,j}^k\}_{j=1}^{n_i}$, and $\phi_{i,j}$ is updated via
$$\phi_{i,j}^{k+1} = x_i^k \ \text{if } j \text{ is sampled at iteration } k, \qquad \phi_{i,j}^{k+1} = \phi_{i,j}^k \ \text{otherwise}. \qquad (8)$$
A simple initial choice for $\alpha_{i,j}$ is $\alpha_{i,j} = \nabla f_{i,j}(x_i)$. As shown in (8), $\phi_{i,j}$ is a delayed version of the local model whenever the $j$th datum is not sampled, and is thus called the delayed local model. Moreover, the search direction in (7) approaches the local gradient $\nabla f_i$ as the algorithm progresses towards the optimum. Similarly to the SAGA update in (4), we denote the delayed local gradient $\tilde{g}_i := \frac{1}{n_i}\sum_{j=1}^{n_i}\nabla f_{i,j}(\phi_{i,j})$ and store $\nabla f_{i,j}(\phi_{i,j})$ as $\alpha_{i,j}$ on client $i$. Then the local model can be equivalently updated in order via (for simplicity we omit the superscript $k$)
$$x_i \leftarrow x_i^- - \eta\big(\nabla f_{i,j}(x_i^-) - \alpha_{i,j} + \tilde{g}_i\big),$$
where $x_i^-$ is the local model before the update. Hence, only $\nabla f_{i,j}(x_i^-)$ needs to be calculated for the local update, which makes FedSaga as computationally efficient as SAGA. For the global update step, the server aggregates the local models as FedAvg does, i.e., $x \leftarrow \frac{1}{|\mathcal{S}|}\sum_{i\in\mathcal{S}} x_i$. We summarize FedSaga in Algorithm 1. Server implements steps 5-7:
5: Update the global model:
6: $x \leftarrow x + \frac{1}{|\mathcal{S}|}\sum_{i\in\mathcal{S}}\Delta x_i$;
7: Sample clients $\mathcal{S} \subseteq [N]$ and transmit $x$ to each client $i \in \mathcal{S}$.

8: Clients implement steps 9-14 in parallel for $i \in \mathcal{S}$:
9: After receiving $x$, set $x_i \leftarrow x$. Calculate $\Delta x_i \leftarrow x_i - x$.
14: Client $i$ transmits $\Delta x_i$ to the server.
15: end for

3.1.1 Limitation of FedSaga. As shown in (7), variance reduction in FedSaga is realized with the local gradient $\nabla f_i(\{\phi_i\})$ on client $i$, which is biased from the global gradient $\nabla f(x)$. Moreover, the multiple local update steps make the local model quickly approach the individual local optimum (of the local loss function) instead of the global one [21] (see the empirical results, where FedSaga even performs worse than FedAvg). Since only a small portion of the clients participate in the update in each communication round, the aggregated model is further biased from the optimal global one. Intuitively, the global aggregation step can be viewed as an SGD update whose gradient contains only the partial information from the participating clients.
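To make the FedSaga round and its bias concrete, here is a hypothetical client-side sketch: several local SAGA steps from the received server model, returning the increment that is transmitted back. The toy quadratic losses are illustrative only, not from the paper:

```python
import numpy as np

def fedsaga_local_round(x_server, table, grad_fns, n_steps, eta, rng):
    """One client-side FedSaga round (illustrative sketch): run local SAGA
    steps from the received server model, then return the increment."""
    x = x_server.copy()
    for _ in range(n_steps):
        j = int(rng.integers(len(grad_fns)))
        g_new = grad_fns[j](x)
        # local-only variance reduction: the correction term uses only this
        # client's stored gradients, so the step is biased towards the
        # *local* optimum
        x = x - eta * (g_new - table[j] + table.mean(axis=0))
        table[j] = g_new
    return x - x_server   # Delta x, transmitted to the server

# toy client: f_{i,j}(x) = 0.5*(x - a_j)^2, so the local optimum is mean(a)
a = np.array([1.0, 2.0, 3.0, 4.0])
grad_fns = [lambda x, aj=aj: x - aj for aj in a]
x_server = np.zeros(1)
table = np.stack([g(x_server) for g in grad_fns])
delta = fedsaga_local_round(x_server, table, grad_fns, 200, 0.2,
                            np.random.default_rng(1))
# x_server + delta drifts to the client's own optimum (2.5), not the
# global one -- the bias discussed in Section 3.1.1
```

The faster the local SAGA steps converge, the faster the client overfits its own optimum, which is exactly why the naive extension can underperform FedAvg.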

Local Stochastic Average Control
As discussed, FedSaga uses only local information in its local update, which results in a bias from the global information. Hence, we propose LoSAC, which maintains and updates an estimate of the global information to compensate for this bias. To be specific, LoSAC updates the local model on each client $i \in \mathcal{S}$ via
$$x_i \leftarrow x_i - \eta\Big(\nabla f_{i,j}(x_i) - \nabla f_{i,j}(\phi_{i,j}) + \frac{1}{N}\sum_{l=1}^{N}\nabla f_l(\{\phi_l\})\Big), \qquad (10)$$
where $\phi_{i,j}$ is the delayed local model updated via (8). The key difference between LoSAC and FedSaga is that LoSAC uses the estimate of the global information, i.e., $\frac{1}{N}\sum_{l=1}^{N}\nabla f_l(\{\phi_l\})$, while FedSaga only uses the local one. For the local step, we denote the delayed global gradient $v := \frac{1}{N}\sum_{l=1}^{N}\nabla f_l(\{\phi_l\})$. At the beginning of the local step, the server transmits $v$ to each client $i \in \mathcal{S}$ as the initialization of $v_i$. Then updates equivalent to (10), in which $v_i$ is refreshed after each local model update, are carried out in order for multiple iterations. After the local update step, besides sending the update quantity $\Delta x_i$ to the server as in FedSaga, client $i$ also sends $\Delta v_i := v_i^+ - v_i$ for further aggregation, where $v_i^+$ is the updated $v_i$ after the local update step. Moreover, although $v_i$ contains all the delayed local gradients $\nabla f_l$, client $i$ can only update its own $\nabla f_i$ while the others remain unchanged, owing to the FedOpt setting that client $i$ only has access to its own dataset $\mathcal{D}_i$. The local update procedure is illustrated in Fig. 1.

Fig. 1. Illustration of local updates in LoSAC. First, a local dataset block is randomly chosen for the gradient calculation $\nabla f_{i,j}(x_i)$. Second, since $\alpha_{i,j}$ stores the delayed gradient $\nabla f_{i,j}(\phi_{i,j})$, the stored delayed gradient $\alpha_{i,j}$ can be replaced by the freshly calculated gradient $\nabla f_{i,j}(x_i)$. Third, with $\nabla f_{i,j}(x_i)$ and $\alpha_{i,j}$, the global gradient estimate $v_i^+$ is formed. Fourth, the local update is performed to obtain $x_i^+$. Fifth, $\alpha_{i,j}$ is updated with the gradient $\nabla f_{i,j}(x_i)$.
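A hedged sketch of the five local steps of Fig. 1 follows. The $1/(N B)$ refresh weight for $v$ is our own assumption, chosen so that swapping one stored block gradient keeps $v$ consistent with its definition as the average of delayed block gradients; the toy losses are illustrative:

```python
import numpy as np

def losac_local_round(x_server, v_server, table, grad_fns, N, n_steps, eta, rng):
    """Client-side LoSAC round (sketch): after *each* local model update the
    client refreshes its copy v of the global full-gradient estimate by
    swapping the stored block gradient for the fresh one."""
    x, v = x_server.copy(), v_server.copy()
    B = len(grad_fns)                             # number of local data blocks
    for _ in range(n_steps):
        j = int(rng.integers(B))
        g_new = grad_fns[j](x)                    # step 1: fresh block gradient
        x = x - eta * (g_new - table[j] + v)      # globally corrected update
        v = v + (g_new - table[j]) / (N * B)      # refresh global estimate
        table[j] = g_new                          # store the delayed gradient
    return x - x_server, v - v_server             # (Delta x, Delta v) to server

# single-client sanity check (N = 1), where LoSAC reduces to SAGA:
a = np.array([1.0, 2.0, 3.0, 4.0])
grad_fns = [lambda x, aj=aj: x - aj for aj in a]
x0 = np.zeros(1)
table = np.stack([g(x0) for g in grad_fns])
v0 = table.mean(axis=0)                           # delayed global gradient
dx, dv = losac_local_round(x0, v0, table, grad_fns, 1, 500, 0.2,
                           np.random.default_rng(0))
```

With $N > 1$, the $v$ term injects the other clients' (delayed) gradients into every local step, which is the "compactly refreshed" global correction that distinguishes LoSAC from FedSaga.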
For the global step, the server receives the update quantities $(\Delta x_i, \Delta v_i)$ for all $i \in \mathcal{S}$ and performs the aggregation
$$x \leftarrow x + \frac{1}{|\mathcal{S}|}\sum_{i\in\mathcal{S}}\Delta x_i, \qquad v \leftarrow v + \frac{1}{|\mathcal{S}|}\sum_{i\in\mathcal{S}}\Delta v_i.$$
While $v$ is compactly updated with local information during the local steps, it is also aggregated with global information at the server; these two levels of estimation together make $v$ an accurate estimate of the global gradient. The details of LoSAC are summarized in Algorithm 2. Remark 1. In real applications, a block of data points can be bundled for evaluating a single loss function $f_{i,j}$; in this way, the memory cost on client $i$ is $O(d\,\lceil n_i/B \rceil)$, where $B$ is the number of data points in each block.

Extension with Proximal Operator
The proximal operator has been shown to be an effective tool for solving nonsmooth, constrained, large-scale, or distributed problems [28]. In this subsection, we extend our proposed method with the proximal operator to solve a wider class of problems, such as $\ell_1$-regularized or low-rank matrix estimation problems. Specifically, the problem under the federated setting can be formulated as follows:
$$\min_{x} \frac{1}{N}\sum_{i=1}^{N} f_i(x) + \Psi(x),$$
where $x \in \mathbb{R}^d$ is the model, $f_i : \mathbb{R}^d \to \mathbb{R}$ is the local loss function, and $\Psi : \mathbb{R}^d \to \mathbb{R}$ is a nonsmooth convex regularizer.
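For intuition, the $\ell_1$ case of $\Psi$ has a closed-form proximal operator, soft-thresholding, which can be sketched as:

```python
import numpy as np

def prox_l1(x, gamma):
    """Soft-thresholding: the closed-form proximal operator of gamma*||.||_1,
    i.e. argmin_u gamma*||u||_1 + 0.5*||u - x||^2."""
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

v = np.array([3.0, -0.5, 0.2, -2.0])
w = prox_l1(v, 1.0)  # entries with |v_i| <= 1 are zeroed, others shrink by 1
```

This is the operator used inside the proximal gradient step below when $\Psi = \|\cdot\|_1$; its matrix analogue (thresholding singular values) appears in the low-rank experiments.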

7: Sample clients $\mathcal{S} \subseteq [N]$ and transmit $(x, v)$ to each client $i \in \mathcal{S}$.
Clients implement steps 8-14 in parallel for $i \in \mathcal{S}$:
8: After receiving, set $x_i \leftarrow x$ and $v_i \leftarrow v$. Calculate $\Delta x_i \leftarrow x_i - x$ and $\Delta v_i \leftarrow v_i^+ - v$.
14: Client $i$ transmits $(\Delta x_i, \Delta v_i)$ to the server.
15: end for

For example, when $f_i$ is the least-squares loss over the dataset on client $i$ and $\Psi = \|\cdot\|_1$, the problem is known as LASSO [32]. As for the local gradient-descent-based proximal step, it can be derived as
$$x^+ \leftarrow \operatorname*{argmin}_u \Psi(u) + \frac{1}{2\eta}\big\|u - \big(x - \eta\nabla f_i(x)\big)\big\|^2, \qquad (15)$$
which can be solved via $x^+ \leftarrow \mathrm{prox}_{\eta\Psi}\big(x - \eta\nabla f_i(x)\big)$, i.e., the proximal gradient descent (PGD) step. Mimicking the PGD step, we incorporate the global gradient estimate in LoSAC to replace the local gradient in (15), which subsequently leads to
$$x_i^+ \leftarrow \mathrm{prox}_{\eta\Psi}\big(x_i - \eta\, v_i^+\big). \qquad (16)$$
While adopting (16), we maintain the local estimate of $v_i$ and the global aggregation of $(\Delta x_i, \Delta v_i)$; the resulting algorithm is called LoSAC-Prox. In Section 5, we apply LoSAC-Prox to the low-rank matrix estimation problem to further show its superiority over the state-of-the-art algorithms.

THEORETICAL ANALYSIS
In this section, the theoretical analysis of LoSAC is presented. Specifically, we first establish the global variance reduction of LoSAC, showing that the variance of the search direction vanishes. We then analyze the enhanced defense ability of LoSAC against gradient leakage. Next, we study the convergence, where one challenge results from the multiple local iterations: as seen in (10), the delayed local models $\{\phi_i\}$ of client $i$ make the analysis difficult, since after a few iterations the delayed local gradient $\nabla f_i(\{\phi_i\})$ has mixed arguments. Finally, we show that LoSAC is equipped with the ability to handle model divergence.

Global Variance Reduction
The variance of the search direction in LoSAC is progressively reduced to zero, in contrast to the SGD update in FedAvg; this supports the robust convergence of LoSAC.
Compared with the recent methods MimeSVRG [12] and SCAFFOLD [13], LoSAC has the benefit of a compactly refreshed global variance reduction. Hence its convergence performance can improve upon MimeSVRG and SCAFFOLD, as demonstrated in the numerical experiments. In the following, we state the global variance reduction in Lemma 1. Lemma 1. Suppose the sequence $\{x^t\}$ generated by Algorithm 2 is expected to converge, i.e., $\mathbb{E}\|x^t - x^*\| \to 0$. Moreover, for the $i$th client, denote the search direction in the local update as $\tilde{v}_i = \nabla f_{i,j}(x_i) - \nabla f_{i,j}(\phi_{i,j}) + v_i$. Then the variance of the search direction $\tilde{v}_i$ progressively vanishes, i.e., $\mathbb{E}\|\tilde{v}_i - \mathbb{E}(\tilde{v}_i)\|^2 \to 0$.

Defense Ability against Gradient Leakage
An important benefit of LoSAC is its enhanced ability to defend against the recent technique Deep Leakage from Gradients (DLG) [45], which aims to extract private information from transmitted gradients. For illustration, we write $f_i(x; \mathcal{D}_i) = f_i(x)$ to explicitly emphasize the dependency on the input data $\mathcal{D}_i$. DLG is then defined as follows. Definition 1 (DLG [45]). For an algorithm $\mathcal{A}$, let the associated gradient be $\nabla f_i(x; \mathcal{D}_i)$ and the model parameter be $x$, and let
$$\mathcal{D}_i' = \operatorname*{argmin}_{\hat{\mathcal{D}}}\big\|\nabla f_i(x; \hat{\mathcal{D}}) - \nabla f_i(x; \mathcal{D}_i)\big\|^2.$$
If a malicious attacker (MA) is able to obtain $\mathcal{D}_i$ by finding $\mathcal{D}_i'$ above, then algorithm $\mathcal{A}$ suffers from Deep Leakage from Gradients (DLG).
Hence, with the DLG technique, the MA progressively matches the gradient $\nabla f_i(x, \mathcal{D}_i)$ by minimizing the difference between the "dummy gradient" $\nabla f_i(x, \hat{\mathcal{D}})$ and $\nabla f_i(x, \mathcal{D}_i)$. Moreover, when the optimum is reached, the MA steals the data from the $i$th client, i.e., $\mathcal{D}_i' = \mathcal{D}_i$.
According to the above definition, an MA is not able to apply the DLG algorithm to obtain the local dataset in LoSAC, for two reasons. First, the $i$th client's gradient is evaluated at many delayed local models $\{\phi_i\}$, i.e., $\nabla f_i(\{\phi_i\}, \mathcal{D}_i)$, and $\{\phi_i\}$ are stored locally, so they are difficult for the MA to obtain. Second, LoSAC transmits the increments $\Delta x_i$ and $\Delta v_i$ instead of the gradients. However, distributed SGD [45] and the Mime framework [12] communicate the local gradient together with the local model, i.e., $x$ and $\nabla f_i(x; \mathcal{D}_i)$, hence the MA is able to steal the local data $\mathcal{D}_i$ by the DLG technique.
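To make Definition 1 concrete, the following sketch mounts a gradient-matching attack in a deliberately simple setting of our own choosing, where the victim loss is $f(x; d) = \frac{1}{2}\|x - d\|^2$, so the matching problem is convex and GD provably recovers the data. Attacks on richer models need autodiff but follow the same pattern:

```python
import numpy as np

def dlg_attack(x, g_observed, steps=500, lr=0.1, seed=0):
    """Gradient-matching sketch of a DLG-style attack (toy setup): the
    victim's transmitted gradient for f(x; d) = 0.5*||x - d||^2 is x - d.
    The attacker, who also sees the model x, fits dummy data d' by gradient
    descent on the matching loss ||g_dummy - g_observed||^2."""
    d = np.random.default_rng(seed).normal(size=x.shape)  # dummy data
    for _ in range(steps):
        g_dummy = x - d                 # gradient the dummy data induces
        # gradient of the matching loss w.r.t. d is -2*(g_dummy - g_observed)
        d = d + lr * 2.0 * (g_dummy - g_observed)
    return d

x = np.array([0.5, -1.0, 2.0])        # model the attacker also observes
d_true = np.array([1.0, 3.0, -2.0])   # the client's private data point
g = x - d_true                        # gradient transmitted in plain DSGD
d_recovered = dlg_attack(x, g)
# at the matching optimum the attacker recovers d' = x - g = d_true exactly
```

When only increments $\Delta x_i$ averaged over many local steps (or gradients at hidden delayed models) are transmitted, as in LoSAC, the attacker no longer has the $(x, \nabla f(x; \mathcal{D}))$ pair this matching problem requires.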

Convergence Result
In this subsection, we study the convergence property of our proposed method. We first show the progress of each communication round in Lemma 2, for which we need the following regular assumptions. Assumptions A1 and A2 have been regularly made in the convergence analysis literature [14,21,35,36,42] and [1,21,36,42], respectively. Moreover, note that Assumptions A1 and A2 imply that the second-order moments of the gradients $\nabla f_{i,j}(x)$ and $\nabla f_i(x)$ are also bounded.
From (10), each local iteration intuitively performs an approximate gradient descent (GD) step. Hence, the local iteration progress can be bounded with reference to GD theory [22], and subsequently the one-round progress is obtained in the following. Lemma 2. Suppose the functions $f$, $\{f_i\}$ and $\{f_{i,j}\}$ are strongly convex and $L$-smooth and satisfy Assumption 1, and denote $x^*$ as the optimal point. If the number of local iterations is sufficiently large, there exist positive constants such that the one-round progress contracts towards $x^*$. It should be noted that the assumption that $f_{i,j}$ is strongly convex is strong in real applications (e.g., when $f_{i,j}$ is the loss function of a neural network), but it can be simply realized by appending an $\ell_2$ regularization term to $f_{i,j}$ to form a strongly convex function.
Based on Lemma 2, the convergence speed can be obtained in the following.
It can be seen that convergence is faster with more local iterations, although the computation complexity also increases. Furthermore, if more clients (larger $|\mathcal{S}|$) participate in model training, convergence is also faster. Moreover, our analysis does not require a bounded data heterogeneity assumption, while [13] does; this is because $v$ estimates the global information. We illustrate this intuitively in Section 4.4 and empirically in the numerical experiments.

Handling Model Divergence
LoSAC is expected to be equipped with the capability to overcome the model divergence problem resulting from data heterogeneity and client sampling in FedOpt. As an intuitive illustration, the local update step on client $i$ in expectation can be approximated as
$$\mathbb{E}\big[x_i^+\big] \approx x_i - \eta\,\nabla f(x_i),$$
where we have assumed $\nabla f_{i,j}(\phi_{i,j}) \simeq \nabla f_{i,j}(x_i)$; it mimics the full gradient descent step. To a certain extent, appending the term $\frac{1}{N}\sum_{l}\nabla f_l(\{\phi_l\}) - \nabla f_{i,j}(\phi_{i,j})$ in the local update step compensates for the deviation of $\nabla f_{i,j}(x_i)$ from the full gradient. Thus, the model divergence problem can be relatively mitigated.

EXPERIMENTS
Models. We choose 2NN [25], logistic regression and low rank matrix estimation [27] as the training models. Specifically, 2NN is a fully connected neural network with 2 hidden layers, 200 ReLU activation units in each hidden layer, and a softmax output.

Datasets.
Three real datasets are chosen for the overall performance, ablation and DLG studies, namely MNIST [18], the Human Activity Recognition Using Smartphones dataset (HAR) [3] and the Epileptic Seizure Recognition dataset (ESR) [2]. We choose MNIST since it has been widely applied in such studies. Moreover, HAR and ESR are chosen due to the increasing interest and large potential for FedOpt applications in mobile devices and healthcare, respectively. Specifically, we use 60,000 samples for training and 10,000 for testing in MNIST, 7,352 for training and 2,947 for testing in HAR, and 9,200 for training and 2,300 for testing in ESR. Moreover, we choose a synthetic dataset for the low rank matrix estimation.

Table 2. Ablation study: measured by the communication rounds for LoSAC and SCAFFOLD to reach a specific accuracy (85% for all datasets), loss and test accuracy. We consider different numbers of local iterations and local dataset divisions (the division also corresponds to the local memory size for storing the $\alpha_{i,j}$) for calculating the gradient, to show the efficient computation and communication of LoSAC. Other parameters are set as $|\mathcal{S}| = 10$ and $\eta = 10^{-4}$. Moreover, the accuracy is measured at round 500, which is sufficient for reaching a satisfactory accuracy.

Baselines. The compared baselines include FedAvg [25], FedCM [39], SCAFFOLD [13], FedADMM [34] and MimeSVRG [12]. Moreover, to show the ineffectiveness of the naive extension of SAGA to FedOpt, we also implement FedSaga. As mentioned, FedAvg improves communication efficiency over FedSGD [25]. FedCM [39] adopts the momentum strategy in FedOpt. SCAFFOLD [13] is equipped with the capability of handling non-IID data. Particularly, MimeSVRG is developed by adapting SVRG [10] to FedOpt using the Mime framework [12]. Especially, FedADMM adapts the alternating direction method of multipliers (ADMM) to FedOpt [34]; ADMM is known to conveniently and efficiently solve nonsmooth optimization problems [5].

For the default parameters of all algorithms, the numbers of clients and sampled clients per round are $(N, |\mathcal{S}|) = (1000, 50)$ for the MNIST cases and $(100, 10)$ for the HAR and ESR cases. We set the local data division to 5 and the number of local iterations to 5. The step size is tuned to yield the best possible performance for each algorithm, i.e., $\eta = 4\times10^{-4}$ for the MNIST cases and $\eta = 10^{-4}$ for the HAR and ESR cases. As for FedADMM [34], the details of FedADMM solving low rank matrix estimation are in Appendix E. Different from [25], where the local update in FedAvg traverses the whole dataset for multiple epochs, we follow [13] in that all algorithms sample a mini-batch of data for the search direction in each local update.

Evaluation tasks.
For the overall performance and the ablation study, the cross entropy loss and the classification accuracy are evaluated. Specifically, the overall performance shows the general behavior under different data and parameter settings in comparison with the baseline algorithms, while the ablation study shows the communication and computation efficiency of the algorithms in reaching a specific accuracy. Moreover, for the IID setting, the datasets are shuffled, and for the non-IID setting, the datasets are sorted by labels; the datasets are then divided evenly among the clients. Note that since MimeSVRG's performance is comparable to SCAFFOLD's while its computational complexity is twice that of SCAFFOLD and LoSAC, it is only implemented with MNIST in the parameter sensitivity study. For the DLG study, we mainly consider the similarity, measured by the Frobenius norm, between the data samples estimated by DLG and the real data samples. For low rank matrix estimation, FedADMM [34] and the proximal versions of LoSAC (LoSAC-Prox) and SCAFFOLD (SCAFFOLD-Prox) are implemented, evaluated by the recovery matrix error and the recovery matrix rank.

Data Heterogeneity.
In this subsection, we study the effect of data heterogeneity on LoSAC and show the impacts of the local and global variance reduction schemes. Specifically, FedSaga utilizes local variance reduction, while SCAFFOLD, MimeSVRG and LoSAC use the global one. Moreover, FedAvg is implemented as the benchmark. For the experimental settings, the MNIST dataset is utilized for training and testing. We choose a $(1-p\%)$ portion of the uniformly shuffled training dataset and distribute it to $N = 1{,}000$ clients. The remaining $p\%$ of the training dataset is made non-IID, namely sorted by labels, and distributed to all clients. Therefore, the larger $p\%$ is, the more heterogeneous the data. For the other parameter settings, $|\mathcal{S}| = 50$, and the step size is chosen as $\eta = 4\times10^{-4}$ for all algorithms since it shows the best performance. Moreover, the numbers of communication rounds and local iterations are $(100, 5)$. For LoSAC and FedSaga, since the local dataset is divided into blocks, the division parameter is set to 5. As Fig. 2 shows, when the data is more heterogeneous (namely, $p\%$ is larger), both FedAvg and FedSaga suffer more seriously from the data heterogeneity problem. For SCAFFOLD, MimeSVRG and LoSAC, since they all utilize the global information to correct the bias from the global model in the local update, data heterogeneity has little impact on their performance. Hence, global variance reduction is much more robust to the data heterogeneity problem than the local one. Moreover, FedCM is also affected by data heterogeneity, but not as seriously as FedAvg and FedSaga; this can be attributed to the global aggregation of the local momentum term in FedCM, which mitigates the model divergence problem.
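The $p\%$ heterogeneity split described above can be sketched as follows (the exact dealing scheme is our assumption; the text only specifies shuffling the $(1-p\%)$ part uniformly and label-sorting the $p\%$ part before distributing both to all clients):

```python
import numpy as np

def heterogeneous_split(labels, n_clients, p, rng):
    """Split sample indices over clients with heterogeneity level p in [0, 1]:
    a (1-p) fraction is shuffled uniformly, the remaining p fraction is
    sorted by label, and both parts are dealt out to all clients."""
    idx = rng.permutation(len(labels))
    cut = int((1 - p) * len(idx))
    iid_part, skew_part = idx[:cut], idx[cut:]
    # label-sorted => neighboring shards see only a few labels => non-IID
    skew_part = skew_part[np.argsort(labels[skew_part], kind="stable")]
    iid_shards = np.array_split(iid_part, n_clients)
    skew_shards = np.array_split(skew_part, n_clients)
    return [np.concatenate([a, b]) for a, b in zip(iid_shards, skew_shards)]

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=60_000)  # MNIST-sized toy label vector
shards = heterogeneous_split(labels, n_clients=1_000, p=0.5, rng=rng)
```

Setting `p=0` recovers the fully IID split and `p=1` the fully label-sorted split, so one knob sweeps the heterogeneity axis used in Fig. 2.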

Parameter Sensitivity.
We conduct extensive experiments to study the effects of the number of participating clients and the number of local iterations, which play significant roles in the convergence result of Theorem 3. The experiments use IID and non-IID settings for each evaluation respectively. The results are shown in Figs. 3∼4, Fig. 5 and Fig. 6 for the MNIST, HAR and ESR datasets respectively. In particular, the HAR and ESR datasets correspond to mobile and medical applications respectively. While we use the default settings for MNIST, we fix $N = 100$, $\eta = 10^{-4}$ and a local data division of 5 for HAR and ESR. In general, more participating clients and more local iterations lead to faster convergence and better classification performance for LoSAC, which matches well with the convergence result in Theorem 3.
In Figs. 3∼4, while our proposed method exhibits prominent performance improvements, the non-IID data significantly degrades the performance of FedAvg and MimeSVRG. Regarding the comparison between FedSaga and FedAvg, it can be seen that even with an acceleration scheme (here, variance reduction), FedSaga performs worse than FedAvg. Although the acceleration scheme is effective in centralized optimization methods, it is not effective in the naive extension of SAGA to the FedOpt setting. This may be because an acceleration scheme using only local information drives FedSaga quickly towards the local optimum instead of the global one, resulting in a large bias from the global optimum. For FedCM, more clients participating in local updates generally leads to higher performance, which can be attributed to more clients contributing more information. Moreover, the non-IID case affects the performance of FedCM, and more local steps in FedCM adversely degrade performance. This is because FedCM uses momentum acceleration in the local model update, and in the non-IID setting this accelerates model divergence and degrades performance. MimeSVRG in particular shows large fluctuations in the non-IID setting, and more local iterations lead to larger fluctuations. While both SCAFFOLD and our method exhibit a strong capability in handling the data heterogeneity problem, our method outperforms SCAFFOLD, demonstrating the strong capability of LoSAC for mitigating the model divergence problem. For HAR, Fig. 5 shows the performance of all methods with different numbers of participating clients and local iterations. In general, the results match Theorem 3: larger values of both lead to better performance. In particular, our proposed method with 2 local iterations even outperforms SCAFFOLD with 8.
This means that with only 25% of the computational complexity of SCAFFOLD, our proposed method still yields quite high performance. However, LoSAC shows large fluctuations in the first few updates. The reason may be the randomness in the delayed full gradient, which brings large variance to the initial updates, when the delayed local models $\phi_{i,j}$ differ substantially from $x_i$. As the algorithm progresses, it is expected that $\phi_{i,j} \to x_i$, and the variance begins to reduce. Fig. 5 also exhibits the cases with the ESR dataset under the same settings. Here, both SCAFFOLD and FedAvg are significantly affected by the model divergence problem, while our proposed method demonstrates remarkable performance.

Ablation Study
We have shown the overall performance under different parameter and data heterogeneity settings. In this subsection, we continue with an ablation study over different local memory sizes (corresponding to the local data division) and numbers of local iterations. We implement SCAFFOLD as the benchmark since it performs second best in the overall performance, and we adopt the non-IID setting. The step size for all cases is set to $\eta = 10^{-4}$ for the algorithms to yield the best possible performance. We set $(N, |\mathcal{S}|) = (100, 10)$ for all cases, and tune the local data division and the number of local iterations for the comparisons. In general, LoSAC shows significantly higher communication and computation efficiency than SCAFFOLD, which also demonstrates the effectiveness of the estimate of the global full gradient. The results are shown in Table 2: LoSAC requires much fewer communication rounds than SCAFFOLD to reach a given accuracy. In particular, the communication efficiency is improved by more than 300% for the ESR case and by more than 100% on average over all cases; thus, LoSAC is quite communication-efficient. This is because LoSAC estimates the global gradient more accurately than SCAFFOLD, which accelerates convergence and mitigates the model divergence problem. Table 2 further demonstrates the high computation efficiency of LoSAC. To be specific, when LoSAC runs 2 local iterations with a data division of 3, it requires communication rounds comparable to SCAFFOLD with 6 local iterations. This means that with only around 33% of the computation complexity of SCAFFOLD, LoSAC still yields higher communication efficiency.

Local memory.
For LoSAC, each client needs sufficient memory to store the delayed block gradients $\alpha_{i,j}$, depending on the partitioning of the local dataset $\mathcal{D}_i$. Hence, we evaluate different memory sizes, i.e., local data divisions of 2, 3 and 5. Table 2 indicates that a larger memory size leads to better performance for LoSAC. This may be because the local estimate of the global full gradient improves with a larger memory size and is thus less affected by the non-IID data. However, this also costs local resources. In FL applications, clients may have limited memory, e.g., mobile phones, so we suggest trading off performance against resource cost. Note that when the division is 2, the storage cost of LoSAC is $O(2d)$, the same as SCAFFOLD (which stores the control variate and the model parameter for each local iteration); even so, LoSAC yields much better performance than SCAFFOLD, i.e., more than 40% averaged performance improvement.

Fig. 6. Performance evaluations of the defense against DLG. Binary logistic regression on the federated datasets is collaboratively performed over all clients. Moreover, GD is applied to solve the DLG problem to steal the dataset on the 1st client.

Defense Against DLG
We study the defense ability of each algorithm against DLG in Figure 6. Specifically, except for LoSAC, SCAFFOLD, FedAvg and MimeSVRG, we also implement DSGD since it has been the major attack objective by DLG [19]). The real datasets MNIST, HAR and ESR are utilized for the evaluations. For simplicity, we perform each FedOpt on logistic regression for binary classification on all datasets. Specifically, for MNIST, the label is set to 0 when it is smaller or equal to 5 and 1 otherwise; for HAR, the label is set to 0 when it is smaller or equal to 3 and 1 otherwise; For ESR, it has two classes that match well with binary logistic regression. We set ( , ) = (100, 100), i.e., all the clients participate in the local update in each round. All local datasets are utilized for the search direction in each local step, i.e., = 1. Furthermore, we apply gradient descent (GD) to perform the DLG attack in Definition 1. We tune the step size in the (a) MNIST and HAR cases: = 10 −4 for all algorithms and = 10 −3 for GD in DLG, (b) ESR case: = 10 −3 for all algorithms and = 10 −3 for GD in DLG. For each case, the GD in DLG has the iteration number 100. For simplicity without loss of generalization, we perform the DLG attack aiming to obtain the 1st client's data samples D 1 in the 5th round for each algorithm in all cases. We denoteD 1 as the obtained data samples by DLG, and use the metric ∥D 1 − D 1 ∥ for performance evaluations. Therefore, smaller value of the metric ∥D 1 − D 1 ∥ means more successful for DLG to obtain the local dataset D 1 . As shown in Fig. 6, both DSGD and MimeSVRG are vulnerable to the DLG since the estimated data samples are more and more approaching to the true data samples. This is because they transmits the gradients and the corresponding models, which satisfies exactly the DLG attack in Definition 1. 
As for FedAvg, SCAFFOLD and LoSAC, DLG can only attack the term (1/ηK)·(x^{t+1} − x^t), which is essentially the averaged gradient over the K local steps. Under DLG attacks, the estimated data samples in FedAvg, SCAFFOLD and LoSAC diverge more and more from the true data samples, which demonstrates the ability of these algorithms to defend against DLG.
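The gradient-matching attack described above can be sketched in a few lines: the attacker observes the true gradient of a binary logistic loss at a known model and runs GD on a dummy sample to match that gradient. This is a minimal numpy sketch assuming a single sample with a known label; all variable names and the analytic matching-loss gradient are illustrative, not the exact setup of [19].

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_grad(w, x, y):
    # Gradient of the binary logistic loss w.r.t. w, with s = sigmoid(w.x).
    return (sigmoid(w @ x) - y) * x

rng = np.random.default_rng(0)
d = 5
w = rng.normal(size=d) * 0.5
x_true, y = rng.normal(size=d), 1.0
g = logistic_grad(w, x_true, y)       # the gradient leaked to the attacker

x_hat = rng.normal(size=d) * 0.1      # attacker's dummy data sample

def match_loss(x):
    # DLG objective: squared distance between dummy and leaked gradients.
    return np.sum((logistic_grad(w, x, y) - g) ** 2)

loss0 = match_loss(x_hat)
for _ in range(300):                  # GD on the gradient-matching loss
    s = sigmoid(w @ x_hat)
    r = (s - y) * x_hat - g           # residual of the gradient match
    # Analytic gradient: J^T r with Jacobian J = (s-y) I + s(1-s) x_hat w^T.
    grad = 2.0 * ((s - y) * r + s * (1 - s) * (x_hat @ r) * w)
    x_hat -= 0.05 * grad
print(match_loss(x_hat) < loss0)      # the matching loss decreases
```

When only a multi-step model increment is transmitted, as in FedAvg, SCAFFOLD and LoSAC, the observed quantity is no longer a single-sample gradient, so this matching objective no longer identifies the data, consistent with the divergence observed in Fig. 6.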

Low Rank Matrix Estimation
We further verify LoSAC for solving the low rank matrix estimation problem via comparisons with SCAFFOLD. The problem can be formulated as follows: minimize f(X) + λ∥X∥_*, where f(X) = Σ_{j=1}^{n} (⟨A_j, X⟩ − b_j)², A_j ∈ R^{d×d} is a data sample, b_j is the noisy observation, and ∥·∥_* denotes the nuclear norm of a matrix, which promotes a low rank solution X*. The problem (20) targets recovering the low rank matrix X ∈ R^{d×d} from the noisy observations b_j. The proximal version of the state-of-the-art algorithm SCAFFOLD, namely SCAFFOLD-Prox, and FedADMM are studied as comparisons. In particular, FedADMM is known to be a convenient and efficient solver for nonsmooth optimization problems [5,34]. For LoSAC-Prox and SCAFFOLD-Prox, solving the above problem via the proximal operation amounts to X* ← argmin_W λ∥W∥_* + (1/2)∥W − (X − η∇̂f(X))∥², where ∇̂f(X) denotes the global gradient estimate. This subproblem can be solved via X* = U · diag(prox_{λ∥·∥_1}(σ)) · Vᵀ, where U, σ and V are conveniently obtained via the singular value decomposition of the matrix (X − η∇̂f(X)). For FedADMM [34] solving low rank matrix estimation, the details are in Appendix E.
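The proximal step described above, i.e., soft-thresholding the singular values of the gradient-step matrix, can be sketched as follows. This is a minimal numpy sketch; the function name and threshold value are illustrative.

```python
import numpy as np

def nuclear_prox(Z, thresh):
    """Proximal operator of thresh * nuclear norm at Z:
    soft-threshold the singular values of Z and rebuild the matrix."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    s_thr = np.maximum(s - thresh, 0.0)   # prox of the l1 norm on the spectrum
    return U @ np.diag(s_thr) @ Vt

rng = np.random.default_rng(1)
Z = rng.normal(size=(8, 8))               # stands in for X - eta * grad_f(X)
X_star = nuclear_prox(Z, thresh=1.5)
# Singular values below the threshold are zeroed, so the output is low rank.
print(np.linalg.matrix_rank(X_star) <= np.linalg.matrix_rank(Z))
```

This is why the proximal update naturally drives the iterates toward low rank solutions: every singular value smaller than the threshold is removed exactly, not merely shrunk.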
We evaluate the algorithms on a synthetic dataset with a known ground truth X̄ ∈ R^{d×d}, generated as follows: each element of a data sample A_j is drawn from N(0.1, 1), and the observation b_j ∈ R is generated by b_j = ⟨A_j, X̄⟩ + N(0, 0.1). Specifically, we set d = 64 and study five scenarios of the matrix rank, i.e., rank(X̄) ∈ {32, 16, 8, 4, 2}. We set the number of clients to 100 with 10 participating per round, and the local dataset division to M = 5 for each algorithm. We tune the step size to 2 × 10^-3 for each algorithm to yield the best possible performance. We adopt two metrics for performance evaluation, namely the recovery error ∥X − X̄∥ and the rank of the recovered matrix rank(X) (counting the number of singular values greater than 10^-3).
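The synthetic data generation described above can be sketched as follows. The rank-r ground truth is built here as a product of two Gaussian factors, which is one common construction assumed for illustration; the paper does not specify how the low rank matrix is constructed.

```python
import numpy as np

rng = np.random.default_rng(2)
d, r, n = 64, 4, 20   # dimension, target rank, number of observations

# Rank-r ground truth from two Gaussian factors (illustrative construction).
X_bar = rng.normal(size=(d, r)) @ rng.normal(size=(r, d))

# Data samples with entries drawn from N(0.1, 1), as described above.
A = rng.normal(loc=0.1, scale=1.0, size=(n, d, d))

# Noisy linear observations b_j = <A_j, X_bar> + N(0, 0.1).
b = np.einsum('nij,ij->n', A, X_bar) + rng.normal(scale=0.1, size=n)

print(np.linalg.matrix_rank(X_bar))  # -> 4
```

The same recipe with r in {32, 16, 8, 4, 2} reproduces the five rank scenarios used in the evaluation.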
The results are shown in Fig. 7. It can be seen that although ADMM has been shown to solve nonsmooth optimization problems conveniently and efficiently, its performance is substantially degraded under FedOpt settings (e.g., partial client participation). As for SCAFFOLD, although it achieves satisfactory results in the overall performance evaluation and the ablation study, it performs poorly in low rank matrix estimation and cannot precisely recover the ground truth matrix: the recovery error is large and the rank of the estimated matrix differs significantly from that of the ground truth. LoSAC-Prox, in contrast, shows clear superiority in solving the problem: the recovered matrix matches the ground truth X̄ very well, i.e., it attains the smallest recovery error and exactly the same rank as X̄.

CONCLUSION
Due to data heterogeneity and client sampling, the performance of federated optimization suffers from degradation. Although several works have attempted to mitigate these problems, none of them addresses them fully. We proposed a new FedOpt algorithm, LoSAC, which handles these challenges by compactly estimating the global gradient. Moreover, we extended LoSAC to its proximal version for solving a wider class of problems. We demonstrated the effectiveness of LoSAC via theoretical guarantees and empirical studies: LoSAC strongly mitigates the model divergence problem and achieves higher communication and computation efficiency than state-of-the-art methods. Especially in the low rank matrix estimation problem, LoSAC demonstrated superior performance over SCAFFOLD and FedADMM, i.e., it recovers the true matrix very well and with the exact rank. It is also worth mentioning that LoSAC can defend against information leakage from the gradients.

A USEFUL LEMMAS
For proving Lemma 2, we provide the following useful lemmas. First, we present Lemma 4, whose relaxed triangle inequalities provide important tools for evaluating the update progress in each round. Based on Lemma 4, the inequalities in Lemma 5 can be derived.

Lemma 4. For vectors {a_1, . . . , a_n} in R^d, the following inequalities hold: ∥Σ_{i=1}^{n} a_i∥² ≤ n Σ_{i=1}^{n} ∥a_i∥², and, for any two vectors a and b, ∥a + b∥² ≤ (1 + α)∥a∥² + (1 + 1/α)∥b∥², where α > 0.
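The relaxed triangle inequalities referenced in Lemma 4, in their standard form, are ∥Σ_i a_i∥² ≤ n Σ_i ∥a_i∥² and ∥a + b∥² ≤ (1 + α)∥a∥² + (1 + 1/α)∥b∥² for α > 0. A quick numerical sanity check on random vectors (illustrative only, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 6, 10
a = rng.normal(size=(n, d))

# First inequality: ||sum_i a_i||^2 <= n * sum_i ||a_i||^2.
lhs1 = np.sum(a.sum(axis=0) ** 2)
rhs1 = n * np.sum(a ** 2)
print(lhs1 <= rhs1)

# Second inequality: ||u + v||^2 <= (1+alpha)||u||^2 + (1+1/alpha)||v||^2.
u, v, alpha = a[0], a[1], 0.7
lhs2 = np.sum((u + v) ** 2)
rhs2 = (1 + alpha) * np.sum(u ** 2) + (1 + 1 / alpha) * np.sum(v ** 2)
print(lhs2 <= rhs2)
```

Both bounds hold for every choice of vectors and every α > 0, which is what makes them convenient for splitting the per-round error terms in the convergence analysis.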
Lemma 5. Suppose f is a strongly convex function with parameter μ > 0 and has an L-smooth Lipschitz gradient; then the inequalities (24) and (25) hold, where the auxiliary constants therein are positive.
Proof: Lemma 5 can be derived from Lemma 4 and the definitions of L-smoothness and strong convexity. Specifically, since f is strongly convex and has an L-smooth gradient, it holds that L ≥ μ, where the two underlying inequalities follow from the definitions of L-smoothness and strong convexity respectively. Combining the two inequalities, and using the triangle inequality, yields (27). We further apply the triangle inequality once more and substitute the result into (27), which leads to the desired result in (24). For the second inequality in (25), the corresponding bound (29) follows from the first inequality in (27). Furthermore, the triangle inequality with a positive parameter is utilized, which yields (30). By combining (29) and (30), we obtain the desired result in (25), and this completes the proof of Lemma 5.

B PROOF OF LEMMA 2
Proof. The proof of Lemma 2 is based on the results of Lemmas 4 and 5, which are given in Appendix A. We first recall that the inequalities therein hold respectively. Then we start by evaluating the increment Δ within the local iterations. Next, we aim to evaluate the progress in one round. To be specific, we first expand the squared one-round error term, after which the upper bound can be further derived; by collecting the constants, we obtain the bound in which we have adopted the relaxed triangle inequality in Lemma 4. Using the Lipschitz continuity condition then yields the next bound, where the equality uses the fact that the optimal point satisfies the first-order condition, i.e., the average of the clients' gradients at the optimum is zero. Moreover, the above inequalities include squared-distance terms between the randomly selected local iterates and the optimum; evaluating them directly is extremely difficult since those iterates are randomly selected and updated. Since both the gradient variance and the second-order moment are bounded, it follows from (32) that the bound (38) holds for any pair of points in the domain of f. Therefore, we use (38) to bound these terms. Moreover, to bound the last term, the one-iteration progress of the client can be expanded into a sum of the squared increment, the squared previous error, and twice their cross term, which has the same form as (34). We omit the duplicated procedure and use the bound E∥· − *∥² ≤ M. Hence, by combining (35) and (38), the desired result can be derived. For the residual term to vanish, a condition of the form given in (67) must be satisfied; in fact, it may be satisfied when the corresponding parameter exceeds the stated threshold. For the convergence study, we assume all these conditions hold for simplicity.
Sample a subset of clients S and transmit the current global variables to each client i ∈ S.

11: Obtain the two local updates respectively via the local subproblems.
12: end for
13: Set the local variables to their newly updated values for i ∈ S, and keep them at their values from the previous round for i ∉ S.
14: Client i transmits its update to the server.
15: end for

We have illustrated FedADMM; next, the experimental settings of FedADMM for low rank matrix estimation in Section 5.5 are described. Specifically, the penalty parameter is set to 10^-4 on each client and to 10^-4 on the server. Moreover, the regularization parameter is tuned to 5. For solving (82), 20 proximal steps of (81) are performed.