Federated Learning with Label-Masking Distillation

Federated learning provides a privacy-preserving way to collaboratively train models on data distributed over multiple local clients via the coordination of a global server. In this paper, we focus on label distribution skew in federated learning, where, due to differing user behavior across clients, label distributions differ significantly between clients. In such cases, most existing methods lead to suboptimal optimization because they inadequately utilize the label distribution information in clients. Motivated by this, we propose a label-masking distillation approach termed FedLMD to facilitate federated learning by perceiving the label distribution of each client. We classify labels into majority and minority labels based on the number of examples per class during training. The client model learns the knowledge of majority labels from local data. The distillation process masks out the predictions of majority labels from the global model, so that it can focus on preserving the minority-label knowledge of the client. A series of experiments shows that the proposed approach achieves state-of-the-art performance in various cases. Moreover, considering the limited resources of clients, we propose a variant, FedLMD-Tf, that does not require an additional teacher and outperforms previous lightweight approaches without increasing computational costs. Our code is available at https://github.com/wnma3mz/FedLMD.


INTRODUCTION
The development of multimedia technology and its various emerging commercial applications have sparked global discussions on the ethics of artificial intelligence [21]. Among these discussions, privacy and security issues have become a key concern for society [29]. Artificial intelligence technology relies heavily on user data uploaded to central servers, which could lead to the leakage of sensitive personal data [30]. The centralized collection and use of massive amounts of personal data pose serious threats to individual privacy.
Once data are breached or misused, the consequences can be devastating. Additionally, countries worldwide have enacted laws and regulations, such as the European Union's General Data Protection Regulation (GDPR), to restrict such behavior [2,32]. Therefore, the multimedia field needs to improve on centralized model training to gain public recognition and address such concerns.
Federated learning (FL) [28] has been proposed as a feasible solution for jointly training models on distributed data from multiple parties or clients in a privacy-preserving manner. It generally applies a server as a coordinator to communicate parameters (gradients or weights of the model) between each client and the server, sharing knowledge rather than data among clients. Since the data stay local, FL is considered a privacy-preserving algorithm. It has shown promising results in multimedia applications such as person re-identification [51,52], medical images [24,27], emotion prediction [33], and deepfake detection [12].
In the classical FL algorithm FedAvg, the uploaded model parameters are weighted and averaged to implicitly exchange the knowledge of each client. This works well when the data distributions across clients are identical. However, realistic data distributions usually differ across clients [14], i.e., the data are non-independent and identically distributed (Non-IID). This means that the optimization goals of the client models diverge, so the server-side model becomes much harder to optimize and may even fail to converge [23]. In this paper, we focus on a more specific case, label distribution skew. For instance, diseases can be divided into several class labels according to severity, and small clinics in rural areas usually have more examples of minor diseases but fewer or no examples of severe diseases compared to large hospitals. For convenience, we call this realistic scenario the label heterogeneity case.
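Label distribution skew of this kind is commonly simulated by partitioning a dataset with Dirichlet-sampled per-class proportions (the LDA partition strategy used later in the experiments); a smaller concentration parameter yields stronger skew. The sketch below is our own illustration, not the paper's code; the function name and defaults are ours.

```python
import numpy as np

def dirichlet_label_skew(labels, num_clients, alpha, seed=0):
    """Partition sample indices across clients with Dirichlet-sampled
    per-class proportions; a smaller alpha yields stronger label skew."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        props = rng.dirichlet(alpha * np.ones(num_clients))
        # Split this class's samples at the cumulative proportion boundaries.
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return client_indices
```

With a small alpha (e.g., 0.05), most clients end up holding samples from only a few classes, reproducing the clinic-versus-hospital imbalance described above.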
To address this challenge, some researchers improve FedAvg in terms of weight assignment during aggregation and the server-side aggregation scheme [11,13,26,42]. Compared to server-side optimization, client-side optimization is often more effective and straightforward because the data reside on the client side. Existing client-side methods usually impose regularization constraints on the model output or on the model parameters themselves [1,19,20,23,43,45]. Although these methods can alleviate the challenge to a certain extent, they do not effectively utilize the useful information of the varying label distributions in clients under large label heterogeneity, leading to suboptimal optimization. This information is crucial, as it determines the severity of label heterogeneity. Thus, it is necessary to explore an effective solution to a key problem: how to exploit the information of label distributions in various clients to perform stable and effective FL?
Revisiting the training process on a particular client in classical FL (Fig. 1), the learned model is prone to be biased toward the majority class labels and to forget the absent (or minority) class labels under the label heterogeneity case. To let the model learn about minority labels without additional communication, we propose an approach named Label-Masking Distillation for federated learning (FedLMD), which perceives the label distribution of each client. Knowledge distillation (KD) has been shown to extract dark knowledge from models and thus reduce the risk of catastrophic forgetting in FL [8,9,19,46]. As in previous studies, we use the local model as the student, while the global model, which is updated from multiple client models, is considered the teacher with more comprehensive label knowledge. To achieve a more effective distillation process, we employ label-masking distillation on the client-side model. We classify the labels into majority and minority labels based on the number of examples per label during training. The model can easily learn the knowledge of the majority labels because they have sufficient samples. However, the knowledge of the minority labels is prone to being forgotten by the model [19]. Therefore, to preserve minority knowledge in the model, we distill only the minority part of the global model to the client-side model. Specifically, we decouple the logits of the global model into two parts, majority and minority, and mask out the majority part of the global logits. Overall, the client-side model learns knowledge from two sources: majority labels from the local data and minority labels from the global model.
When FL is deployed in real-world applications, client-side resources have to be seriously considered [37]. Therefore, we further optimize the computational cost and storage footprint of the proposed approach. We found that even a teacher model with poor performance can still help local models in FL. We therefore replace the teacher logits with a fixed vector, as demonstrated in [48]. Since it does not require an additional teacher model, we name this variant FedLMD-Tf.
In summary, our main contributions are three folds.
• We revisit the problem caused by label heterogeneity through a simple experiment and find that the main reason why local models are prone to be biased is the lack of supervision information from minority labels.
• We propose FedLMD for the label heterogeneity case. By decoupling the logits of the teacher model and masking out the majority part, the proposed approach retains forgotten label knowledge for clients by distilling knowledge from the minority part.
• We conduct a series of experiments showing that FedLMD outperforms state-of-the-art methods in classification accuracy and convergence speed. We also propose FedLMD-Tf, which consistently outperforms previous lightweight federated learning methods.

RELATED WORK
Federated Learning on Non-IID Data. One of FL's significant current challenges, data heterogeneity, can lead to difficulties in model convergence [22]. Optimization can be performed on the server side and the client side, respectively. Server-side methods focus on improving the robustness of the global model by improving the aggregation scheme [11,34,42,44,47,49].
Client-side optimization focuses on constraining model updates to avoid catastrophic forgetting. FedProx [23] constrains the optimization of the local model by computing an L2 loss between the local and global model parameters. Similarly, FedDyn [1] and FedCurv [39] build improvements on the relationship between the model parameters. SCAFFOLD [15] corrects the local updates by introducing control variates, which are also updated by each client during local training. On this basis, FedNova [43] automatically adjusts the aggregation weights and the effective local steps according to local progress. Unlike these methods, FedRS [25] adds a scaling factor to the softmax function, using information about the data distribution to restrain updates of the model parameters corresponding to the missing classes.
Knowledge Distillation in Federated Learning. KD [10] is considered capable of extracting dark knowledge from the teacher. It can be applied from both server-side and client-side perspectives. Some researchers exploit the availability of multiple models on the server side of FL to perform integrated multi-teacher KD [3,4,6,26,31,36,38,40,41].
From the client-side perspective, some studies use data-free KD to expand the local dataset so that the model has access to sufficient data examples during training [49,50]. However, they cause additional communication overhead and may also result in privacy leakage. Alternatively, KD can improve the performance of the local model by extracting the dark knowledge of the global model. MOON [20] constrains model training by constructing a contrastive loss between the local model and the global model. FedNTD [19] mitigates catastrophic forgetting by removing the target label from the output of the global model when it serves as the teacher for distilling the local student model. While these methods effectively mitigate the challenge of data heterogeneity and introduce neither additional communication overhead nor privacy risks, they impose additional computational overhead on the client side.
In particular, it should be noted that FedNTD [19] is the method most similar to ours. FedNTD preserves global knowledge, while our approach focuses on preserving the minority-label knowledge corresponding to what each client forgets. Unlike FedNTD, which only masks out the target class in the teacher model output, we mask out the locally majority labels in the teacher model output from the perspective of label distribution, achieving more effective knowledge retention. Considering the problem of limited client-side resources, we also derive a lightweight version of the proposed approach with no additional overhead.

CHALLENGE REVISITING
To better understand the challenge caused by label heterogeneity, we first experimentally revisit the problem encountered by FedAvg during training. The results are shown in Fig. 2, where a darker color indicates a greater number of samples for the corresponding label. In Fig. 2 (Top), we present the total number of training examples for each label in the uploaded clients under different communication rounds, which clearly shows that the label distribution varies a lot during training.
In Fig. 2 (Middle), we show the prediction distribution of FedAvg under different rounds, which reflects the instability of its optimization. Notice that the class labels with the most training examples severely affect the prediction distribution, making the model biased toward the majority class labels and forgetting the minority class labels. Specifically, in the 9-th round, when class label 7 has the most examples, the prediction distribution of the model is largely biased toward class label 7. Although the model is relatively less affected by the heterogeneity at the later stage of training (e.g., after 100 rounds), the bias toward majority labels still exists. Therefore, the main reason why local models are prone to be biased in such cases is the lack of supervision information from minority class labels, which inspires us to introduce the information of minority class labels into supervision.
By perceiving the label distributions, as shown in Fig. 2 (Bottom), our FedLMD approach can well resist the bias toward majority labels, leading to stable and effective optimization. The color depth of different labels tends to be the same at the later stage of training, meaning that the prediction label distribution achieved by FedLMD is close to the uniform distribution.

PROPOSED METHOD

4.1 Problem Setting
We consider a classical supervised FL system that contains a server and $N$ clients. The $k$-th client has a local dataset $\mathcal{D}_k = \{(x_i, y_i)\}_{i=1}^{n_k}$, where $(x_i, y_i) \in (\mathcal{X}_k, \mathcal{Y}_k)$, and the weight parameters of its model are $w_k$. The goal of FL is to obtain a global model by jointly training all clients as follows:

$$\min_{w_g} \sum_{k=1}^{N} \frac{n_k}{n} \mathcal{L}_k(w_k),$$

where $w_g$ is the weight of the global model and $\mathcal{L}_k$ is the loss function for training the $k$-th client model. On the server side, the FL system aggregates all uploaded model weights. In each communication round, the clients specified in $\mathcal{K}$ train and upload parameters, where $|\mathcal{K}|$ is the number of uploaded models.
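The server-side aggregation is a dataset-size-weighted average of the uploaded client weights. A minimal sketch with NumPy arrays standing in for model parameters (names are ours, not the paper's implementation):

```python
import numpy as np

def fedavg_aggregate(client_weights, client_sizes):
    """FedAvg aggregation: average each parameter across clients,
    weighted by the local dataset size n_k / n."""
    total = float(sum(client_sizes))
    return {
        name: sum((n / total) * w[name] for w, n in zip(client_weights, client_sizes))
        for name in client_weights[0]
    }
```

A client holding three times as much data contributes three times the weight to the average, which is exactly why skewed label distributions on large clients can dominate the global model.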
As mentioned before, label heterogeneity can make the local model biased toward the majority labels, leading to unstable and poor optimization. Our goal is to facilitate stable and effective FL by perceiving the label distribution of each client.

4.2 Label-Masking Distillation
First of all, we divide all class labels into majority labels and minority labels. When $n_{k,c} \ge n_k / C$, class label $c$ is a majority label in the $k$-th client, where $n_k$ is the total number of samples over all classes, $n_{k,c}$ is the number of samples of class $c$ in the $k$-th client, and $C$ is the number of classes. In this section, for convenience of exposition, we assume the majority labels are all in $\mathcal{Y}_k$ on the $k$-th client.
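Assuming the per-class average $n_k / C$ as the majority threshold (our reading of the garbled condition above), the split follows directly from the local label counts:

```python
import numpy as np

def split_majority_minority(class_counts):
    """Return (majority, minority) label indices for one client:
    a class is a majority label if its count reaches the per-class
    average n_k / C of that client."""
    counts = np.asarray(class_counts, dtype=float)
    threshold = counts.sum() / len(counts)
    majority = np.flatnonzero(counts >= threshold)
    minority = np.flatnonzero(counts < threshold)
    return majority, minority
```

Note that absent classes (count 0) always land in the minority set, matching the paper's observation that absent labels are the ones most at risk of being forgotten.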
For a training example $(x, y) \in (\mathcal{X}_k, \mathcal{Y}_k)$, let $q_k$ be the output of the $k$-th local model, $q_g$ the output of the global model, and $\mathbf{1}_y$ the one-hot vector form of $y$. KD [10] extracts dark knowledge with the loss

$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}}(q_k, \mathbf{1}_y) + \beta\, \mathcal{L}_{\mathrm{KD}}(q_k, q_g),$$

where $\mathcal{L}_{\mathrm{CE}}$ is the cross-entropy loss for learning the majority-label knowledge, and $\mathcal{L}_{\mathrm{KD}}$ is the distillation loss for retaining the knowledge of all labels. Here, we fix the weight of $\mathcal{L}_{\mathrm{CE}}$ to 1, and $\beta$ is a weighting factor controlling the distillation loss.
Although $\mathcal{L}_{\mathrm{KD}}$ can learn from the $k$-th client's data and counteract the bias against minority labels, it performs the regularization without considering the varying label distributions across clients, leading to suboptimal optimization. Hence, we improve it by enhancing the KD for minority labels via perceiving the label distributions. We decouple the logits of the global model into two parts: majority and minority. The majority part of the teacher logits corresponds to majority labels and, naturally, the minority part corresponds to minority labels. We focus the distillation on the minority part of the teacher logits by masking out the majority part, because the majority-label knowledge can be learned from $\mathcal{L}_{\mathrm{CE}}$. This leads to a modified teacher distribution $q'_g$:

$$q'_{g,c} = \begin{cases} \dfrac{\exp(z_{g,c}/\tau)}{\sum_{j \notin \mathcal{Y}_k^{\mathrm{maj}}} \exp(z_{g,j}/\tau)}, & c \notin \mathcal{Y}_k^{\mathrm{maj}}, \\[2mm] 0, & c \in \mathcal{Y}_k^{\mathrm{maj}}, \end{cases}$$

where $z_{g,c}$ is the logit of the global model for the $c$-th class label, $\tau$ is a temperature factor, and $\mathcal{Y}_k^{\mathrm{maj}}$ denotes the majority labels of the $k$-th client. We mask out the majority labels in the teacher logits (setting them to 0), which encourages the student model to learn from the knowledge of the minority labels rather than all labels, and helps prevent forgetting this knowledge.
For the student model's predictions $q_k$, a straightforward option is to leave them unchanged. However, this leads to a conflict between $\mathcal{L}_{\mathrm{CE}}$ and the distillation loss, because the teacher's probability for the target label is 0 in distillation (Eq. 3) while the one-hot vector $\mathbf{1}_y$ is 1 in $\mathcal{L}_{\mathrm{CE}}$. Therefore, we mask out the target label in the student model to avoid such conflicts. Additionally, for the majority not-target labels, the student's performance can be further improved by learning from negative supervision [7,16]. Therefore, we modify the distribution from the student model as

$$q'_{k,c} = \begin{cases} \dfrac{\exp(z_{k,c}/\tau)}{\sum_{j \neq y} \exp(z_{k,j}/\tau)}, & c \neq y, \\[2mm] 0, & c = y, \end{cases}$$

where $z_{k,c}$ is the logit of the $k$-th client model for the $c$-th class label.
Then, the improved loss is

$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}}(q_k, \mathbf{1}_y) + \beta\, \mathcal{L}_{\mathrm{LMD}},$$

where the label-masking distillation loss $\mathcal{L}_{\mathrm{LMD}}$ is defined as the Kullback-Leibler divergence between $q'_g$ and $q'_k$:

$$\mathcal{L}_{\mathrm{LMD}} = \mathrm{KL}(q'_g \,\|\, q'_k).$$

The framework of FedLMD is shown in Fig. 3.
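The masked teacher and student distributions and the combined loss can be sketched numerically as follows. This is our NumPy illustration of the method as we read it (function names, the per-example form, and defaults are ours); a batched PyTorch version would be analogous.

```python
import numpy as np

def masked_softmax(logits, masked, tau=1.0):
    """Temperature softmax over the non-masked entries; masked entries get 0."""
    p = np.zeros_like(logits, dtype=float)
    keep = np.setdiff1d(np.arange(len(logits)), masked)
    z = logits[keep] / tau
    e = np.exp(z - z.max())  # subtract max for numerical stability
    p[keep] = e / e.sum()
    return p

def fedlmd_loss(student_logits, teacher_logits, target, majority, beta=1.0, tau=1.0):
    """L = CE(student, target) + beta * KL(masked teacher || masked student)."""
    # Standard cross-entropy of the unmasked student prediction.
    s = student_logits - student_logits.max()
    log_q = s - np.log(np.exp(s).sum())
    ce = -log_q[target]
    # Teacher masks the majority labels; student masks the target label.
    q_t = masked_softmax(teacher_logits, majority, tau)
    q_s = masked_softmax(student_logits, [target], tau)
    keep = q_t > 0
    kl = float(np.sum(q_t[keep] * (np.log(q_t[keep]) - np.log(q_s[keep]))))
    return ce + beta * kl
```

Because the target label is a majority label on its own client, the teacher's support (minority labels) is always contained in the student's support (non-target labels), so the KL term is well defined.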

4.3 Teacher-free Variant
In practical scenarios, clients may have limited storage space and computational resources. FedLMD introduces an extra model on each client, which undoubtedly increases the client's hardware overhead. Therefore, we consider dropping the teacher model to avoid this cost. The teacher model in distillation generally needs to be pretrained so that it can better provide knowledge to the student model. However, FL is an online learning process, i.e., the teacher model does not perform well in the beginning stage. Inspired by [48], we treat distillation as label smoothing (LS) regularization by introducing a fixed minority-label distribution to replace the output of the teacher model. Specifically, we replace $q'_g$ in Eq. 7 with a fixed vector $u_k$.

The fixed minority-label distribution for the $k$-th client is

$$u_{k,c} = \begin{cases} \dfrac{1}{C - C_k^{\mathrm{maj}}}, & c \notin \mathcal{Y}_k^{\mathrm{maj}}, \\[2mm] 0, & c \in \mathcal{Y}_k^{\mathrm{maj}}, \end{cases}$$

where $C$ and $C_k^{\mathrm{maj}}$ denote the total number of class labels and the number of majority labels of the $k$-th client, respectively. As this method does not require a teacher model, it is named FedLMD-Tf (Teacher-free). In this way, this lightweight version increases neither computation nor communication overhead nor privacy risk, and achieves much better performance by perceiving the label distributions. The detailed training procedure is shown in Alg. 1.
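For illustration, the fixed vector $u_k$ can be built directly from a client's majority-label set (a sketch; names are ours):

```python
import numpy as np

def minority_uniform(num_classes, majority):
    """Uniform distribution over minority labels: 1/(C - C_maj) on each
    minority label and 0 on the majority labels."""
    u = np.full(num_classes, 1.0 / (num_classes - len(majority)))
    u[list(majority)] = 0.0
    return u
```

Since this vector depends only on the client's own label counts, it can be computed once per client with no extra model, communication, or privacy cost.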
Implementation. For a fair comparison, we use a network model with two convolutional layers, each followed by max-pooling, and two fully-connected layers for all methods. The cross-entropy loss and the SGD optimizer are adopted. The learning rate is set to 0.01 and decays by a factor of 0.99 at each communication round. The weight decay is set to 1e-5 and the SGD momentum to 0.9. The batch size is set to 50. For data augmentation, we employ random cropping, random horizontal flipping, and normalization. Note that our default experimental dataset is CIFAR-10 (α = 0.05) unless specified otherwise.
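The per-round learning-rate schedule described above (a base rate of 0.01 decayed by a factor of 0.99 each communication round) can be sketched as:

```python
def round_lr(round_idx, base_lr=0.01, decay=0.99):
    """Learning rate at a given communication round: base_lr * decay**round."""
    return base_lr * decay ** round_idx
```

In a framework such as PyTorch this corresponds to an exponential learning-rate scheduler stepped once per communication round rather than once per local epoch.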
For the FL task, we set some additional hyperparameters following the settings of previous studies. In all experiments, we conduct a grid search over the parameters of each method to determine the optimal performance. After each communication round, we evaluate the global model on the test dataset and report the best test accuracy.

Improvement with Knowledge Distillation
In this subsection, we utilize and improve KD to alleviate the model's bias toward majority labels under the label heterogeneity case.
First of all, we briefly compare the change in FedAvg accuracy after applying distillation; the results are shown in Fig. 4. We find that KD can help FedAvg alleviate the label heterogeneity problem. However, traditional KD treats all labels in the same way, which limits the effectiveness of dark knowledge transfer. Therefore, FedNTD selects only non-target labels for distillation, and our proposed FedLMD goes one step further by masking out the majority labels in the output of the teacher model, i.e., selecting minority labels for distillation. From Tab. 1, we find that FedLMD has a significant advantage.

Moreover, the teacher is often assumed to be a well-pretrained model in KD, but the global model performs poorly at the beginning stage of FL. As shown in Fig. 4, even the poor global model, used as a teacher, still improves student performance at the beginning of FL. This observation inspired us to discard the teacher model and replace it with a fixed distribution vector acting as an unreliable teacher. Further, to better understand the effectiveness of our teacher-free distillation, we apply LS (with smoothing factor 0.1) to FedAvg, which is similar to teacher-free distillation. In addition, for fairness, we also modify FedNTD into a teacher-free version, called FedNTD-Tf, for comparison. From Tab. 2, we find that LS and FedNTD-Tf can alleviate the bias of FL, and when teacher-free distillation is applied to minority labels, FedLMD-Tf is further enhanced by focusing more on preserving minority-label knowledge.

Results on Label Heterogeneity
In this subsection, we compare FedLMD and FedLMD-Tf with the previous FL methods comprehensively.
Accuracy and Convergence Speed. We show the results of our experiments with two different data partition strategies in Tab. 3. The effectiveness of the proposed approach is illustrated across different datasets and degrees of label heterogeneity. In particular, on CIFAR-10 (α = 0.05), it achieves up to a 17% improvement over FedAvg. FedLMD outperforms previous results in the vast majority of cases, and the performance improvement becomes more pronounced as the degree of label heterogeneity increases (i.e., as the partition parameter decreases). Even in the few cases where its results are not the best, they are still very close to the SOTA baselines. Moreover, we measure the communication rounds required by different methods to reach the top-1 test accuracy of FedAvg, which serves as the evaluation metric for convergence speed [20]. As shown in Tab. 3, FedLMD clearly converges faster than the other methods; specifically, on the MNIST dataset, it achieves a 2.47x speedup over FedAvg. We also compare the training processes of different methods on CIFAR-10, evaluating test accuracy under three scenarios: α = 0.05, 0.3, and 0.5. As illustrated in Fig. 6, FedLMD exhibits greater training stability than the other methods in each scenario.
Comparison with Lightweight Baselines. When FL is deployed on low-power devices, client-side computational costs have to be considered, so lightweight FL methods are valuable in such situations. Here, we compare the performance of FedLMD-Tf with some previous lightweight approaches on the CIFAR-10 dataset. As shown in Fig. 5, FedLMD-Tf consistently outperforms the other methods in various cases without increasing computational costs. Note additionally that the size of the vector predefined by FedLMD-Tf on each client depends on the number of class labels $C$ in the FL system.

Discussion
Model Architecture. We verify the applicability of the approach with different network architectures and report the performance of several typical architectures on CIFAR-10 (α = 0.05). As shown in Tab. 4, FedLMD works well with these network architectures.

Local Epoch Number. We study the effect of the number of local training epochs $E$ on accuracy and report the results on the left of Fig. 7. For FedLMD, the enhancement is stable, with excellent performance, while FedLMD-Tf does not perform as well as expected. When $E = 10$, the performance of FedLMD-Tf starts to deteriorate (green line). This may indicate that the client-side model relies more on the teacher's performance as $E$ increases, and an unreliable teacher such as a fixed distribution vector limits the optimization of the client-side model when there are too many local epochs. It is worth pointing out that a larger $E$ also increases the computational overhead of the client, and not all scenarios benefit from a larger $E$ [28].
Number of Uploaded Clients. Another point worth discussing in FL is the number of uploaded clients per communication round. As shown on the right of Fig. 7, the optimal accuracy of each method increases with the number of participating clients. FedLMD can achieve good results without aggregating many models, which may be due to its ability to effectively preserve the knowledge of minority labels even when only a small number of clients are aggregated. FedLMD-Tf performs similarly to FedAvg when the number of uploaded clients is low. However, when the number of clients increases to 20, FedLMD-Tf surpasses SOTA baselines that require additional computing resources (such as FedNTD), and with 50 clients it is already quite close to the teacher version (FedLMD).
Combination with Other FL Methods. We consider combining FedLMD with other FL methods for further improvement. Here, we select two representative methods, FedProx [23] and FedAvgM [35].
FedProx constrains the optimization of the local model from the perspective of model parameters. FedAvgM incorporates a momentum parameter on top of FedAvg: it updates the global parameters by combining the previous global model parameters with the currently aggregated global parameters. As shown in Tab. 5, the combination of FedLMD and FedAvgM performs better than FedLMD alone on CIFAR-10 (α = 0.5), which indicates that the two can be applied simultaneously when the label heterogeneity is not very high. Because both FedLMD and FedProx are optimization methods with parameter constraints, their optimization trajectories may clash and compromise performance.

Switching from FedLMD-Tf to FedLMD. As stated in Sec. 4.3, the difference between FedLMD-Tf and FedLMD lies in whether the teacher is used. FedLMD-Tf is computationally efficient without a teacher, while FedLMD achieves high performance with the teacher. Since the global model is not a good teacher in the beginning, we consider performing FedLMD-Tf first and then switching to FedLMD later. Fig. 8 shows the performance on CIFAR-10 (α = 0.05) for different rounds of switching from FedLMD-Tf to FedLMD. When the switching round is 200, the method reduces to FedLMD-Tf, and when it is 0, it is FedLMD. According to Fig. 8, switching earlier improves performance, accompanied by an increase in computational cost. One can therefore select the switching round according to the actual training situation to balance performance and computation.

Hyperparameters Analysis
Fig. 9 shows the performance of the proposed approach under different hyperparameters. FedLMD achieves excellent performance in most cases, showing its robustness to the choice of hyperparameters. FedLMD-Tf, however, suffers severe performance degradation at higher values of the distillation weight β, mainly because an unreliable teacher constrains the optimization of the local model. For the temperature τ, a higher value leads to better performance of FedLMD-Tf, indicating that a smoother output of the local model is conducive to knowledge retention via teacher-free distillation.

CONCLUSION
In this paper, we propose FedLMD to solve the challenge of label distribution skew in data heterogeneity; it achieves effective and stable FL by retaining the knowledge of minority labels. It does not require additional parameters to be uploaded and thus carries no additional communication overhead or privacy risk. Our experimental results show that FedLMD is more effective than previous methods. Further, considering the limited computational resources on the client side, we improve it into a teacher-free version, which achieves excellent performance without additional computation.
In future work, we will focus on applying the approach in larger-scale application scenarios and on optimization solutions for other data-heterogeneity cases, such as labels that are rare across all clients.

Figure 1 :
Figure 1: The model trained on the private dataset of a client with partial class labels Y− is generally biased toward Y− due to missing knowledge over the complete class label set Y. Our FedLMD method alleviates this by utilizing the global model from the server to retain the knowledge of the minority labels Y \ Y−.

Figure 2 :
Figure 2: The label distribution of the training examples (Top), the prediction distribution of the FedAvg (Middle), and the prediction distribution of the FedLMD (Bottom) under different communication rounds.

Figure 3 :
Figure 3: The framework of our approach. In the aggregation process, the uploaded model weights $w_1, \ldots, w_{|\mathcal{K}|}$ are weighted-averaged to obtain the global weights $w_g$. For each client, the training loss is the combination of the cross-entropy loss L_CE for learning from local data and the label-masking distillation loss L_LMD for distilling from the global model.

Figure 5 :
Figure 5: Comparison of the accuracy (%) of the method without additional computational cost on two partition strategies Sharding (Left) and LDA (Right) of CIFAR-10.

Figure 7 :
Figure 7: The top-1 test accuracy (%) with different numbers of local epochs (Left) and the uploaded different number of client models (Right).

Algorithm 1 FedLMD and FedLMD-Tf. $T$ is the number of communication rounds, $E$ the local epochs, and $\eta$ the learning rate. Indices $k$ denote clients with local datasets $\mathcal{D}_k$; $w_g^t$ and $w_k^t$ are the global and $k$-th client model weights at round $t$; $\mathcal{K}$ is the set of selected clients per round.

Table 3 :
The top-1 test accuracy (%) on MNIST, CIFAR-10, CIFAR-100, and CINIC-10. The values in parentheses are the speedup of each approach computed against FedAvg. If "Failed" is displayed in the parentheses, the method did not converge.

Table 5 :
The top-1 test accuracy (%) under the combination of FedLMD and other federated learning methods.