Federated Few-shot Learning

Federated Learning (FL) enables multiple clients to collaboratively learn a machine learning model without exchanging their local data. In this way, the server can exploit the computational power of all clients and train the model on the larger pool of data samples distributed across them. Although such a mechanism has proven effective in various fields, existing works generally assume that each client holds sufficient data for training. In practice, however, certain clients may only contain a limited number of samples (i.e., few-shot samples). For example, a user who has just started using a new mobile device may have taken relatively few photos. In this scenario, existing FL efforts typically encounter a significant performance drop on these clients. Therefore, it is important to develop a few-shot model that can generalize to clients with limited data under the FL scenario. In this paper, we refer to this novel problem as federated few-shot learning. The problem remains challenging for two major reasons: the global data variance among clients (i.e., the difference in data distributions among clients) and the local data insufficiency in each client (i.e., the lack of adequate local data for training). To overcome these two challenges, we propose a novel federated few-shot learning framework with two separately updated models and dedicated training strategies to reduce the adverse impact of global data variance and local data insufficiency. Extensive experiments on four prevalent datasets covering news articles and images validate the effectiveness of our framework compared with state-of-the-art baselines. Our code is provided at https://github.com/SongW-SW/F2L.


INTRODUCTION
The volume of valuable data is growing massively with the rapid development of mobile devices [4,34]. Recently, researchers have developed various machine learning methods [5,58,62] to analyze and extract useful information from such large-scale real-world data. Among these methods, Federated Learning (FL) is an effective solution, which aims to collaboratively optimize a centralized model over data distributed across a large number of clients [7,13,22,63]. In particular, FL trains a global model on a server by aggregating the local models learned on each client [2]. Moreover, by avoiding the direct exchange of private data, FL can provide effective protection of local data privacy for clients [31]. As an example, in Google Photo Categorization [12,33], the server aims to learn an image classification model from photos distributed among a large number of clients, i.e., mobile devices. In this case, FL can effectively conduct learning tasks without revealing private photos to the server.
In fact, new learning tasks (e.g., novel photo classes) are constantly emerging over time [51,60]. In consequence, FL can easily encounter a situation where the server needs to solve a new task with only limited available data as the reference. In the previous example of Google Photo Categorization, as illustrated in Fig. 1, the server may inevitably need to deal with novel photo classes such as the latest electronic products, for which only limited annotations are available. Nevertheless, existing FL works generally assume sufficient labeled samples for model training, which inevitably leads to unsatisfactory classification performance on new tasks with limited labeled samples [14]. Therefore, to improve the practicality of FL in realistic scenarios, it is important to learn an FL model that can achieve satisfactory performance on new tasks with limited samples. In this paper, we refer to this novel problem setting as federated few-shot learning.
Recently, many few-shot learning frameworks [15,45,53,56] have been proposed to deal with new tasks with limited samples. Typically, the main idea is to learn meta-knowledge from base classes with abundant samples (e.g., photo classes such as portraits). Such meta-knowledge is then generalized to novel classes with limited samples (e.g., photo classes such as new electronic products), where novel classes are typically disjoint from base classes. However, as illustrated in Fig. 1, it remains challenging to conduct few-shot learning under the federated setting for the following reasons. First, due to the global data variance (i.e., the differences in data distributions across clients), the aggregation of local models on the server side will disrupt the learning of meta-knowledge in each client [23]. Generally, the meta-knowledge is locally learned from different classes in each client and is thus distinct among clients, especially under the non-IID scenario, where the data variance among clients can be even larger than in the IID scenario.
Since the server will aggregate the local models from different clients and then send back the aggregated model, the learning of meta-knowledge in each client will potentially be disrupted. Second, due to the local data insufficiency in clients, it is non-trivial to learn meta-knowledge from each client. In FL, each client only preserves a relatively small portion of the total data [1,13]. However, meta-knowledge is generally learned from data in a variety of classes [15,53]. As a result, it is difficult to learn meta-knowledge from data with less variety, especially in the non-IID scenario, where each client only has a limited number of classes.
To effectively address the aforementioned challenges, we propose a novel Federated Few-shot Learning framework, named F2L. First, we propose a decoupled meta-learning framework to mitigate the disruption from the aggregated model on the server. Specifically, the proposed framework retains a unique client-model for each client to learn meta-knowledge and a shared server-model to learn client-invariant knowledge (e.g., the representations of samples), as illustrated in Fig. 2. The client-model in each client is updated locally and is not shared across clients, while the server-model is updated across clients and sent to the server for aggregation. Such a design decouples the learning of meta-knowledge (via the client-model) from the learning of client-invariant knowledge (via the server-model). In this way, we can mitigate the disruption from the aggregated model on the server caused by global data variance among clients. Second, to compensate for local data insufficiency in each client, we propose to leverage global knowledge learned from all clients with two dedicated update strategies. In particular, we first transfer the learned meta-knowledge in the client-model to the server-model by maximizing the mutual information between their outputs (i.e., local-to-global knowledge transfer). Then we propose a partial knowledge distillation strategy for each client to selectively extract useful knowledge from the server-model (i.e., global-to-local knowledge distillation). In this manner, each client can leverage the beneficial knowledge in other clients to learn meta-knowledge from more data. In summary, our contributions are as follows:

• Problem. We investigate the challenges of learning meta-knowledge in the novel problem of federated few-shot learning from the perspectives of global data variance and local data insufficiency. We also discuss the necessity of tackling these challenges.

PRELIMINARIES

Problem Definition
In FL, we are given a set of $M$ clients, i.e., $\{\mathcal{C}^{(i)}\}_{i=1}^{M}$, where $M$ is the number of clients, and each client $\mathcal{C}^{(i)}$ owns a local dataset $\mathcal{D}^{(i)}$. The main objective of FL is to learn a global model over the data across all clients (i.e., $\{\mathcal{D}^{(i)}\}_{i=1}^{M}$) without the direct exchange of data among clients. Following the conventional FL strategy [13,32,36], a server $\mathcal{S}$ aggregates the locally learned models from all clients into a global model.
Under the prevalent few-shot learning scenario, we consider a supervised setting in which the data samples for client $\mathcal{C}^{(i)}$ come from its local dataset: $(x, y) \in \mathcal{D}^{(i)}$, where $x$ is a data sample and $y$ is the corresponding label. We first denote the entire set of classes over all clients as $\mathcal{C}$. Depending on the number of labeled samples in each class, $\mathcal{C}$ can be divided into two categories: base classes $\mathcal{C}_b$ and novel classes $\mathcal{C}_n$, where $\mathcal{C} = \mathcal{C}_b \cup \mathcal{C}_n$ and $\mathcal{C}_b \cap \mathcal{C}_n = \emptyset$. In general, the number of labeled samples in $\mathcal{C}_b$ is sufficient, while it is small in $\mathcal{C}_n$ [15,45]. Correspondingly, each local dataset can be divided into a base dataset $\mathcal{D}_b^{(i)}$ and a novel dataset $\mathcal{D}_n^{(i)}$. More specifically, if the support set $\mathcal{S}$ consists of exactly $K$ labeled samples for each of $N$ classes from $\mathcal{D}_n^{(i)}$, and the query set $\mathcal{Q}$ is sampled from the same $N$ classes, the problem is defined as Federated $N$-way $K$-shot Learning. Essentially, the objective of federated few-shot learning is to learn a globally shared model across clients that can be quickly adapted to data samples in $\mathcal{D}_n^{(i)}$ with only limited labeled samples. Therefore, the crucial part is to effectively learn meta-knowledge from the base datasets $\{\mathcal{D}_b^{(i)}\}_{i=1}^{M}$ in all clients. Such meta-knowledge is generalizable to novel classes unseen during training and thus can be utilized to classify data samples in each $\mathcal{D}_n^{(i)}$, which contains only limited labeled samples.

Episodic Learning
In practice, we adopt the prevalent episodic learning framework for model training and evaluation, which has proven effective in various few-shot learning scenarios [9,10,41,53,57]. Specifically, model evaluation (i.e., meta-test) is conducted on a certain number of meta-test tasks, where each task contains a small number of labeled samples as references and unlabeled samples for classification. During meta-training, each task is constructed from the local base dataset. We randomly select $K$ samples from each of $N$ classes (i.e., $N$-way $K$-shot) to establish the support set $\mathcal{S}$. Similarly, the query set $\mathcal{Q}$ consists of $Q$ different samples per class (distinct from $\mathcal{S}$) from the same $N$ classes. The components of a meta-training task $\mathcal{T}$ are formulated as

$\mathcal{T} = \{\mathcal{S}, \mathcal{Q}\}, \quad \mathcal{S} = \{(x_i, y_i)\}_{i=1}^{N \times K}, \quad \mathcal{Q} = \{(x'_i, y'_i)\}_{i=1}^{N \times Q},$

where $x_i$ (or $x'_i$) is a data sample from the sampled $N$ classes, and $y_i$ (or $y'_i$) is the corresponding label. Note that during meta-test, each meta-test task shares a similar structure with meta-training tasks, except that the samples are drawn from the local novel dataset $\mathcal{D}_n^{(i)}$.
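As a concrete illustration of this episodic protocol, the following minimal sketch samples one $N$-way $K$-shot task from a labeled dataset. The per-class query count `q_query` and the list-of-pairs dataset format are assumptions for illustration, not the paper's implementation.

```python
import random
from collections import defaultdict

def sample_episode(dataset, n_way=5, k_shot=1, q_query=5, rng=random):
    """Sample one N-way K-shot episode T = {S, Q} from a labeled dataset.

    `dataset` is a list of (x, y) pairs; classes with fewer than
    k_shot + q_query samples are skipped.
    """
    by_class = defaultdict(list)
    for x, y in dataset:
        by_class[y].append(x)
    eligible = [c for c, xs in by_class.items() if len(xs) >= k_shot + q_query]
    classes = rng.sample(eligible, n_way)

    support, query = [], []
    for c in classes:
        xs = rng.sample(by_class[c], k_shot + q_query)
        support += [(x, c) for x in xs[:k_shot]]   # K labeled references per class
        query += [(x, c) for x in xs[k_shot:]]     # Q samples per class to classify
    return support, query
```

During meta-training the episode would be drawn from a client's base dataset; during meta-test, from its novel dataset.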

METHODOLOGY
In this part, we introduce the overall design of our proposed framework F2L in detail. Specifically, we formulate the federated few-shot learning problem under the prevailing $N$-way $K$-shot learning framework. Our target in conducting federated few-shot learning is to learn meta-knowledge from a set of $M$ clients $\{\mathcal{C}^{(i)}\}_{i=1}^{M}$ with different data distributions, and to generalize such meta-knowledge to meta-test tasks. Nevertheless, it remains difficult to conduct federated few-shot learning due to the challenging issues of global data variance and local data insufficiency mentioned before. Therefore, as illustrated in Fig. 2, we propose a decoupled meta-learning framework to mitigate the disruption from the server. We further propose two update strategies to leverage global knowledge. The overall process is presented in Fig. 3.

Decoupled Meta-Learning Framework

Federated Learning Framework. We consider a server-model, which consists of an encoder $E_s$ and a classifier $F_s$ that are shared among clients. We denote the overall model parameters of the server-model as $\Theta$. Specifically, $E_s: \mathbb{R}^{d_x} \rightarrow \mathbb{R}^{d}$ is a function that maps each sample into a low-dimensional vector $\mathbf{h}^s \in \mathbb{R}^{d}$, where $d_x$ is the input feature dimension and $d$ is the dimension of the learned representations. Taking the representation $\mathbf{h}^s$ as input, the classifier $F_s$ maps each $\mathbf{h}^s$ to the label space of base classes $\mathcal{C}_b$ and outputs the prediction $\mathbf{p}^s \in \mathbb{R}^{|\mathcal{C}_b|}$, where each element of $\mathbf{p}^s$ denotes the classification probability for the corresponding class in $\mathcal{C}_b$.
Following the prevalent FedAvg [36] strategy for FL, the training of the server-model is conducted on all clients over $T$ rounds. In each round $t$, the server $\mathcal{S}$ first sends the server-model parameters $\Theta_t$ to all clients, and each client conducts a local meta-training process on $B$ randomly sampled meta-training tasks. Then the server $\mathcal{S}$ performs aggregation on the parameters received from the clients:

$\Theta_{t+1} = \frac{1}{M} \sum_{i=1}^{M} \Theta_t^{(i)}, \qquad (2)$

where $\Theta_t^{(i)}$ denotes the server-model parameters locally updated by client $\mathcal{C}^{(i)}$ in round $t$, and $\Theta_{t+1}$ denotes the aggregated server-model parameters, which will be distributed to the clients at the beginning of the next round. In this way, the server can learn a shared model for all clients in a federated manner.
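The aggregation step amounts to an element-wise average of the parameters returned by the clients. A minimal sketch, assuming parameters are exchanged as plain dicts of float lists rather than framework tensors:

```python
def fedavg(client_params):
    """Average server-model parameters received from all clients.

    `client_params` is a list of per-client parameter dicts
    {name: list of floats}; returns the element-wise mean, i.e. the
    aggregated parameters broadcast at the start of the next round.
    """
    m = len(client_params)
    return {
        name: [sum(p[name][i] for p in client_params) / m
               for i in range(len(client_params[0][name]))]
        for name in client_params[0]
    }
```

A practical variant weights each client by its local data size, as in the original FedAvg paper; the uniform average above matches the plain parameter averaging described here.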
Although the standard strategy of learning a single shared model for all clients achieves decent performance on general FL tasks [13,36], it can be suboptimal for federated few-shot learning. Due to the global data variance among clients, the aggregated model on the server will disrupt the learning of meta-knowledge in each client [23]. As a result, the local learning of meta-knowledge in clients becomes more difficult. In contrast, we propose to further introduce a client-model, which is uniquely learned and preserved by each client, to locally learn meta-knowledge. In other words, its model parameters are never sent back to the server for aggregation. In this manner, we can separate the learning of the client-model (meta-knowledge) from that of the server-model (client-invariant knowledge) so that the learning of meta-knowledge is not disrupted.
Specifically, for client $\mathcal{C}^{(i)}$, the client-model also consists of an encoder $E_c^{(i)}$ and a classifier $F_c^{(i)}$. We denote the overall model parameters of the client-model for client $\mathcal{C}^{(i)}$ as $\theta^{(i)}$. In particular, the encoder $E_c^{(i)}$ takes the representation $\mathbf{h}^s$ learned by the encoder $E_s$ in the server-model as input and outputs a hidden representation $\mathbf{h}^c \in \mathbb{R}^{d'}$. Such a design ensures that the client-model encoder $E_c^{(i)}$ does not need to process the raw sample and can thus be a small model, which is important when clients only possess limited computational resources [6]. Then the classifier $F_c^{(i)}$ maps $\mathbf{h}^c$ to predictions $\mathbf{p}^c \in \mathbb{R}^{N}$ over the $N$ classes.

Meta-Learning Process. In particular, for client $\mathcal{C}^{(i)}$ in round $t = 1, 2, \ldots, T$ and step $j = 1, 2, \ldots, B$, we denote the sampled meta-task as $\mathcal{T}_{t,j} = \{\mathcal{S}_{t,j}, \mathcal{Q}_{t,j}\}$. To learn meta-knowledge from meta-task $\mathcal{T}_{t,j}$, we adopt the prevalent MAML [15] strategy to update the client-model in one fine-tuning step and one meta-update step. We first fine-tune the client-model to quickly adapt it to the support set $\mathcal{S}_{t,j}$:

$\theta'_{t,j} = \theta_{t,j} - \alpha \nabla_{\theta} \mathcal{L}_{ft}(\mathcal{S}_{t,j}; \theta_{t,j}, \Theta_{t,j}),$

where $\mathcal{L}_{ft}$ is the fine-tuning loss, i.e., the cross-entropy loss computed on the support set $\mathcal{S}_{t,j}$. Here, $\alpha$ is the learning rate, and $\theta_{t,j}$ (or $\Theta_{t,j}$) denotes the parameters of the client-model (or server-model) in round $t$ and step $j$. Then we update the client-model based on the query set $\mathcal{Q}_{t,j}$:

$\theta_{t,j+1} = \theta_{t,j} - \beta_c \nabla_{\theta} \mathcal{L}_c(\mathcal{Q}_{t,j}; \theta'_{t,j}, \Theta_{t,j}),$

where $\mathcal{L}_c$ is the loss for the client-model on the query set $\mathcal{Q}_{t,j}$, and $\beta_c$ is the meta-learning rate for $\theta$. In this regard, we can update the client-model with our global-to-local knowledge distillation strategy. For the update of the server-model, we conduct one update step based on the support set and the parameters of the client-model:

$\Theta_{t,j+1} = \Theta_{t,j} - \beta_s \nabla_{\Theta} \mathcal{L}_s(\mathcal{S}_{t,j}; \theta'_{t,j}, \Theta_{t,j}),$

where $\mathcal{L}_s$ is the loss for the server-model, and $\beta_s$ is the meta-learning rate for $\Theta$. In this manner, we can update the server-model with our local-to-global knowledge transfer strategy. After repeating the above updates for $B$ steps, the final parameters of the server-model $\Theta_{t,B}$ are used as $\Theta_t^{(i)}$ in Eq. (2) and sent back to the server for aggregation, while the client-model (with parameters $\theta_{t,B}$) is kept locally. By doing this, we can decouple the learning of local meta-knowledge in the client-model from the learning of client-invariant knowledge in the server-model, avoiding disruption from the server.
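The per-round local update schedule above can be sketched as follows. This is a structural illustration only: the gradient callables stand in for automatic differentiation of the actual losses, and parameters are scalars for simplicity.

```python
def local_meta_training(theta, Theta, tasks, grad_ft, grad_c, grad_s,
                        alpha=0.01, beta_c=0.01, beta_s=0.01):
    """One round of local meta-training on one client (structure only).

    theta: client-model parameters (kept local); Theta: server-model
    parameters (returned for aggregation). grad_ft / grad_c / grad_s are
    callables returning gradients of the fine-tuning, client, and server
    losses; in a real implementation these come from autodiff.
    """
    for support, query in tasks:
        # fine-tune: fast-adapt the client-model to the support set
        theta_prime = theta - alpha * grad_ft(support, theta, Theta)
        # meta-update of the client-model on the query set
        theta = theta - beta_c * grad_c(query, theta_prime, Theta)
        # one server-model step on the support set (local-to-global transfer)
        Theta = Theta - beta_s * grad_s(support, theta_prime, Theta)
    # Theta goes back to the server for aggregation; theta never leaves the client
    return theta, Theta
```

With toy quadratic losses, the loop drives each set of parameters toward its own optimum while the two updates remain decoupled.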

Local-to-Global Knowledge Transfer
With our decoupled meta-learning framework, we can mitigate the disruption to the learning of local meta-knowledge in each client. Nevertheless, we still need to transfer the learned meta-knowledge to the server-model (i.e., local-to-global knowledge transfer), so that it can be further leveraged by other clients to handle the local data insufficiency issue. Specifically, to effectively transfer local meta-knowledge, we propose to maximize the mutual information between the representations learned by the server-model encoder $E_s$ and the client-model encoder $E_c$. In this way, the server-model can maximally absorb the information in the learned local meta-knowledge.

Mutual Information Maximization.
Given a meta-training task $\mathcal{T} = \{\mathcal{S}, \mathcal{Q}\}$, as described in Sec. 3.1, the server-model encoder $E_s$ and the client-model encoder $E_c$ output $\mathbf{h}^s$ and $\mathbf{h}^c$ for each sample, respectively. By stacking the learned representations of the samples in the support set $\mathcal{S}$ ($|\mathcal{S}| = NK$), we obtain the representations of the support samples learned by the server-model, i.e., $\mathbf{H}^s \in \mathbb{R}^{NK \times d}$, and by the client-model, i.e., $\mathbf{H}^c \in \mathbb{R}^{NK \times d'}$. For simplicity, we omit the annotations of round $t$, step $j$, and client $i$.
The objective of maximizing the mutual information between $\mathbf{H}^s$ and $\mathbf{H}^c$ can be formally represented as $\max_{\Theta} I(\mathbf{H}^s; \mathbf{H}^c)$, where $\mathbf{h}_i^s$ (or $\mathbf{h}_i^c$) is the $i$-th row of $\mathbf{H}^s$ (or $\mathbf{H}^c$). Since the mutual information $I(\mathbf{H}^s; \mathbf{H}^c)$ is difficult to obtain and thus infeasible to maximize directly [40], we rewrite it in a more tractable form, where $p_i(y_j) \in \mathbb{R}$ denotes the classification probability of the $i$-th sample for class $y_j$ ($y_j = y_i$ when $j \in I(i)$, with $I(i)$ denoting the indices of support samples sharing the class of the $i$-th sample). The second term in Eq. (8) is also a constant and thus can be ignored in the objective.

Estimation of $p(\mathbf{h})$
Combining the above equations, the optimal server-model parameters $\Theta^*$ for the final optimization objective (i.e., $\max_{\Theta} I(\mathbf{H}^s; \mathbf{H}^c)$) can be obtained as $\Theta^* = \arg\min_{\Theta} \mathcal{L}_t$, where we exchange the order of summation over $i$ and $j$ for clarity. It is noteworthy that $\mathcal{L}_t$ differs from the InfoNCE loss [18,40], which contrasts different augmentations of samples, whereas $\mathcal{L}_t$ focuses on the classes of the samples in $\mathcal{S}$. Moreover, $\mathcal{L}_t$ also differs from the supervised contrastive loss [25], which combines various augmentations of samples with label information. In contrast, our loss targets transferring the meta-knowledge by maximally preserving the mutual information between the representations learned by the server-model and the client-model. More distinctively, the term $p_i(y_j) / \sum_{k \in I(j)} p_k(y_j)$ acts as an adjustable weight that measures the importance of a sample to its class. Combining the objective described in Eq. (13) with the standard cross-entropy loss, we obtain the final loss for the server-model:

$\mathcal{L}_s = \mathcal{L}_{CE}(\mathcal{S}) + \lambda_t \mathcal{L}_t,$

where $\mathcal{L}_{CE}(\mathcal{S})$ is defined as

$\mathcal{L}_{CE}(\mathcal{S}) = -\sum_{i=1}^{NK} \sum_{j=1}^{|\mathcal{C}_b|} y_{i,j} \log p_i(y_j),$

where $p_i(y_j) \in \mathbb{R}$ denotes the classification probability of the $i$-th support sample belonging to the $j$-th class $y_j$ in $\mathcal{C}_b$, computed by the server-model. Here $y_{i,j} = 1$ if the $i$-th support sample belongs to $y_j$, and $y_{i,j} = 0$ otherwise. Moreover, $\lambda_t \in [0, 1]$ is an adjustable hyper-parameter that controls the weight of $\mathcal{L}_t$.
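To make the shape of the server-model objective concrete, the sketch below combines a support-set cross-entropy with a class-level contrastive term that pulls each server representation toward client representations of the same class. The exact weighted form of $\mathcal{L}_t$ is not fully recoverable from the text, so a standard supervised InfoNCE-style surrogate is substituted here; all function and argument names are illustrative.

```python
import math

def server_loss(h_s, h_c, labels, probs, lam_t=0.5):
    """A sketch of the combined server-model loss L_s = L_CE + lam_t * L_t.

    h_s / h_c: lists of representation vectors from the server- and
    client-model encoders for the support samples; labels: class index per
    sample; probs: server-model class probabilities per sample. The L_t term
    is an InfoNCE-style estimate over same-class (positive) pairs; the
    paper's per-sample importance weighting is not reproduced.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    n = len(h_s)
    # cross-entropy on the support set
    l_ce = -sum(math.log(probs[i][labels[i]]) for i in range(n)) / n
    # contrastive transfer term: positives are same-class server/client pairs
    l_t = 0.0
    for i in range(n):
        denom = sum(math.exp(dot(h_s[i], h_c[j])) for j in range(n))
        pos = [j for j in range(n) if labels[j] == labels[i]]
        num = sum(math.exp(dot(h_s[i], h_c[j])) for j in pos) / len(pos)
        l_t += -math.log(num / denom)
    return l_ce + lam_t * l_t / n
```

When server and client representations of the same class are aligned, the contrastive term (and hence the loss) is smaller, which is the direction the transfer objective pushes the server-model.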

Global-to-Local Knowledge Distillation
With the learned meta-knowledge in each client transferred from the client-model to the server-model, other clients can leverage such meta-knowledge to deal with the local data insufficiency issue. However, since each meta-task only contains $N$ classes, directly extracting meta-knowledge from the server-model can inevitably involve meta-knowledge from other classes, which can be harmful to the learning of local meta-knowledge from these $N$ classes in each client. Instead, we propose a partial knowledge distillation strategy to selectively extract useful knowledge from the server-model, i.e., global-to-local knowledge distillation.

Partial Knowledge Distillation

Specifically, we focus on the output classification probabilities of the server-model for the $N$ classes in the support set $\mathcal{S}$ while ignoring the other classes. In this regard, we can extract the information that is crucial for learning local meta-knowledge from these $N$ classes and also reduce the irrelevant information from other classes.
Particularly, we consider the same meta-task $\mathcal{T} = \{\mathcal{S}, \mathcal{Q}\}$. We denote the output probabilities for the $i$-th query sample $x'_i$ in $\mathcal{Q}$ (with label $y_i$) from the server-model and the client-model as $\mathbf{p}_i^s \in \mathbb{R}^{|\mathcal{C}_b|}$ and $\mathbf{p}_i^c \in \mathbb{R}^{N}$, respectively. It is noteworthy that the $N$ classes in this meta-task, denoted as $\mathcal{C}_T$, are sampled from the base classes $\mathcal{C}_b$ (i.e., $|\mathcal{C}_T| = N$ and $\mathcal{C}_T \subset \mathcal{C}_b$). Therefore, the output of the server-model (i.e., $\mathbf{p}_i^s$) includes the probabilities of the classes in $\mathcal{C}_T$. In particular, we enforce the probabilities of the classes in $\mathcal{C}_T$ from the client-model to be consistent with the probabilities of the same classes from the server-model. As a result, the learning of local meta-knowledge can leverage the information of data in the same $N$ classes from other clients, which is encoded in the server-model. In this regard, we can handle the local data insufficiency issue by involving information from other clients while reducing the irrelevant information from classes not in $\mathcal{C}_T$. In particular, by utilizing the output of the server-model as the soft target for the client-model, we achieve the following objective:

$\mathcal{L}_{kd} = \sum_{j=1}^{N} q_i^s(y_j) \log \frac{q_i^s(y_j)}{q_i^c(y_j)},$

where $y_j$ is the $j$-th class in $\mathcal{C}_T$ (i.e., the $N$ classes in meta-task $\mathcal{T}$), and $q_i^s(y_j)$ and $q_i^c(y_j)$ are the knowledge distillation values for $y_j$ from the server-model and the client-model, respectively. Specifically, the values of $q_i^s(y_j)$ and $q_i^c(y_j)$ are obtained via softmax normalization:

$q_i^s(y_j) = \frac{\exp(z_i^s(y_j)/\tau_i)}{\sum_{k=1}^{N} \exp(z_i^s(y_k)/\tau_i)}, \qquad q_i^c(y_j) = \frac{\exp(z_i^c(y_j)/\tau_i)}{\sum_{k=1}^{N} \exp(z_i^c(y_k)/\tau_i)},$

where $z_i^s(y_j)$ and $z_i^c(y_j)$ are the logits (i.e., outputs before softmax normalization) of class $y_j$ from the server-model and the client-model, respectively, and $\tau_i$ is the temperature parameter for the $i$-th query sample. In this way, we ensure that $\sum_{j=1}^{N} q_i^s(y_j) = \sum_{j=1}^{N} q_i^c(y_j) = 1$.

Adaptive Temperature Parameter. Generally, a larger value of $\tau_i$ means that the client-model focuses more on extracting information from the other classes in $\mathcal{C}_T$ [19] (i.e., $\{y \mid y \in \mathcal{C}_T, y \neq y_i\}$), denoted as negative classes. Since the classification results of the server-model can be erroneous, we should adaptively adjust the value of $\tau_i$ for each meta-task to reduce the adverse impact of extracting misleading information from the server-model. However, although negative classes can carry useful information for classification, such information is generally noisier when the output probabilities of these negative classes are smaller. Therefore, to estimate the importance degree of each negative class, we consider the maximum output logit over the negative classes to reduce potential noise. Particularly, if the probability of a negative class from the server-model is significantly larger than those of the other classes, we can conjecture that this class is similar to $y_i$ and thus potentially contains crucial information for distinguishing them. Specifically, the temperature parameter $\tau_i$ for the $i$-th query sample $x'_i$ is computed as

$\tau_i = \sigma\!\left(\frac{\max_{y \in \mathcal{C}_T, y \neq y_i} p_i^s(y)}{p_i^s(y_i)}\right),$

where $\sigma(\cdot)$ denotes the Sigmoid function, and $y_i$ is the label of $x'_i$. In this way, the temperature parameter $\tau_i$ increases when the ratio between the largest probability among the negative classes and the probability of $y_i$ is larger. As a result, the client-model will focus more on the negative-class information. Then, by further incorporating the cross-entropy loss on the query set $\mathcal{Q}$, we obtain the final loss for the client-model:

$\mathcal{L}_c = \mathcal{L}_{CE}(\mathcal{Q}) + \lambda_d \mathcal{L}_{kd},$

where $\mathcal{L}_{CE}(\mathcal{Q})$ is defined as

$\mathcal{L}_{CE}(\mathcal{Q}) = -\sum_{i} \sum_{j=1}^{N} y_{i,j} \log p_i^c(y_j),$

where $p_i^c(y_j)$ is the probability of the $i$-th query sample belonging to class $y_j$ computed by the client-model, $y_{i,j} = 1$ if the $i$-th query sample belongs to $y_j$, and $y_{i,j} = 0$ otherwise. Moreover, $\lambda_d \in [0, 1]$ is an adjustable hyper-parameter that controls the weight of $\mathcal{L}_{kd}$. In this manner, the client-model can selectively learn useful knowledge from both the local and global perspectives, i.e., global-to-local knowledge distillation.
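The partial distillation step for a single query sample can be sketched as below. The restriction of the server softmax to the $N$ task classes and the sigmoid-of-ratio temperature follow our reading of the text; the exact formulas in the paper may differ.

```python
import math

def softmax(zs, tau=1.0):
    exps = [math.exp(z / tau) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

def partial_kd_loss(z_s, z_c, task_classes, label):
    """Partial knowledge distillation for one query sample (a sketch).

    z_s: server-model logits over all base classes; z_c: client-model logits
    over the N task classes; task_classes: indices of the N task classes
    within the base classes; label: position of the true class within
    task_classes. The server softmax is restricted to the N task classes
    ("partial"), and the temperature is the sigmoid of the ratio between the
    largest negative-class probability and the true-class probability.
    """
    z_s_task = [z_s[c] for c in task_classes]   # ignore classes outside the task
    p_s = softmax(z_s_task)
    neg_max = max(p for j, p in enumerate(p_s) if j != label)
    tau = 1.0 / (1.0 + math.exp(-neg_max / p_s[label]))   # sigmoid(ratio)
    q_s = softmax(z_s_task, tau)
    q_c = softmax(z_c, tau)
    # KL(q_s || q_c): the server output acts as the soft target
    return sum(qs * math.log(qs / qc) for qs, qc in zip(q_s, q_c))
```

The loss is zero exactly when the client-model reproduces the server-model's restricted distribution, and grows as the two diverge.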

Overall Learning Process
With the proposed losses $\mathcal{L}_s$ and $\mathcal{L}_c$, in each round we conduct meta-training on each client $\mathcal{C}^{(i)}$ by sampling $B$ meta-training tasks from the local base dataset $\mathcal{D}_b^{(i)}$. The detailed process is described in Algorithm 1. After $T$ rounds of meta-training on all clients, we obtain a model that accommodates comprehensive meta-knowledge for federated few-shot learning. For the meta-test phase, since we have aggregated the learned local meta-knowledge from each client into the server-model, we can leverage the server-model to generate data representations for classification. Specifically, during evaluation, for each meta-test task $\mathcal{T} = \{\mathcal{S}, \mathcal{Q}\}$ sampled from the local novel datasets $\{\mathcal{D}_n^{(i)}\}_{i=1}^{M}$ across all clients, we follow the same process as in meta-training, including fine-tuning, except that the meta-update process is omitted. The output of the client-model is used for classification.

EXPERIMENTS
In this part, we conduct extensive experiments to evaluate our framework F2L on four few-shot classification datasets covering both news articles and images under the federated scenario.

Datasets
In this section, we introduce the four prevalent real-world datasets used in our experiments, covering both news articles and images: 20 Newsgroup [28], Huffpost [38,39], FC100 [41], and miniImageNet [53]. In particular, 20 Newsgroup and Huffpost are online news article datasets, while FC100 and miniImageNet are image datasets. The details are as follows:

• 20 Newsgroup [28] is a text dataset that consists of informal discourse from news discussion forums. There are 20 topic classes in total.

Experimental Settings
To validate the performance of our framework F2L, we conduct experiments with the following baselines for a fair comparison:

• Local. This baseline is non-distributed, i.e., we train an individual model for each client on its local data. The meta-test process is conducted on all meta-test tasks, and the averaged results over all models are reported.

• FL-MAML. This baseline leverages the MAML [15] strategy to perform meta-learning on each client. The updated model parameters are sent back to the server for aggregation.

• FL-Proto. This baseline uses ProtoNet [45] as the model on each client. The classification is based on the Euclidean distances between query samples and support samples.

• FedFSL [14]. This method combines MAML and an adversarial learning strategy [17,44] to construct a consistent feature space. The aggregation is based on FedAvg [36].
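For reference, the distance-based classification rule used by the FL-Proto baseline (per Prototypical Networks [45]) can be sketched in a few lines; embeddings are plain float lists here for illustration:

```python
from collections import defaultdict

def prototype_classify(support, query_x):
    """Assign a query embedding to the class whose prototype (the mean of
    its support embeddings) is nearest in squared Euclidean distance.

    support: list of (embedding, label) pairs; query_x: a query embedding.
    """
    sums, counts = {}, defaultdict(int)
    for emb, y in support:
        if y not in sums:
            sums[y] = list(emb)
        else:
            sums[y] = [a + b for a, b in zip(sums[y], emb)]
        counts[y] += 1
    protos = {y: [v / counts[y] for v in s] for y, s in sums.items()}

    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return min(protos, key=lambda y: sq_dist(protos[y], query_x))
```

In the actual baseline the embeddings come from the shared encoder; the nearest-prototype rule itself is the part shown here.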
During meta-training, we perform updates for the client-model and the server-model according to Algorithm 1. Finally, the server-model that achieves the best result on validation is used for meta-test. During meta-test, we evaluate the server-model on a series of 100 randomly sampled meta-test tasks drawn from the local novel datasets $\{\mathcal{D}_n^{(i)}\}_{i=1}^{M}$ across all clients. For consistency, the class split into $\mathcal{C}_b$ and $\mathcal{C}_n$ is identical for all baseline methods. The classification accuracy over these meta-test tasks is averaged as the final result. The specific parameter settings are provided in Appendix C.3. For the specific choices of the encoder and classifier in the server-model and the client-model (i.e., $E_s$, $F_s$, $E_c$, and $F_c$) and the model parameters, we provide further details in Appendix C.1. Note that for a fair comparison, we utilize the same encoder for all methods.

Overall Evaluation Results
We present the overall performance comparison of our framework and the baselines on federated few-shot learning in Table 1. Specifically, we conduct experiments under two few-shot settings: 5-way 1-shot and 5-way 5-shot. Moreover, to demonstrate the robustness of our framework under different data distributions, we partition the data in both IID and non-IID settings. For the IID partition, the samples of each class are uniformly distributed to all clients. For the non-IID partition, we follow the prevailing strategy [21,61] and distribute samples to all clients based on the Dirichlet distribution with its concentration parameter set to 1.0. The evaluation metric is the average classification accuracy over ten repetitions. From the overall results, we can make the following observations:

• Our framework F2L outperforms all other baselines on various news article and image datasets under different few-shot settings (1-shot and 5-shot) and data distributions (IID and non-IID). The results validate the effectiveness of our framework on federated few-shot learning.

• Conventional few-shot methods such as Prototypical Networks [45] and MAML [15] exhibit performance similar to the Local baseline. This result demonstrates that directly applying few-shot methods to federated learning brings less competitive improvements over local training, because such methods were not designed for federated learning and thus achieve unsatisfactory training performance under the federated setting.

• The performance of all methods degrades to different extents when the data distribution changes from IID to non-IID. The main reason is that the skewed variety of classes in each client results in a more complex class distribution and brings difficulties to the classification task. Nevertheless, by effectively transferring the meta-knowledge among clients, our framework is capable of alleviating this problem under the non-IID scenario.
• When increasing the value of $K$ (i.e., more support samples in each class), all methods achieve considerable performance gains. In particular, our framework F2L obtains better results than the other baselines, owing to our decoupled meta-learning framework, which promotes the learning of meta-knowledge from the support samples.
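The non-IID partition protocol described above (per-class Dirichlet allocation with concentration 1.0) can be sketched with the standard library alone, drawing Dirichlet proportions as normalized Gamma samples; the function name and return format are illustrative:

```python
import random

def dirichlet_partition(labels, num_clients, concentration=1.0, rng=random):
    """Partition sample indices across clients with a per-class Dirichlet
    distribution. Returns one index list per client; smaller concentration
    values yield more skewed (more non-IID) partitions.
    """
    clients = [[] for _ in range(num_clients)]
    for c in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == c]
        rng.shuffle(idx)
        # a Dirichlet draw is a normalized vector of Gamma(concentration, 1) draws
        draws = [rng.gammavariate(concentration, 1.0) for _ in range(num_clients)]
        total = sum(draws)
        props = [d / total for d in draws]
        # convert proportions to contiguous split points over this class's samples
        start = 0
        for k in range(num_clients):
            if k == num_clients - 1:
                end = len(idx)          # last client takes the remainder
            else:
                end = min(start + int(round(props[k] * len(idx))), len(idx))
            clients[k].extend(idx[start:end])
            start = end
    return clients
```

With concentration 1.0 (as in the experiments) the per-class proportions are uniform on the simplex, so each client receives a different random share of every class.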

Ablation Study
In this part, we conduct an ablation study on FC100 and Huffpost to validate the effectiveness of three crucial designs in F2L (similar results are observed on the other datasets). First, we remove the decoupled strategy so that the client-model is also sent to the server for aggregation. We refer to this variant as F2L\M. Second, we remove the local-to-global knowledge transfer module so that the meta-knowledge in the client-model is no longer transferred to the server-model. This variant is referred to as F2L\T. Third, we eliminate the global-to-local knowledge distillation loss. In this way, the client-model cannot leverage the global knowledge in the server-model for learning meta-knowledge. We refer to this variant as F2L\A. The overall ablation study results are presented in Fig. 4. From the results, we observe that F2L outperforms all variants, which verifies the effectiveness of the three designs in F2L. Specifically, removing the local-to-global knowledge transfer leads to significant performance degradation. This result demonstrates that such a design can effectively aggregate the learned meta-knowledge among clients and thus bring performance improvements. More significantly, without our decoupled strategy, the performance deteriorates rapidly when federated few-shot learning is conducted in the non-IID scenario. This phenomenon verifies the importance of mitigating the disruption from the server in the presence of complex data distributions among clients.

Regarding the hyper-parameters, we can observe that the performance generally increases with a larger value of $\lambda_t$, while decreasing as $\lambda_t$ approaches 1. The results indicate the importance of transferring the learned local meta-knowledge, while also demonstrating that the cross-entropy loss is necessary. On the other hand, the performance first increases and then degrades as $\lambda_d$ grows. That is, although partial knowledge distillation enables each client to benefit from the global data, a larger $\lambda_d$ can potentially introduce more irrelevant information into the learning of local meta-knowledge. From the results, we can also observe that all methods encounter a performance drop in the presence of more clients. Nevertheless, our framework F2L can reduce the adverse impact of more clients by effectively leveraging the global knowledge learned from all clients. In consequence, the performance degradation is less significant for F2L.

RELATED WORK

Few-shot Learning
The objective of Few-shot Learning (FSL) is to learn transferable meta-knowledge from tasks with abundant information and to generalize such knowledge to novel tasks that contain only limited labeled samples [9,11,46,50,57]. Existing few-shot learning works can be divided into two categories: metric-based methods and optimization-based methods. Metric-based methods aim to learn generalizable metric functions that classify query samples by matching them with support samples [35,47,55]. For instance, Prototypical Networks [45] learn a prototype representation for each class and make predictions based on the Euclidean distances between query samples and the prototypes. Relation Networks [47] learn relation scores for classification in a non-linear manner. On the other hand, optimization-based approaches generally optimize model parameters based on gradients calculated from few-shot samples [24,37,42,54]. As an example, MAML [15] optimizes model parameters based on gradients on support samples to achieve fast generalization. In addition, the LSTM-based meta-learner [42] adjusts the step size to adaptively update parameters during meta-training.

Federated Learning
Federated Learning (FL) enables multiple clients to collaboratively train a model without exchanging their local data explicitly [16,22,30,48,59,64]. As a classic example, FedAvg [36] performs stochastic gradient descent (SGD) on each client to update the model parameters and sends them to the server. The server averages the received model parameters to obtain a global model for the next round. FedProx [32] incorporates a proximal term into the local update of each client to reduce the distance between the global model and the local model. To deal with the non-IID problem in FL, recent works also focus on personalization [1,3,13,49]. For instance, FedMeta [6] incorporates MAML [15] into the local update process of each client for personalization. FedRep [7] learns shared representations among clients. Moreover, FedFSL [14] proposes to combine MAML and an adversarial learning strategy [17,44] to learn a consistent feature space.
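The FedAvg aggregation step described above can be sketched as follows. This is a minimal illustration in which model parameters are plain numpy arrays keyed by name; the weighting by local dataset size follows the FedAvg formulation [36]:

```python
import numpy as np

def fedavg(client_params, client_sizes):
    """Server-side FedAvg: average the clients' parameter dicts,
    weighted by each client's local dataset size."""
    total = float(sum(client_sizes))
    return {
        k: sum(n * p[k] for p, n in zip(client_params, client_sizes)) / total
        for k in client_params[0]
    }
```

Each round, every client runs local SGD on its private data, sends its parameters to the server, and receives the averaged model back; no raw data ever leaves a client.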

CONCLUSION
In this paper, we study the problem of federated few-shot learning, which aims at learning a federated model that can achieve satisfactory performance on new tasks with limited labeled samples. Nevertheless, it remains difficult to perform federated few-shot learning due to two challenges: global data variance and local data insufficiency. To tackle these challenges, we propose a novel federated few-shot learning framework F2L. In particular, we handle global data variance by decoupling the learning of local meta-knowledge.
Then we leverage the global knowledge that is learned from all clients to tackle the local data insufficiency issue. We conduct extensive experiments on four prevalent few-shot learning datasets under the federated setting, covering both news articles and images. The experimental results further validate the superiority of our framework F2L over other state-of-the-art baselines.

A NOTATIONS
In this section, we provide details for the notations used in this paper and their corresponding descriptions.

(Algorithm 1, excerpt: for each client C(k) in parallel, update the server-model on the support set S with Eq. (5) and Eq. (14); the server sends the averaged parameters of the server-model back to each client; end for.)

C REPRODUCIBILITY C.1 Model Details
In this section, we introduce the specific choices for the encoders and classifiers in both the server-model and the client-model.
C.1.1 Server-model Encoder. For the server-model encoder, we adopt different models for the news article datasets and the image datasets. In particular, for the news article datasets 20 Newsgroup and Huffpost, we leverage a biLSTM [20] with 50 units as the server-model encoder. For the image datasets FC100 and miniImageNet, following [43,51], we utilize a ResNet12 as the server-model encoder. Similar to [29], DropBlock is used as a regularizer. The number of filters is set as (64, 160, 320, 640).
C.1.2 Client-model Encoder. Considering that the client-model is required to process the entire support set in a meta-task for learning local meta-knowledge, we further utilize a set-invariant function that takes a set of samples as input while capturing the correlations among these samples. In practice, we leverage the Transformer [52] as the client-model encoder to process the entire support set.
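As a simplified illustration of how attention lets each support sample's representation depend on the entire set, here is a single-head self-attention pass in numpy. The single head and the explicit weight matrices are simplifying assumptions; the actual client-model encoder is a full Transformer [52]:

```python
import numpy as np

def softmax_rows(x):
    """Row-wise softmax."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def support_self_attention(support, Wq, Wk, Wv):
    """Single-head self-attention over a support set of shape (N*K, d):
    each output row is a weighted mix of every support sample, so the
    representation of one sample can reflect the whole set."""
    Q, K, V = support @ Wq, support @ Wk, support @ Wv
    attn = softmax_rows(Q @ K.T / np.sqrt(K.shape[1]))  # (N*K, N*K)
    return attn @ V
```

This set-level mixing is what allows the client-model to capture correlations among support samples rather than encoding each sample independently.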

C.2 Baseline Settings
In this section, we provide further details on the implementation of the baselines in our experiments.
• Local. For this baseline, an individual model is trained for each client over the local data. Specifically, we use the same encoder architecture as in our framework to learn sample representations.
• FL-MAML. For this baseline, we leverage the MAML [15] strategy and set the meta-learning rate as 0.001 and the fine-tuning rate as 0.01. The encoders are the same as in our framework.
• FL-Proto. For this baseline, we follow the setting in ProtoNet [45] with the same encoders as in our framework. The learning rate is set as 0.001.
• FedFSL [14]. For this baseline, which combines MAML and an adversarial learning strategy [17,44], we follow the settings in the public code and set the learning rate as 0.001. The adaptation step size is set as 0.01.
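The FL-MAML baseline above uses a meta-learning rate of 0.001 and a fine-tuning rate of 0.01. A first-order MAML step with these rates can be sketched with abstract gradient callables; since the model and loss are not specified in this excerpt, both are left abstract:

```python
def maml_step(w, grad_support, grad_query, inner_lr=0.01, meta_lr=0.001):
    """First-order MAML: adapt the parameters on the support set with the
    fine-tuning rate, then update the initialization using the query-set
    gradient evaluated at the adapted point."""
    w_adapted = w - inner_lr * grad_support(w)   # inner-loop adaptation
    return w - meta_lr * grad_query(w_adapted)   # outer (meta) update
```

For instance, with a toy quadratic loss L(w) = w^2/2 (so the gradient is w itself), starting from w = 1.0 the adapted point is 0.99 and the meta-updated initialization is 1 - 0.001 * 0.99 = 0.99901.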

C.3 Parameter Settings
For our framework F2L, we set the number of clients as 10. The number of training steps in each client is set as 10, and the number of training rounds is set as 200. Moreover, the two meta-learning rates are both set as 0.001 with a dropout rate of 0.1.
The fine-tuning learning rate is set as 0.01. We leverage the Adam [26] optimizer with the weight decay rate set as 10^-4. During meta-test, we randomly sample 100 meta-test tasks from the novel classes with a query set size |Q| of 5. To preserve consistency for fair comparisons, we keep identical meta-test tasks for all baselines. The two loss weights are both set as 0.5. The default value of the remaining hyperparameter is set as 10.

Figure 1: The two challenges of federated few-shot learning as an example in Google Photo Categorization: local data insufficiency and global data variance.
Each client C(k) preserves a base dataset D(k)_base = {(x, y) ∈ D(k) : y ∈ C_base} and a novel dataset D(k)_novel = {(x, y) ∈ D(k) : y ∈ C_novel}. In the few-shot setting, the evaluation of the model's generalizability to the novel classes C_novel is conducted on D(k)_novel, which contains only limited labeled samples. The data samples in D(k)_base will be used for training. Then we can formulate the studied problem of federated few-shot learning as follows: Definition 1. Federated Few-shot Learning: Given a set of clients {C(k)} and a server S, federated few-shot learning aims to learn a global model after aggregating the model parameters locally learned from D(k)_base in each client, such that the model can accurately predict labels for unlabeled samples (i.e., the query set Q) in D(k)_novel with only a limited number of labeled samples (i.e., the support set S).
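Definition 1 relies on a disjoint base/novel class split and on meta-tasks built from support and query sets. A minimal sketch of constructing an N-way K-shot meta-task from one client's local data could look like the following; the function names and the uniform sampling are our own assumptions:

```python
import random

def split_classes(classes, n_novel, seed=0):
    """Disjoint base/novel class split, as in Definition 1."""
    rng = random.Random(seed)
    shuffled = classes[:]
    rng.shuffle(shuffled)
    return shuffled[n_novel:], shuffled[:n_novel]   # (base, novel)

def sample_episode(data_by_class, classes, n_way, k_shot, q_query, seed=0):
    """One N-way K-shot meta-task: a support set and a disjoint query set
    drawn from n_way classes of this client's data."""
    rng = random.Random(seed)
    episode_classes = rng.sample(list(classes), n_way)
    support, query = [], []
    for c in episode_classes:
        samples = rng.sample(data_by_class[c], k_shot + q_query)
        support += [(x, c) for x in samples[:k_shot]]
        query += [(x, c) for x in samples[k_shot:]]
    return support, query
```

Meta-training episodes would draw from the base classes and meta-test episodes from the novel classes, keeping the episodic structure consistent between the two phases.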

Figure 2: The illustration of our decoupled meta-learning framework. The client-model is locally kept in each client, while the server-model is aggregated and sent to the server.

Figure 3: An illustration of the overall process of our framework F2L. Specifically, each client receives the server-model from the server at the beginning of each round. To perform one step of local update, each client first samples a meta-task (2-way 2-shot in the illustration), which consists of a support set and a query set, from the local data. Then the server-model and the client-model both compute outputs for the support samples and the query samples. After that, the server-model and the client-model are updated via mutual information maximization and knowledge distillation, respectively. Finally, the server-model is sent back to the server for aggregation, while the client-model is locally preserved by each client.

In the case of j ∈ N(i), which means the i-th and the j-th samples share the same class, p(h^c_j | h^s_i; θ) can be considered as the confidence of the client-model regarding the class of the i-th sample. Therefore, it should reflect the degree to which the sample representation h^s_i is relevant to its class. Utilizing the client-model classification output (i.e., the normalized class probabilities) for the i-th sample, p^c_i ∈ R^N, we can compute p(h^c_j | h^s_i; θ) as follows:

Figure 4: Ablation study of our framework on FC100 and Huffpost. I-K (or N-K) denotes the setting of 5-way K-shot under IID (or non-IID) distributions. M denotes the decoupled framework, T denotes the local-to-global knowledge transfer, and A denotes the global-to-local knowledge distillation.

Figure 5: The results with different values of the two loss weights on Huffpost under the non-IID setting.

Effect of Client Number. In this section, we study the robustness of our framework under scenarios with a varying number of clients. In particular, we keep the total training data unchanged, which means that as more clients participate in the training process, each client preserves fewer training samples. As a result, the training performance will inevitably be reduced. Specifically, we partition the total training data into 1, 2, 5, 10, 20, and 50 clients, where a single client corresponds to completely centralized training. The results on FC100 under the 1-shot and 5-shot settings are presented in Fig. 6 (we observe similar results on the other datasets and omit them for brevity).
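The fixed-total-data partitioning used in this experiment can be sketched as a simple round-robin shard. This is an illustrative IID split; the paper's non-IID partitions would distribute classes unevenly:

```python
def partition_clients(samples, n_clients):
    """Evenly shard a fixed pool of training samples across n_clients,
    so that more clients means fewer samples per client."""
    shards = [[] for _ in range(n_clients)]
    for i, s in enumerate(samples):
        shards[i % n_clients].append(s)
    return shards
```

Because the pool size is fixed, increasing n_clients from 1 to 50 shrinks each client's local dataset, which is exactly the stress factor this experiment varies.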

(Algorithm 1, excerpt: update the client-model on Q with Eq. (4); each client sends the updated parameters of the server-model to the server.)

where h^s_i (or h^c_i) denotes the representation of the i-th sample in S learned by the server-model encoder (or the client-model encoder). With the Transformer, the representations learned by the client-model can effectively capture the correlations among samples in the entire support set S for learning meta-knowledge.

C.1.3 Server-model Classifier and Client-model Classifier. Both classifiers are implemented as a fully-connected layer, where the output size is |C_base| for the server-model classifier and N for the client-model classifier, as described in Sec. 3.1.
We conduct experiments on four few-shot classification datasets covering both news articles and images under the federated scenario.The results further demonstrate the superiority of our proposed framework.
That being said, the class set of samples in meta-training tasks is a subset of C_base, while the class set of samples in meta-test tasks is a subset of C_novel, which is disjoint from C_base. The main idea of federated few-shot learning is to preserve the consistency between meta-training and meta-test so that the model can learn meta-knowledge from clients for better generalization performance on the novel classes C_novel.
Since the client-model is fine-tuned on the support set S of the meta-task T, we can leverage the classification results of the client-model to estimate p(h^c_j | h^s_i; θ). We denote N(i) as the set of sample indices in the support set S that share the same class as the i-th sample (including itself), i.e., N(i) ≡ {j : y_j = y_i}. We first set p(h^c_j | h^s_i; θ) = 0 for all j ∉ N(i), since we assume the client-model can only infer representations from the same class. Although we could similarly leverage the classification results of the server-model, such a strategy lacks generalizability, because the server-model aims at classifying all base classes instead of the N classes in each meta-training task. We instead propose to estimate p(h^c_j | h^s_i; θ) from the client-model output. Moreover, if we further apply ℓ2 normalization to both h^s_i and h^c_j, we can obtain ∥h^s_i − h^c_j∥_2^2 / 2 = 1 − h^s_i · h^c_j.
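The ℓ2-normalization identity above (for unit vectors, half the squared Euclidean distance equals one minus the inner product) can be checked numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=8)
b = rng.normal(size=8)
# Apply l2 normalization so both vectors lie on the unit sphere.
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

# ||a - b||^2 / 2 == 1 - a.b  whenever ||a|| = ||b|| = 1,
# since ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b = 2 - 2 a.b.
lhs = np.dot(a - b, a - b) / 2
rhs = 1 - np.dot(a, b)
```

This is why, after normalization, minimizing Euclidean distance between the two models' representations is equivalent to maximizing their inner-product similarity.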

Table 2: Notations used in this paper.

Algorithm 1: Detailed training process of our framework F2L. Input: a set of federated clients; a local update objective for the server-model; a local update objective for the client-model; the number of training rounds; the number of local training steps. Output: a trained server-model and a unique client-model for each client. 1: for each training round do