Navigating Alignment for Non-identical Client Class Sets: A Label Name-Anchored Federated Learning Framework

Traditional federated classification methods, even those designed for non-IID clients, assume that each client annotates its local data with respect to the same universal class set. In this paper, we focus on a more general yet practical setting, non-identical client class sets, where clients focus on their own (different or even non-overlapping) class sets and seek a global model that works for the union of these classes. If one views classification as finding the best match between representations produced by data/label encoders, such heterogeneity in client class sets poses a significant new challenge: local encoders at different clients may operate in different and even independent latent spaces, making aggregation at the server difficult. We propose a novel framework, FedAlign, to align the latent spaces across clients from both the label and data perspectives. From the label perspective, we leverage the expressive natural-language class names as a common ground for label encoders to anchor class representations and guide data encoder learning across clients. From the data perspective, during local training, we regard the global class representations as anchors and leverage the data points that are sufficiently close to or far from the anchors of locally-unaware classes to align the data encoders across clients. Our theoretical analysis of the generalization performance and extensive experiments on four real-world datasets of different tasks confirm that FedAlign outperforms various state-of-the-art (non-IID) federated classification methods.


INTRODUCTION
Federated learning [31] has emerged as a distributed learning paradigm that allows multiple parties to collaboratively learn a global model effective for all participants while preserving the privacy of their local data. It brings benefits to various domains, such as recommendation systems [24,27,49], ubiquitous sensing [17,18,40], and mobile computing [16,19,48].
Existing federated classification methods [10,20,21,25,28,45,46,53] typically assume that the local annotations at each client follow the same set of classes; however, this assumption does not hold true in many real-world applications. For example, a smartwatch company wants to build a human activity classifier for all activity types, as shown in Figure 1(a). Although their smartwatch users as clients could experience almost all types of daily activities, each user may only opt to report (i.e., annotate) a subset of activities. Another example is a federated medical diagnosis system, which attempts to infer all types of diseases of a patient for comprehensive health screening. Physicians and specialist groups with different expertise can participate in this federated learning system as clients. As one can see here, different specialists will only offer disease annotations within their domains, even if a patient may have several types of diseases at the same time. This makes the class sets at many clients non-identical and even non-overlapping.
We aim to lift this assumption and work on a general and rather practical federated learning setting, non-identical client class sets, where clients focus on their own (different or even non-overlapping) class sets and seek a global classification model that works for the union of these classes. We denote the classes that are not covered in the local annotations as locally-unaware classes. Note that each client can have local data whose true labels are among the locally-unaware classes. Also, the classification task here can be either single-label or multi-label. When it is multi-label, the local data might be only partially labeled due to the locally-unaware classes. Therefore, this new setting is more general and challenging than the missing-class scenario [23], which assumes a single-label setting and that no local data comes from locally-unaware classes.
The non-identical client class sets pose a significant challenge: huge variance in local training across different clients. As shown in Figure 1(b), one can view classification as a matching process between data representations and label representations in a latent space. Because of the non-identical client class sets, locally trained classifiers are more likely to operate in drastically different latent spaces. Moreover, when the class sets are non-overlapping, it is possible that the latent spaces at different clients are completely independent. This would result in inaccurate classification boundaries after aggregation at the server, making our setting more challenging than non-IID clients with identical client class sets.
We propose a novel federated learning framework, FedAlign, as shown in Figure 2, to align the latent spaces across clients from both label and data perspectives as follows: (1) Anchor the label representations using label names. We observe that the natural-language class names (i.e., label names) often carry valuable information for understanding label semantics, and more importantly, they are typically safe to share with all parties. Therefore, we break the classification model into a data encoder and a label encoder as shown in Figure 2, and then leverage the label names as the common ground for label encoders. The server initializes the label encoder with pretrained text representations, such as word embeddings. The label encoder will then be distributed to different clients and updated alternately with the data encoders during local training and global aggregation, mutually regulating the latent space.
Let $\mathcal{D}_k = \{(x_i, y_i)\}$ denote the local dataset at client $k$, where $x_i$ is the input data and $y_i$ its annotation over the locally-identified class set $\mathcal{C}_k$. It is possible that some data samples $x_i \in \mathcal{D}_k$ do not belong to any of the classes in $\mathcal{C}_k$, i.e., $\forall c \in \mathcal{C}_k: y_{i,c} = 0$.

Backbone Classification Model. Let $\mathcal{Z} \subset \mathbb{R}^d$ be the latent feature space and $\mathcal{Y}$ be the output space. Generally, the classification model $F$ can be decomposed into a data encoder $f: \mathcal{X} \to \mathcal{Z}$ parameterized by $\theta$ and a linear layer (i.e., classifier) $h: \mathcal{Z} \to \mathcal{Y}$ parameterized by $\mu$. The data encoder $f$ generates representations for input data. Then, the classifier $h$ transforms the representations into prediction logits. Given an input $x_i$, the predicted probability given by $F$ is $F(x_i; \theta, \mu) = \sigma(h(f(x_i; \theta); \mu))$, where $\sigma$ is the activation function. We discuss two types of classification tasks as follows.

Single-Label Multi-Class Classification. In this setting, each sample is associated with only one positive class. In other words, the classes are mutually exclusive. We use the softmax activation to get the predicted probability. The class with the maximum probability is predicted as the positive class. Let $F(x_i; \theta, \mu)_c$ denote the predicted probability of $x_i$ belonging to class $c$. During training, the cross-entropy loss is used as the loss function:
$$\ell(x_i, y_i) = -\sum_{c \in \mathcal{C}} y_{i,c} \log F(x_i; \theta, \mu)_c. \quad (1)$$

Multi-Label Classification. In this setting, each sample may be associated with a set of positive classes. For example, a person may have both diabetes and hypertension. The sigmoid activation is applied to get the predicted probability. Each element in the predicted probability represents the probability that the input data $x_i$ is associated with a specific class. The final predictions are obtained by thresholding the probabilities at 0.5. The binary cross-entropy loss is used as the loss function:
$$\ell(x_i, y_i) = -\sum_{c \in \mathcal{C}} \big[ y_{i,c} \log F(x_i; \theta, \mu)_c + (1 - y_{i,c}) \log \big(1 - F(x_i; \theta, \mu)_c\big) \big]. \quad (2)$$

Federated Learning. Consider a federated learning system with $K$ clients. The server coordinates the $K$ clients to update the model in $T$ communication rounds. The learning objective is to minimize the loss on every client, i.e.,
$$\min_{\theta, \mu} \frac{1}{K} \sum_{k \in [K]} \mathcal{L}_k(\theta, \mu). \quad (3)$$
At each round, the server sends the model parameters to a subset of clients and lets them optimize the model by minimizing the loss over their local datasets. The loss at client $k$ is:
$$\mathcal{L}_k(\theta, \mu) = \frac{1}{|\mathcal{D}_k|} \sum_{(x_i, y_i) \in \mathcal{D}_k} \ell(x_i, y_i).$$
At the end of each round, the server aggregates the model parameters received from clients, usually by taking the average.
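To make the backbone decomposition and the FedAvg-style averaging concrete, here is a minimal numpy sketch. The linear data encoder, the linear classifier, and the two-client setup are illustrative stand-ins for the paper's actual architecture, not its implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Backbone: data encoder f (here a linear map theta) followed by classifier h (mu).
def predict(x, theta, mu):
    z = x @ theta           # f(x; theta): representation in Z
    logits = z @ mu         # h(z; mu): prediction logits
    return softmax(logits)  # sigma: softmax for single-label classification

def cross_entropy(p, y):
    # y is a one-hot matrix, as in Equation (1)
    return -np.mean(np.sum(y * np.log(p + 1e-12), axis=-1))

# Server-side aggregation: average the parameters returned by clients.
def aggregate(params):
    return [np.mean(np.stack(ps), axis=0) for ps in zip(*params)]

# toy round: two clients hold their own locally trained parameters
theta_a, mu_a = rng.normal(size=(4, 3)), rng.normal(size=(3, 2))
theta_b, mu_b = rng.normal(size=(4, 3)), rng.normal(size=(3, 2))
theta_g, mu_g = aggregate([(theta_a, mu_a), (theta_b, mu_b)])

x = rng.normal(size=(5, 4))
y = np.eye(2)[rng.integers(0, 2, 5)]
loss = cross_entropy(predict(x, theta_g, mu_g), y)
```

The sketch only shows the vanilla model of this section; Section 3 replaces the classifier $h$ with a label encoder.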

THE FEDALIGN FRAMEWORK

Overview
The pseudo-code of FedAlign can be found in Algorithm 1. Learning with the FedAlign framework consists of the following steps: (1) Label name sharing and label encoder initialization: Before training, the server collects the natural-language label names from the clients. The server initializes the label encoder's parameters $\phi_0$ via pretrained text representations, such as word embeddings. We expect more advanced techniques like pretrained neural language models could make the learning converge even faster, but we leave this as future work.

Label Name-Anchored Matching
The vanilla model described in Section 2 learns feature spaces merely based on local training data with numerical label IDs. However, with non-identical client class sets, local models at different clients are likely to form different and even independent feature spaces, making the classification boundaries aggregated at the server inaccurate. To better align the feature spaces, we leverage the semantics of label names as a common reference to anchor class representations. The natural-language label names carry valuable information for understanding label correlations. For example, in behavioral context recognition, the activity of "lying down" is likely to indicate the person is "sleeping", and the possible location of the activity is "at home". Such knowledge about label correlations not only exists in the datasets under investigation, but can also be mined by analyzing the semantics of label names.
Incorporating a Label Encoder into the Classification Model. We replace the classifier in a conventional classification model with a label encoder, as shown in Figure 2. Let $\mathcal{W}$ be the set of natural-language label names with respect to $\mathcal{C}$, and $\mathcal{Z}$ be the latent feature space. The new classification model $F = f \cdot g$ consists of two branches: a data encoder $f: \mathcal{X} \to \mathcal{Z}$ parameterized by $\theta$ and a label encoder $g: \mathcal{W} \to \mathcal{Z}$ parameterized by $\phi$, where $\cdot$ denotes the dot product. The label encoder takes the label names $w_c \in \mathcal{W}$ as inputs and maps them into representations $g(w_c; \phi)$.
Prior knowledge about label semantics can be inserted into the label encoder by initializing it with pretrained label embeddings.
Inspired by existing works that learn semantic word embeddings based on word-word co-occurrence [2] and point-wise mutual information (PMI) [15,36], we use an external text corpus related to the domain of the classification task to extract knowledge of label co-occurrence and pretrain label embeddings for initializing the label encoder.The pretraining details can be found in Appendix.
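As a rough illustration of this kind of pretraining, the sketch below derives label embeddings from a toy label co-occurrence matrix via positive PMI followed by an SVD, a common recipe in the cited line of work. The corpus statistics, counting scheme, and dimensionality here are made up; the paper's actual pipeline is specified in its Appendix:

```python
import numpy as np

def pmi_label_embeddings(cooc, eps=1e-12):
    """Embed labels from a co-occurrence count matrix: compute positive
    PMI against the label marginals, then factorize it with an SVD so
    that dot products of embeddings recover the PPMI associations."""
    cooc = np.asarray(cooc, dtype=float)
    total = cooc.sum()
    p_joint = cooc / total                     # joint label probabilities
    p_marg = cooc.sum(axis=1) / total          # marginal label probabilities
    pmi = np.log((p_joint + eps) / (np.outer(p_marg, p_marg) + eps))
    ppmi = np.maximum(pmi, 0.0)                # keep only positive associations
    u, s, _ = np.linalg.svd(ppmi)
    return u * np.sqrt(s)                      # rows = pretrained label embeddings

# toy statistics: "lying down" co-occurs with "sleeping", rarely with "running"
cooc = np.array([[10., 8., 1.],
                 [ 8., 10., 1.],
                 [ 1., 1., 10.]])
emb = pmi_label_embeddings(cooc)
```

Labels that frequently co-occur end up with more similar embeddings, which is exactly the prior knowledge used to initialize the label encoder.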
Representation Matching. Given an input $x_i$, the model uses the data encoder to generate its representation $f(x_i; \theta)$. Then, it takes the dot product of the data representation and every class representation, which measures the similarity between the input data and each class. An activation function $\sigma$ is applied to the dot products to obtain the predicted probabilities of $x_i$:
$$F(x_i; \theta, \phi)_c = \sigma\big(f(x_i; \theta) \cdot g(w_c; \phi)\big).$$
The choice of activation function is the same as defined in Section 2.

Alternating Encoder Training.
With the new model design, we rewrite the learning objective in Equation 3 as:
$$\min_{\theta, \phi} \frac{1}{K} \sum_{k \in [K]} \mathcal{L}_k(\theta, \phi).$$
The two encoders are two branches of the model. We want the representations obtained by one encoder to regulate the training of the other while preventing mutual interference. Therefore, at each local update step, we first fix the parameters of the label encoder and update the data encoder. Then, we fix the data encoder and update the label encoder. Let $\theta_{t,s}$ and $\phi_{t,s}$ be the parameters of the local data encoder and label encoder at the $s$-th update step of the $t$-th round, and let $\eta$ be the learning rate. The parameters are updated as:
$$\theta_{t,s+1} = \theta_{t,s} - \eta \nabla_{\theta} \mathcal{L}_k(\theta_{t,s}, \phi_{t,s}),$$
$$\phi_{t,s+1} = \phi_{t,s} - \eta \nabla_{\phi} \mathcal{L}_k(\theta_{t,s+1}, \phi_{t,s}).$$
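A toy numpy sketch of this alternating scheme follows, with linear maps standing in for the two encoders and analytic gradients for the softmax matching loss. The real framework uses neural encoders and automatic differentiation; everything below (data, dimensions, learning rate) is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def loss_and_grads(X, Y, L, theta, phi):
    """Cross-entropy of the matching model: probabilities come from dot
    products of data representations f(x) = X @ theta with class
    representations g(w) = L @ phi."""
    Z = X @ theta                       # data representations, n x d
    G = L @ phi                         # class representations, C x d
    P = softmax(Z @ G.T)                # matching probabilities, n x C
    loss = -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))
    dS = (P - Y) / X.shape[0]           # gradient w.r.t. the score matrix
    return loss, X.T @ dS @ G, L.T @ dS.T @ Z

# toy separable data: 3 classes, one-hot labels, random label-name features L
means = rng.normal(size=(3, 5)) * 2.0
y = rng.integers(0, 3, size=64)
X = means[y] + 0.1 * rng.normal(size=(64, 5))
Y = np.eye(3)[y]
L = rng.normal(size=(3, 4))
theta = 0.1 * rng.normal(size=(5, 8))
phi = 0.1 * rng.normal(size=(4, 8))
eta, losses = 0.1, []
for _ in range(200):
    # (i) freeze the label encoder, update the data encoder
    loss, g_theta, _ = loss_and_grads(X, Y, L, theta, phi)
    theta -= eta * g_theta
    # (ii) freeze the data encoder, update the label encoder
    _, _, g_phi = loss_and_grads(X, Y, L, theta, phi)
    phi -= eta * g_phi
    losses.append(loss)
```

Each encoder is updated while the other is held fixed, mirroring the two update equations above.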

Anchor-Guided Alignment for Locally-Unaware Classes

Due to the lack of label information for certain classes to support supervision, the training at each client is biased toward the identified classes [28,51]. To mitigate this bias, each client uses the latest global class representations as anchors and computes the distance between the representation of every local sample and the anchor of each locally-unaware class.
Then, the client annotates samples for the locally-unaware classes $\bar{\mathcal{C}}_k$ based on the distances. Samples with the closest distances to the class anchor $g(w_c; \phi^{(t)})$ are annotated as positive samples of class $c$. Similarly, samples that are farthest from $g(w_c; \phi^{(t)})$ are annotated as negative samples of $c$. The number of samples to be annotated depends on the percentile of distances. We define two thresholds, $\tau_p^{(t)}$ and $\tau_n^{(t)}$, as the $q_1$-th and $q_2$-th percentiles of the distances over all samples, for annotating positive and negative samples respectively. The client annotates the samples whose distances are less than $\tau_p^{(t)}$ as positive samples (i.e., $\tilde{y}_{i,c}^{(t)} = 1$) and those with distances greater than $\tau_n^{(t)}$ as negative samples (i.e., $\tilde{y}_{i,c}^{(t)} = 0$). Figure 3(a) shows an example of selecting positive samples for two classes. The pseudo-annotated samples after the $t$-th round form the dataset for alignment, $\mathcal{D}_k'^{(t)}$. For single-label classification, we add another constraint: a sample whose true label is not in $\mathcal{C}_k$ is annotated as a positive sample of class $c \in \bar{\mathcal{C}}_k$ only if $c$ is the closest to it among all classes.

Alignment. The annotations for unaware classes are then used to guide the alignment at client $k$. We add an additional loss term to the local learning objective. The loss over $\mathcal{D}_k'^{(t)}$ is:
$$\mathcal{L}_k'^{(t)}(\theta, \phi) = \frac{1}{|\mathcal{D}_k'^{(t)}|} \sum_{(x_i, \tilde{y}_i^{(t)}) \in \mathcal{D}_k'^{(t)}} \ell'(x_i, \tilde{y}_i^{(t)}),$$
where $\ell'$ represents the loss function with the same choice as defined in Equations 1 and 2. A slight difference is that $\ell'$ here is summed over $\bar{\mathcal{C}}_k$. Finally, the local learning objective is to jointly minimize Equations 5 and 10, i.e., $\min_{\theta, \phi} [\mathcal{L}_k(\theta, \phi) + \mathcal{L}_k'^{(t)}(\theta, \phi)]$.
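The percentile-based selection can be sketched as follows for a single unaware class. The percentile values and distances below are made up for illustration; the paper picks its thresholds per Section 3.3:

```python
import numpy as np

def annotate_unaware(dists, q_pos, q_neg):
    """Pseudo-annotate samples for one locally-unaware class from their
    distances to the class anchor: distances below the q_pos-th percentile
    become positives, distances above the q_neg-th percentile become
    negatives, and everything in between stays unlabeled (-1)."""
    dists = np.asarray(dists, dtype=float)
    tau_pos = np.percentile(dists, q_pos)   # threshold for positives
    tau_neg = np.percentile(dists, q_neg)   # threshold for negatives
    labels = np.full(dists.shape, -1)       # -1 = not used for alignment
    labels[dists <= tau_pos] = 1
    labels[dists >= tau_neg] = 0
    return labels

# toy distances of 8 local samples to one class anchor
dists = np.array([0.05, 0.9, 1.1, 2.0, 2.5, 3.0, 0.1, 2.8])
labels = annotate_unaware(dists, q_pos=10.0, q_neg=80.0)
```

Only the extremes receive pseudo-labels, matching the intuition that samples very close to (or very far from) an anchor are annotated with the highest confidence.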

ANALYSIS ON GENERALIZATION BOUND
In this section, we perform an analysis of the generalization performance of the aggregated model in federated learning.
Denote $\mathcal{D}$ as the global distribution on the input space $\mathcal{X}$, and $\tilde{\mathcal{D}}$ as the induced global distribution over the feature space $\mathcal{Z}$. Similarly, for the $k$-th client, denote $\mathcal{D}_k$ as the local distribution and $\tilde{\mathcal{D}}_k$ as the induced image of $\mathcal{D}_k$ over $\mathcal{Z}$. We review a typical theoretical upper bound for the generalization of the global hypothesis [25,35,53]:

Theorem 4.1 (Generalization Bound of Federated Learning). Assume there are $K$ clients in a federated learning system. Let $\mathcal{H}$ be the hypothesis class with VC-dimension $d$. The global hypothesis is the aggregation of the $h_k$, i.e., $h = \frac{1}{K} \sum_{k \in [K]} h_k$. Let $\mathcal{L}_{\mathcal{D}}(h)$ denote the expected risk of $h$. With probability at least $1 - \delta$, for $\forall h \in \mathcal{H}$:
$$\mathcal{L}_{\mathcal{D}}(h) \le \frac{1}{K} \sum_{k \in [K]} \left( \hat{\mathcal{L}}_k(h_k) + \frac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(\tilde{\mathcal{D}}_k, \tilde{\mathcal{D}}) + \lambda_k \right) + \sqrt{\frac{4}{m} \left( d \log \frac{2em}{d} + \log \frac{4}{\delta} \right)},$$
where $\hat{\mathcal{L}}_k(h_k)$ is the empirical risk on the $k$-th client given $m$ observed samples, $d_{\mathcal{H}\Delta\mathcal{H}}(\cdot, \cdot)$ is the $\mathcal{A}$-distance that measures the divergence between two distributions based on the symmetric difference with respect to $\mathcal{H}$, $\lambda_k$ is the risk of the optimal hypothesis over $\mathcal{H}$ with respect to $\tilde{\mathcal{D}}$ and $\tilde{\mathcal{D}}_k$, and $e$ is the base of the natural logarithm.
where $\alpha \in [0, 1]$ is the weight of the original distribution, which is determined by the number of empirical samples added. Let $\mathcal{H}$ be the hypothesis class with VC-dimension $d$. The global hypothesis is the aggregation of the $h_k$, i.e., $h = \frac{1}{K} \sum_{k \in [K]} h_k$. With probability at least $1 - \delta$, for $\forall h \in \mathcal{H}$:
$$\mathcal{L}_{\mathcal{D}}(h) \le \frac{1}{K} \sum_{k \in [K]} \left( \hat{\mathcal{L}}_k(h_k) + \frac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(\tilde{\mathcal{D}}_k^*, \tilde{\mathcal{D}}) + \lambda_k \right) + \sqrt{\frac{4}{m^*} \left( d \log \frac{2em^*}{d} + \log \frac{4}{\delta} \right)},$$
where $\hat{\mathcal{L}}_k(h_k)$ is the empirical risk on the $k$-th client given $m^*$ ($m^* > m$) observed samples, and $e$ is the base of the natural logarithm.
By combining the local dataset with pseudo-annotated samples, FedAlign increases the sample size, i.e., $m^* > m$; thus the last term of the bound becomes smaller. Second, given that the selected samples lie in proximity to the anchors, which are derived from the ensemble of the empirical distributions across all clients, the distribution induced via class anchors exhibits lower divergence from the global distribution than the original local distribution, i.e., $d_{\mathcal{H}\Delta\mathcal{H}}(\tilde{\mathcal{D}}_k', \tilde{\mathcal{D}}) < d_{\mathcal{H}\Delta\mathcal{H}}(\tilde{\mathcal{D}}_k, \tilde{\mathcal{D}})$. The proof and more details are given in the Appendix. Therefore, FedAlign can achieve a better generalization bound than traditional methods [31], suggesting strong potential for performance improvement.

EXPERIMENTS

Datasets
We conduct experiments on 6 datasets covering 4 different application scenarios and both single-label and multi-label classification problems. Table 1 summarizes the datasets. (2) Medical Code Prediction. Each medical code indicates whether a patient has a specific medical condition or is at risk of developing one. The task is to annotate medical codes from clinical notes. We start with the MIMIC-III database [8] and follow the preprocessing method in [33] to form the benchmark MIMIC-III 50-label dataset. The classes span 10 categories in the ICD-9 taxonomy. We construct MIMIC-III-10 by partitioning the dataset into 10 clients following the same strategy as in ES-5. (3) Human Activity Recognition. The task aims at identifying the movement or action of a person based on sensor data. We start with the PAMAP2 [39] dataset, which collects physical-activity data from 9 subjects. We construct the PAMAP2-9 dataset by regarding each subject as a client. For each client, we randomly select 5 classes to be its locally-identified classes. (4) Text Classification. We use the Reuters-21578 R8 dataset [3], which consists of news articles classified into 8 categories. We construct R8-8 by randomly partitioning the data into 8 subsets and assigning one subset to each client. For each client, we randomly select 3 classes to be the identified classes. More details about data preprocessing are described in the Appendix.

Compared Methods
We compare FedAlign with classical [31] and state-of-the-art federated learning methods for non-IID data [10,20,21] as follows.
• FedAvg [31] is a classical federated learning algorithm where the server averages the updated local model parameters in each round to obtain the global model.
• …, i.e., adding a small number of new parameters to the pretrained language model. The adapters are transferred and aggregated, while the other layers remain fixed at all parties.

Evaluation Metrics. Due to label imbalance, we adopt both accuracy and F1-score to evaluate the performance. They are often used as benchmark metrics for the datasets and tasks in our experiments [5,37,39,41]. We calculate the metrics for each class and report the macro-average. All experiments are repeated 5 times with a fixed set of random seeds for all compared methods.

Train/Test Split. We set aside a portion of the dataset for testing the global model.

Hyperparameters. For the compared methods, we try different values for the hyperparameters $\mu$ in FedProx and MOON, and $\alpha$ in FedRS, as often adopted in the previous papers [20,21,23]. The values are displayed alongside the method names in Table 2.

Main Results and Analysis
Multi-Label, Non-overlapping Client Class Sets. Table 2 shows the results. As one can clearly see, FedAlign always yields better performance than the baseline methods. Remarkably, with non-identical client class sets, the three state-of-the-art algorithms designed to deal with non-IID data (i.e., FedProx, MOON, and Scaffold) do not guarantee improvement over FedAvg (e.g., Scaffold loses to FedAvg on ES-5). In addition, although FedRS and FedPU are designed for missing-class scenarios, their mechanisms are specifically tailored for single-label classification. In the context of multi-label classification, the label of one class does not indicate the labels of other classes, and the weight update of a class is solely influenced by its own features. Therefore, the scaling factors in FedRS and the misclassification-loss estimation in FedPU become ineffective.

Single-Label, Non-identical but Overlapping Client Class Sets. FedAlign outperforms the baselines on both applications. The non-IID problems that FedRS and FedPU aim to tackle (i.e., the missing-class scenario, and positive and unlabeled data) are slightly different from ours. Although they show improvements over FedAvg and over methods designed for the typical non-IID setting (i.e., FedProx, MOON, and Scaffold), FedAlign shows better performance than FedRS and FedPU on the problem of non-identical client class sets.

Performance w.r.t. Communication Rounds. Figure 4 shows the test performance with respect to communication rounds. FedAlign shows its advantage from the early stages of training. This indicates the pretrained text representations provide a good initialization for the label encoder to guide the alignment of latent spaces. We do notice a decrease in the F1-score of FedAlign on ES-25 during the initial rounds. This can be attributed to noise in the pseudo annotations for locally-unaware classes due to the undertrained encoders. However, as training progresses, the quality of the pseudo annotations improves, leading to enhanced performance.

Ablation Study
We conduct an ablation study to evaluate the contribution of each design in FedAlign. First, we evaluate the method without alignment for locally-unaware classes (denoted as FedAlign w/o AL). In this variant, the classification model still consists of a data encoder and a label encoder, and the framework conducts alternating training of the two modules. Second, we evaluate the method without semantic label name sharing (denoted as FedAlign w/o SE). In this case, the dataset for alignment is formed by annotating samples according to the prediction confidence given by the latest global model: for locally-unaware classes, samples with high prediction confidence are pseudo-annotated. Since the model aggregation method in FedAlign is based on FedAvg (i.e., averaging the model parameters), we also compare FedAvg as the baseline method. Table 3 shows the F1-scores. We notice the performance decreases when removing any of the designs. This suggests that all designs in FedAlign contribute to the improvement, and combining them produces the best performance.

Sensitivity Analysis
Participating Clients Per Round. The number of participating clients in each round (i.e., $|\mathcal{S}_t|$) affects the speed of convergence [22]. We vary $|\mathcal{S}_t|$ from 1 to 5 and compare FedAlign with all baseline methods. The comparisons in F1-score are shown in Figure 5(a). We observe that FedAlign always outperforms the baseline methods under different values of $|\mathcal{S}_t|$.
Local Training Epochs. We vary the local training epochs from 1 to 5 and compare the performance of FedAlign with all baseline methods. The comparisons are shown in Figure 5(b). We see that FedAlign has consistently better performance than the baselines.

Distance Threshold for Selecting Samples for Unaware Classes. In Section 3.3, we set the threshold for assigning labels to samples for locally-unaware classes based on distance percentiles. To test the robustness of FedAlign to this hyperparameter, we vary the threshold for annotating positive samples by using different percentiles (95 to 99.9). Figure 5(c) shows the result. We see that FedAlign only needs a very small amount of pseudo annotations to achieve significant improvements over FedAvg. Notably, samples closer to the class anchors exhibit a higher likelihood of being accurately annotated, providing better guidance for alignment.

We apply the method of [50] to cluster the class representations and sort the label names based on the assigned clusters. We visualize the cosine similarities of a subset of the classes in Figure 7(a), where brighter colors indicate higher similarity. The observed similarity patterns in the class representations conform with our knowledge of which contexts of daily activities often happen together. For example, the representations of the classes "toilet" and "bathing", "meeting" and "with co-workers", and "gym" and "exercise" have higher similarity, while they have less similarity with other classes. To provide a reference for ground truth, we calculate the PMI of labels based on their co-occurrence in the centralized dataset, which indicates how strong the association is between every two classes. We show the results in Figure 7(b). The brighter the color, the higher the PMI (i.e., the stronger the association between the two classes). The order of the classes is the same as in Figure 7(a). We observe the two figures display similar patterns of associations among classes. Although the class sets of different clients are non-overlapping, the label encoder trained via FedAlign successfully captures associations among classes across clients.

RELATED WORK
Federated Learning with Non-IID Data. One of the fundamental challenges in federated learning is the presence of non-IID data [9]. The causes of and solutions to this challenge are being actively explored. Common solutions involve adding local regularization [10,20,21], improving server aggregation [25,45,46], and leveraging public datasets [25] or synthesized features [28,53] to calibrate models. These methods tackle more relaxed non-IID problems that assume clients share the same set of classes. As shown in our experiments, these baselines show marginal improvements over FedAvg when the clients have unaware classes. Some recent works [23,26] consider the problem of clients having access to only a subset of the entire class set. For example, FedRS [23] addresses the case where each client only owns data from certain classes. FedPU [26] focuses on the scenario where clients label a small portion of their datasets, and there exists unlabeled data from both positive (i.e., locally-identified in our terminology) and negative (i.e., locally-unaware) classes. These problem settings differ from ours. Moreover, these methods are specifically tailored for single-label classification, where the presence of one class implies the absence of all other classes. When applied to our problem, they demonstrate less improvement compared to FedAlign.

Label Semantics Modeling. In tasks where some label patterns cannot be directly observed from the training dataset, such as zero-shot learning [11], it is hard for the model to generalize to unseen classes. To deal with this problem, several methods have been proposed to leverage prior knowledge such as knowledge graphs [44] or to model semantic label embeddings from textual information about classes [14,29,38,47]. For example, Ba et al. [14] derived embedding features for classes from natural language descriptions and learned a mapping to transform text features of classes into the visual image feature space. Radford et al. [38] used contrastive pretraining to jointly train an image encoder and a text encoder to predict the correct pairings of images and text captions, which helps produce high-quality image representations. Matsuki et al. [29] and Wu et al. [47] incorporated word embeddings for zero-shot learning in human activity recognition. These methods show the potential of using semantic relationships between labels to enable predictions for classes not observed in the training set, which motivates our design of semantic label name sharing.

CONCLUSIONS AND FUTURE WORK
We studied the problem of federated classification with non-identical client class sets. We proposed the FedAlign framework and demonstrated its use in federated learning for various applications. FedAlign incorporates a label encoder into the backbone classification model. Semantic label learning is conducted by leveraging a domain-related corpus and shared label names. The pretrained semantic label embeddings encode knowledge of label correlations and are used to guide the training of the data encoder. Moreover, the anchor-guided alignment enriches features for unaware classes at each client based on global class anchors and reduces the discrepancy between local distributions and the global distribution. These two designs are key to mitigating client variance in FedAlign, which addresses the challenge of non-identical class sets. We show that FedAlign improves upon baseline algorithms for federated learning with non-IID data and achieves a new state-of-the-art.
It is worth mentioning that FedAlign can still work when the clients can only share label IDs, by assuming the label names are unknown and randomly initializing the label encoder. Of course, advanced techniques like neural language models can be applied to generate and enrich the label representations, and we leave this as future work. Moreover, for future directions, we consider more general system heterogeneity where the participants have different network architectures, training processes, and tasks. We plan to extend our study to make federated learning compatible with such heterogeneity.

Denote $\mathcal{D}$ as the global distribution on the input space $\mathcal{X}$, and $\tilde{\mathcal{D}}$ as the induced global distribution over the latent feature space $\mathcal{Z}$. For the $k$-th local domain, denote $\mathcal{D}_k$ as the local distribution and $\tilde{\mathcal{D}}_k$ as the induced image of $\mathcal{D}_k$ over $\mathcal{Z}$. A hypothesis $h: \mathcal{Z} \to \{0, 1\}$ is a function that maps features to predicted labels. Let $\tilde{g}$ be the induced image of the ground-truth labeling function over $\mathcal{Z}$. The expected risk of hypothesis $h$ on distribution $\mathcal{D}$ is defined as $\mathcal{L}_{\mathcal{D}}(h) = \mathbb{E}_{z \sim \tilde{\mathcal{D}}}\big[|h(z) - \tilde{g}(z)|\big]$. Let $\lambda_k$ denote the risk of the optimal hypothesis over the hypothesis class $\mathcal{H}$ that has minimum risk on both the $\mathcal{D}$ and $\mathcal{D}_k$ distributions, i.e., $\lambda_k = \min_{h \in \mathcal{H}} (\mathcal{L}_{\mathcal{D}}(h) + \mathcal{L}_{\mathcal{D}_k}(h))$.
We define distance functions for measuring the divergence between two distributions with respect to the hypothesis class. First, given a feature space $\mathcal{Z}$ and a collection $\mathcal{A}$ of measurable subsets of $\mathcal{Z}$, define the $\mathcal{A}$-distance between two distributions $\tilde{\mathcal{D}}$ and $\tilde{\mathcal{D}}'$ on $\mathcal{Z}$ as:
$$d_{\mathcal{A}}(\tilde{\mathcal{D}}, \tilde{\mathcal{D}}') = 2 \sup_{A \in \mathcal{A}} \left| \Pr_{\tilde{\mathcal{D}}}(A) - \Pr_{\tilde{\mathcal{D}}'}(A) \right|.$$
The proof can be found in prior work [25,35,53].
Therefore, the upper bound of the expected risk with the mix-up distribution is lowered. □

Figure 1 :
Figure 1: Illustrations of our problem setting and unique challenge of misaligned latent spaces across clients, using a behavioral context recognition system where users have different preferences in reporting (i.e., annotating) labels.

(2) Connect the data representations via anchors of locally-unaware classes. During local training, we regard the global class representations as anchors and utilize data points that are sufficiently close to or far from the anchors of locally-unaware classes to align the data encoders. Specifically, as shown in Figure 2, at each client, we annotate local data based on their distances to the anchors and add another cross-entropy loss between the pseudo-labels and the model predictions. Such regularization encourages the data encoders to reside in the same latent space. Our theoretical analysis shows that FedAlign can achieve a better generalization bound than traditional federated learning methods, suggesting strong potential for performance improvement. Experiments on four real-world datasets, including the most challenging scenario of multi-label classification with non-overlapping client class sets, confirm that FedAlign outperforms various state-of-the-art (non-IID) federated classification methods. Our contributions are summarized as follows:
• We propose a more general yet practical federated classification setting, namely non-identical client class sets. We identify the new challenge caused by the heterogeneity in client class sets: local models at different clients may operate in different and even independent latent spaces.
• We propose a novel framework, FedAlign, to align the latent spaces across clients from both label and data perspectives.
• Our generalization-bound analysis and extensive experiments on four real-world datasets of different tasks confirm the superiority of FedAlign over various state-of-the-art (non-IID) federated classification methods both theoretically and empirically.

Figure 2 :
Figure 2: Overview of the FedAlign framework. The label names are leveraged as a common ground for label encoders to anchor class representations. During local training, the two encoders perform alternating training to mutually regulate the latent spaces. The global class representations are regarded as class anchors. Pseudo-labels are assigned to partially-unlabeled local samples for unaware classes based on their distances to the anchors. An additional cross-entropy loss for unaware classes is added to the local learning objective to reduce the divergence between global and local distributions.

(4) Model aggregation: The server aggregates the parameters of client models into global parameters. Pretraining of text representations and label encoder initialization in step (1) are conducted only once at the beginning. Steps (2)-(4) repeat for $T$ rounds until the global model converges. During local training in step (3), each client $k \in \mathcal{S}_t$ conducts the following steps: (a) Select samples for unaware classes via class anchors: client $k$ forms a dataset $\mathcal{D}_k'^{(t)}$ for its locally-unaware classes $\bar{\mathcal{C}}_k$ by using the latest class representations as anchors and computing their distances to the data representations. (b) Alternating training of the two encoders: client $k$ freezes the label encoder and updates the data encoder; then it freezes the data encoder and updates the label encoder. (c) Model communication after local updates: client $k$ sends the updated model parameters to the server.
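Putting steps (2)-(4) together, one communication round can be sketched as below. The local routine is a deliberate stub (real clients would run the anchor-based annotation and alternating encoder updates of steps (a)-(b)); only the round orchestration is the point here, and all names and values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def local_training(theta, phi, local_mean):
    """Stub for steps (a)-(b): in FedAlign this would pseudo-annotate
    unaware classes via anchors and alternately update the two encoders.
    Here each client just pulls theta toward its local data mean."""
    return theta + 0.5 * (local_mean - theta), phi

def fedalign_round(theta, phi, client_means, n_selected):
    # (2) the server samples a subset S_t of clients and broadcasts parameters
    selected = rng.choice(len(client_means), size=n_selected, replace=False)
    # (3) each selected client runs local training and returns its parameters
    updates = [local_training(theta, phi, client_means[k]) for k in selected]
    # (4) model aggregation: average the returned parameters
    thetas, phis = zip(*updates)
    return np.mean(np.stack(thetas), axis=0), np.mean(np.stack(phis), axis=0)

# four toy clients whose local data means differ; run T = 50 rounds
client_means = [np.full(3, v) for v in (1.0, 2.0, 3.0, 4.0)]
theta, phi = np.zeros(3), np.zeros(2)
for t in range(50):
    theta, phi = fedalign_round(theta, phi, client_means, n_selected=2)
```

Even with partial participation, repeated rounds keep the aggregated parameters near the average of the clients' local targets, which is the behavior the convergence loop relies on.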

Figure 3: (a) illustrates how positive samples are annotated for locally-unaware classes based on distances to class anchors. (b) shows the effect of matching and alignment.

Figure 3(b) illustrates the effect of these two losses.

Theorem 4.1 applies to the traditional algorithm FedAvg [31]; from it, we observe two factors that affect the quality of the global hypothesis: the divergence between the local and global distributions, d_HΔH(D_k, D), and the sample size m. We then discuss the generalization bound when FedAlign introduces empirical distributions for locally-unaware classes to align latent spaces.

Corollary 4.1.1 (Generalization Bound of Federated Learning with Mix-up Distributions). Let D'_k denote the distribution added for aligning the k-th client. Define the mix-up distribution D*_k to be a mixture of the original local distribution D_k and D'_k: D*_k = α D_k + (1 − α) D'_k, where α ∈ [0, 1] is the weight of the original distribution.

For MIMIC-III and R8, we use the data split provided by the dataset. For the other datasets, we use 20% of the data for testing and distribute the rest of the data to clients for training.

Federated Learning Setting. For ES-5, ES-15, ES-25, PAMAP2-9 and R8-8, we run 50 communication rounds. For MIMIC-III-10, we run 100 rounds as it takes longer to converge. The number of selected clients per round is 5 and the number of local epochs is 5. We conduct a sensitivity analysis in Section 5.6 and show that the conclusions are robust to the number of selected clients and local epochs.
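The FedAvg-style server aggregation analyzed in Theorem 4.1 (a sample-size-weighted average of client parameters) can be sketched as follows; the function name and the list-of-arrays parameter layout are illustrative assumptions:

```python
import numpy as np

def fedavg_aggregate(client_params, client_sizes):
    """Server-side FedAvg step: average each layer's parameters across
    clients, weighted by the clients' local sample counts."""
    w = np.asarray(client_sizes, dtype=float)
    w = w / w.sum()                         # normalized aggregation weights
    return [sum(wi * layer for wi, layer in zip(w, layers))
            for layers in zip(*client_params)]
```

With 5 selected clients per round, the server would call this once per round on the five returned parameter lists.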

Figure 4: Performance w.r.t. communication rounds on six datasets. The results are averaged over 5 runs.
Performance w.r.t. Distance Threshold

Figure 6 :
Figure 6: Data representations generated by two local models and the global model on the testing set of PAMAP2-9.

(a) Similarity of Class Representations. (b) PMI of Labels in Centralized Dataset.

Figure 7: (a) shows cosine similarities among class representations of ES-25 learned via FedAlign. (b) demonstrates the PMI of labels in the centralized dataset as a reference for the ground truth. Brighter colors indicate higher similarity/PMI.

Furthermore, given a particular hypothesis class H, define A_HΔH = {Z_h Δ Z_h' | h, h' ∈ H}, where the Δ operation is the symmetric difference in the sense of set operations. Define the HΔH-divergence between two distributions D and D' on Z as d_HΔH(D, D') = d_{A_HΔH}(D, D').
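The HΔH-divergence is hard to compute exactly, but a standard empirical proxy (in the spirit of the proxy A-distance of Ben-David et al.) trains a domain classifier to distinguish samples drawn from the two distributions and converts its error into a divergence estimate. The sketch below uses a deliberately simple nearest-centroid classifier; the function name and that classifier choice are assumptions for illustration:

```python
import numpy as np

def proxy_a_distance(S, T):
    """Empirical proxy for the A-distance between two samples of
    representations: fit a domain classifier to separate S from T
    (nearest-centroid here, for simplicity) and set d_A = 2 * (1 - 2*err).
    Values near 0 mean the two samples are hard to tell apart; values
    near 2 mean they are far apart."""
    cs, ct = S.mean(axis=0), T.mean(axis=0)      # domain centroids
    X = np.vstack([S, T])
    preds = np.linalg.norm(X - ct, axis=1) < np.linalg.norm(X - cs, axis=1)
    truth = np.array([False] * len(S) + [True] * len(T))
    err = np.mean(preds != truth)
    return 2.0 * (1.0 - 2.0 * err)
```

This estimates an A-distance over a restricted classifier family rather than the full HΔH-divergence, so it should be read as a diagnostic, not the quantity in the bound.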

Theorem A.1 (Generalization Bound of Federated Learning³). Assume there are K clients in a federated learning system. Let H be the hypothesis class with VC-dimension d. The global hypothesis is the aggregation of the local hypotheses h_k, i.e., h = (1/K) Σ_{k∈[K]} h_k. With probability at least 1 − δ, for all h ∈ H:

L_D(h) ≤ (1/K) Σ_{k∈[K]} L̂_{D_k}(h_k) + (1/K) Σ_{k∈[K]} [ d_HΔH(D_k, D) + λ_k ],

where L̂_{D_k}(h_k) is the empirical risk on the k-th client given m observed samples, λ_k is a complexity term of order √((d log(2em/d) + log(2/δ))/m), and e is the base of the natural logarithm.

Corollary A.1.1 (Generalization Bound of Federated Learning with Mix-up Distributions). Let D'_k denote the distribution added for adapting the k-th client. Define the new distribution D*_k to be a mixture of the original local distribution and the adaptation distribution, i.e., D*_k = α D_k + (1 − α) D'_k, where α ∈ [0, 1] is the weight of the original distribution, decided by the number of empirical samples added. Let H be the hypothesis class with VC-dimension d. The global hypothesis is the aggregation of the local hypotheses h_k, i.e., h = (1/K) Σ_{k∈[K]} h_k.
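One way to see how the mixture distribution enters the bound: writing α for the mixture weight, the divergence is a supremum of absolute differences of probabilities, and is therefore convex in each argument. A sketch of that step (this is the standard convexity argument, not a restatement of the paper's full proof):

```latex
\bigl|\Pr_{\mathcal{D}^{*}_{k}}(A)-\Pr_{\mathcal{D}}(A)\bigr|
  \le \alpha\,\bigl|\Pr_{\mathcal{D}_{k}}(A)-\Pr_{\mathcal{D}}(A)\bigr|
  + (1-\alpha)\,\bigl|\Pr_{\mathcal{D}'_{k}}(A)-\Pr_{\mathcal{D}}(A)\bigr|,
  \quad \forall A \in \mathcal{A}_{\mathcal{H}\Delta\mathcal{H}},
% taking the supremum over A on both sides:
d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}^{*}_{k},\mathcal{D})
  \le \alpha\, d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_{k},\mathcal{D})
  + (1-\alpha)\, d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}'_{k},\mathcal{D}).
```

Hence, when the added distribution D'_k is closer to the global distribution than the original local distribution D_k, mixing strictly reduces the divergence term in the bound.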

To mitigate such drift, we further exploit the global class representations to assist the alignment for locally-unaware classes. Since we formulate the classification problem as a matching between representations of classes and local data at each client, the class representations produced by the global label encoder can reflect the global distribution. Therefore, we regard the global class representations as anchors and use them to identify features for unaware classes at each client. Specifically, at the beginning of each round of local training, the client measures the distances between class anchors and local data representations; the nearest and farthest samples from the anchors are annotated, and an additional loss term is added to the local optimization objective to reduce the distribution mismatch. Compared with common practices of pseudo-labeling [12, 30], which assign labels based on model predictions, the annotations assigned by our anchor-guided method are independent of the biased classifier and are thus more reliable.

Deriving Class Anchors. When client k receives the parameters φ^(t) of the label encoder at the t-th round, it uses the latest label encoder to derive the global class anchors: {g(w; φ^(t)) | w ∈ W}.

Selecting Samples for Locally-Unaware Classes. Client k uses the received data encoder to generate representations of its local data: {f(x; θ^(t)) | x ∈ X_k}. Then, the client calculates the cosine distance from every class anchor to the local data representations in the latent space.

Table 1 offers an overview and the details are as follows. (1) Behavioral Context Recognition. The task is to infer the context of human activity. ExtraSensory [41] is a benchmark dataset for this task. The classes can be partitioned into 5 categories (e.g., location, activity). Based on ExtraSensory, we construct 3 datasets with non-overlapping client class sets. ES-5: We set 5 clients, and every client only has annotations from a different category (i.e., one category per client). Training samples are then assigned to clients according to their associated classes. Since ExtraSensory is a multi-label dataset, we assign samples based on the most infrequent class among their multiple labels to ensure each locally-identified class has at least one positive sample. To make this dataset more realistic, we always assign all data of a subject to the same client. ES-15 and ES-25: We increase the number of clients to 15 and 25 to further challenge the compared methods. We start with the 5 class groups of ES-5 and iteratively split the groups until the number of class groups matches the number of clients. During every split, we select the group with the most classes and randomly divide it into two sub-groups. Every class group is visible, and only visible, to one client. One can then apply a process similar to ES-5 to assign training samples to clients. (2) Medical Code Prediction. Medical codes describe whether a

• FedProx [21] enforces an ℓ2 regularization term in local optimization, which limits the distance between the global and local models.
• MOON [20] adds a contrastive loss term to maximize the consistency between representations learned by the global and local models and to minimize the consistency between representations learned by the local models of consecutive rounds.
• Scaffold [10] maintains control variates to estimate the update directions of the global and local models. The drift in local training is approximated by the difference between the two update directions, and this difference is then added to the local updates to mitigate the drift.
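The ℓ2 regularization used by the FedProx baseline can be sketched as follows; the function name, parameter layout, and the value of the coefficient mu are illustrative assumptions, not FedProx's reference implementation:

```python
import numpy as np

def fedprox_local_objective(local_params, global_params, task_loss, mu=0.01):
    """FedProx-style local objective: the task loss plus a proximal term
    (mu/2) * ||w - w_global||^2 summed over all parameter tensors, which
    discourages the local model from drifting far from the global model."""
    prox = sum(np.sum((w - g) ** 2)
               for w, g in zip(local_params, global_params))
    return task_loss + 0.5 * mu * prox
```

Larger values of mu pull local updates more strongly toward the global model, trading local fit for reduced client drift.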

Table 1: Dataset statistics. The imbalance factor refers to the ratio of the smallest class size to the largest class size.

For a fair comparison, we use the same model setting for all compared methods. The data encoder is based on the Transformer architecture [43] with one encoder layer. There are 4 attention heads, and the dimension of the feedforward network is 64. The label encoder is a single-hidden-layer neural network. The dimension d of the representations is 256. Since the size of the label encoder is equivalent to that of the classifier layer in a conventional classification model, there is no extra overhead during model communication in FedAlign. Additionally, when considering future work involving the use of advanced neural language models as the label encoder, we can train only the adapter module [7].

Table 3: F1-Score (%, Averaged Over 5 Runs) of the Ablation Study

Visualization of Feature Latent Spaces. We visualize the learned data representations in PAMAP2-9. We generate the data representations on the testing set with the global model and the local models of two participating clients after 50 communication rounds. The locally-identified classes at the two clients are {walking, running, cycling, ironing, rope jumping} and {walking, lying, sitting, standing, vacuum cleaning}, respectively; there are one overlapping class and four client-exclusive classes per client. We use t-SNE [42] to project the representations to 2-dimensional embeddings and compare the representations learned by FedAvg and FedAlign. To see whether the representations generated by different client models are aligned by class, for each algorithm we gather the data representations generated by the client models and the global model together to perform the t-SNE transformation. The visualization is shown in Figure 6, where we position them in the same coordinates. When training via FedAvg, we observe that the data representations of the same class generated by the two local models are likely to fall into different locations in the latent space. This suggests that the latent spaces of the two clients are misaligned, leading to less discriminability among data representations from different classes in the global latent space after model aggregation. On the contrary, when training via FedAlign, the data representations of the same class generated by the two local models land in similar locations in the latent space. In addition, the data representations learned by FedAlign have clearer separations than those learned by FedAvg.

Similarity Among Class Representations. We then analyze the similarities among the class representations of ES-25 learned via FedAlign. Recall that ES-25 is the multi-label classification task where the class sets at different clients are non-overlapping. We use the label encoder from the global model trained after 50 rounds to generate class representations. For a clear view of group similarities, we apply Spectral Clustering