Mitigating Group Bias in Federated Learning for Heterogeneous Devices

Federated learning is emerging as a privacy-preserving approach to model training in distributed edge applications. However, most edge deployments are heterogeneous in nature, i.e., their sensing capabilities and environments vary across deployments. This edge heterogeneity violates the independent and identically distributed (IID) property of local data across clients and produces biased global models, i.e., models that contribute to unfair decision-making and discrimination against particular communities or groups. Existing bias mitigation techniques focus only on bias arising from label heterogeneity in non-IID data, without accounting for domain variations due to feature heterogeneity, and do not address the global group-fairness property. Our work proposes a group-fair FL framework that minimizes group bias while preserving privacy and incurring no resource-utilization overhead. Our main idea is to leverage average conditional probabilities to compute cross-domain group \textit{importance weights} derived from heterogeneous training data, and to optimize the performance of the worst-performing group using a modified multiplicative weights update method. Additionally, we propose regularization techniques to minimize the difference between the worst- and best-performing groups, while our thresholding mechanism strikes a balance between bias reduction and group performance degradation. Our evaluation on human emotion recognition and image classification benchmarks assesses the fair decision-making of our framework in real-world heterogeneous settings.


Introduction
Federated learning (FL) is a privacy-preserving machine learning (ML) technique wherein local models are trained on decentralized edge devices (clients) and subsequently aggregated at the server to form a global model. This approach alleviates the need for raw data transfers and ensures data privacy, making it particularly well-suited for applications with privacy sensitivities, such as medical diagnosis [21,38,59], next-character prediction [67], activity recognition [17,56,66], and human emotion recognition [14,46,72], where preserving data security is imperative. Despite its merits, there is a growing concern regarding FL models, as they exhibit exceptional performance for certain groups while simultaneously underperforming for others (e.g., providing more accurate image captioning for pristine-group images than for noisy-group images, as shown in Figure 1). A group categorizes data based on attributes such as race, gender, class, or label [7].
Group biases and discriminatory practices threaten societal well-being, undermining public confidence in ML models and their applications [7]. Research shows racial bias in electronic health records, especially in medical analysis, potentially causing treatment disparities for minority groups [68]. Biased models often result from label heterogeneity in non-IID data across clients, as discussed in works like [52,57], arising from diverse label distributions tied to data collection device environments. For example, certain geo-regions may have varying label distributions, reducing training data volume for specific groups [8,30].
Our work highlights feature noise heterogeneity as a significant source of group bias in FL models, stemming from varied noise-influenced features due to domain differences, especially in heterogeneous devices [48]. Such heterogeneity, caused by environmental or device-specific factors such as limited resources, leads to distinct feature distributions in local client data. For example, low-quality sensors on some devices introduce distortions such as Gaussian noise, resulting in feature distributions that differ from those of high-quality sensor devices [47]. This inherent feature noise causes shifts in group data moments, the statistical properties such as mean and variance within a group in a dataset [35], influencing biased model outcomes.
Figure 1. Illustrating the adverse effects of feature heterogeneity (noise) and its bias impact on image classification data [42] for an example language model (LM) in FL settings. The global LM, performing image captioning based on features from multiple clients, shows higher performance for images without distortions than for those with a shift in feature distributions. This emphasizes the intricate interplay of feature heterogeneity and bias in FL, highlighting the influence of heterogeneous client datasets on the model's outcome.
Previous FL research introduces Disparate Learning Processes (DLPs) to tackle bias and fairness issues. Examples of DLPs include in-processing methods such as [9,11,12,15,16,18,23,31,43,44,52,57,61,73,74,76] and robustness and generalization strategies such as [34,41]. In-processing techniques modify learning to include group fairness constraints, while robustness and generalization techniques enhance model resilience in diverse data settings. However, DLPs do not ensure fairness in settings with feature heterogeneity, especially heterogeneity due to feature noise, as they do not address misaligned moments in feature distributions [35]. For DLPs that use "reweighting" with importance weights to adjust the model's objective function, effectiveness relies on suitable importance-weight selection [6]. Importance weights prioritize specific groups or features during training to mitigate biases and enhance fairness [6]; if not chosen carefully, or if not aligned with the genuine sources of bias, these weights can perpetuate unfairness [6]. We propose using weights derived from noisy feature data for more efficient debiasing of FL models affected by feature noise. This work introduces learnable importance weights derived from heterogeneous data features to enhance fairness in training, utilizing the multiplicative weights update (MW) method [3] to improve fairness based on feature characteristics, especially for data with feature noise. Our approach is inspired by insights from social science, particularly work addressing discrimination as a determinant of health disparities [36]. By incorporating learnable importance weights, we aim to mitigate biases across demographic groups, contributing to a more equitable FL framework.
The efficacy of importance weighting diminishes due to exploding weight norms, since the empirical risk scales with the importance weights, especially in large models, risking overfitting [6]. To tackle this, we propose using neural network regularization techniques [55] in our Multiplicative Weight update with Regularization (MWR) method to mitigate group bias. Additionally, methods using importance weighting may introduce unfairness by overly emphasizing poorly performing groups, potentially reducing the performance of better-performing groups to minimize overall variability [13]. To address this issue, we present a heuristic approach for deriving importance weights that mitigate group bias while maintaining a performance threshold for better-performing groups, preventing their performance from dropping below a desirable level. We summarize our contributions below:
• We ensure that MWR optimizes the performance of the worst-performing group while also keeping the performance of the best-performing group above a desirable threshold.
• Implementation and Evaluation: We implement and evaluate the MWR method against existing bias-mitigation techniques on commonly used state-of-the-art image classification FL benchmark datasets (CIFAR10 [37], MNIST [40], FashionMNIST [71], USPS [32], SynthDigits [24], and MNIST-M [24]). Our findings show that MWR outperforms baseline methods, boosting the accuracy of the worst group's performance by up to 41% without substantially degrading the best group's performance.
Background and Related Work

Bias in Machine Learning

Bias in ML refers to a model favoring specific individuals or groups, leading to unfair outcomes [51]. Common sources of bias in centralized learning include prejudice, underestimation, and negative legacy [1,8,49]. Techniques such as pre-processing, in-processing, and post-processing [22,26,33] have effectively mitigated centralized learning bias. However, applying centralized learning techniques in FL is challenging due to privacy concerns, as they require access to features across clients and risk compromising data privacy.

Bias Metrics
In FL, group bias is assessed along three dimensions: 1) aiming for equal opportunity by evaluating the discrepancy in True Positive Rates (TPR) between groups [58,69]; 2) optimizing the worst-case TPR (WTPR) across groups [50,58]; 3) minimizing the standard deviation of TPR (TPRSD) to ensure fairness across groups [58,73]. The choice of TPR as the performance metric for assessing group fairness aligns our approach with recent advancements in the bias mitigation literature [58]. This decision stems from recognizing the critical importance of fairly detecting true positives, which cannot be addressed by relying on accuracy alone.
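As an illustrative sketch (our naming, not the paper's evaluation code), the three metrics above can be computed from per-group TPRs as follows:

```python
import numpy as np

def group_fairness_metrics(tpr_per_group):
    """Compute the three group-bias metrics from per-group true positive rates.

    tpr_per_group: dict mapping group id -> TPR in [0, 1].
    Returns (TPRD, WTPR, TPRSD): the best-worst discrepancy, the
    worst-case TPR, and the standard deviation of TPRs across groups.
    """
    tprs = np.array(list(tpr_per_group.values()))
    tprd = tprs.max() - tprs.min()   # equal-opportunity gap between groups
    wtpr = tprs.min()                # minimax (worst-case) group TPR
    tprsd = tprs.std()               # spread of TPRs across groups
    return tprd, wtpr, tprsd

tprd, wtpr, tprsd = group_fairness_metrics({"A": 0.9, "B": 0.7, "C": 0.8})
```

A fairness-improving method should raise WTPR while shrinking TPRD and TPRSD.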
While our primary focus is on achieving fairness with a minimax property (optimizing WTPR outcome within each group), we evaluate using various fairness metrics to ensure versatility and broad support.
Client fairness targets the development of algorithms leading to models that exhibit similar performance across different clients [44]. Group fairness, on the other hand, requires the model to perform similarly on different demographic groups [73]. Many state-of-the-art fairness techniques in FL, focusing on client fairness and group fairness, use in-processing methods to modify the learning process or objective function by incorporating fairness constraints [73].
In-processing involves assigning weights to the objective function for different clients or groups during training, balancing the model's influence on them. For instance, AFL [52] optimizes the combination of worst-weighted losses from local clients, proving resilient to data with an unknown distribution. q-FFL [44] reweights loss functions to give higher weights to devices with poorer performance, addressing challenges in fair resource allocation in computer networks. TERM handles outliers and class imbalance by tilting the loss function with a designated tilting factor [43]. GIFAIR-FL [73] adds a regularization term that penalizes the spread of client and group losses to drive uniform performance [48]. It is important to note that Collaborative Fairness, while related, does not specifically address mitigating group bias in FL, as these techniques do not inherently focus on improving group performances.
Robustness and generalization techniques address distributional shifts in user data. For instance, FedRobust [60] trains a model to handle worst-case affine shifts, assuming that each client can express its data distribution as an affine transformation of a global distribution, with a focus on group fairness. However, FedRobust requires sufficient data on each client to estimate the local worst-case shift, which impacts global model performance when this condition is unmet. FedNTD tackles catastrophic forgetting through not-true distillation [29], but may not fully handle bias arising from feature noise. SCAFFOLD [34] addresses client drift in heterogeneous data by estimating update directions. However, SCAFFOLD may not correct the moments of noisy feature distributions. In contrast, we use importance weights derived from noisy features to prioritize disadvantaged groups during training, enhancing fairness by indirectly correcting misaligned moments.

Preliminary Study
This section analyzes group-bias arising from heterogeneous feature distributions within local data across clients.The study utilizes Federated Averaging (FedAvg [45]), a widely adopted aggregation method for training global models in FL.

Experimental Setup
Applications and Datasets. Our study analyzes group bias across $|C| \in \{4, 5\}$ clients (computers that simulate the FL environment, mirroring real-world heterogeneous data-collection devices, following recent works in FL [30,52,73]) using two deep learning models and two datasets. We employ the ResNet model [28] for CIFAR10 [37] image classification and a Convolutional Neural Network (CNN) on the DIGITS classification dataset, which comprises data from diverse sources with feature shifts. The goal is to replicate real-world FL scenarios with varied client data. We construct the DIGITS dataset by combining data from SynthDigits [24], MNIST-M [24], and MNIST [4].
We select these datasets to compare group bias with existing bias mitigation techniques in FL. Each dataset is evenly distributed among the clients in the FL framework, ensuring equal allocation of group data points. Clients utilize replicated versions of the original benchmark test set, aligning noise feature distributions between training and test data.
We set all model parameters to match FL parameters for global model convergence under IID data settings, including label and feature-noise homogeneity. Client settings include a mini-batch size of 128, a learning rate of 0.01, and 40 (for CIFAR10) and 12 (for DIGITS) training rounds.

Heterogeneous Feature Distributions. We add noise to mimic real-world distorted images that do not share the same feature distribution as the pristine training images [25,62,64]. In particular, we add Gaussian noise with a variance greater than or equal to 0.03, consistent with real-world deployments [48]. We create two different distortion levels in each dataset across the clients. For CIFAR10, three advantaged clients (A, B, C) lack distortions, while the two disadvantaged clients (D, E) host data with Gaussian noise of variance $\sigma^2 \in \{0.03, 0.07, 0.11, 0.3, 0.4, 0.8, 1.0\}$. For the DIGITS dataset, two advantaged clients (C, D) lack distortions, while the other two disadvantaged clients (A, B) host data with Gaussian noise.
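A minimal sketch of how such heterogeneous client partitions can be constructed (the array shapes, client labels, and clipping are illustrative, not the exact data pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(images, variance):
    """Distort a client's image partition with additive Gaussian noise.

    images: float array in [0, 1] of shape (n, H, W, C);
    variance: the noise variance (std dev = sqrt(variance)).
    """
    noise = rng.normal(loc=0.0, scale=np.sqrt(variance), size=images.shape)
    return np.clip(images + noise, 0.0, 1.0)

# Advantaged clients keep pristine data; disadvantaged clients get noise.
partitions = {c: rng.random((100, 32, 32, 3)) for c in "ABCDE"}
for client in ("D", "E"):                    # disadvantaged CIFAR10 clients
    partitions[client] = add_gaussian_noise(partitions[client], variance=0.07)
```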

Key Findings
Non-IID Study. We study the FL model's unfairness by examining how the biased global model treats local groups differently for each client. We measure the TPR performance gap between the best and worst groups using each client's local test data (with a distortion level similar to that of the training data). Figure 2a shows group bias in CIFAR10, while Figure 2b illustrates it in DIGITS. The global model's recognition of local groups varies per client, as seen in the discrepancy between their performances. Increasing Gaussian noise on a client amplifies this difference, indicating that heterogeneous local features across clients contribute to group bias.

Limitation of Federated Averaging. We empirically investigate how heterogeneous local data distributions affect local model gradients. Post-convergence, we extract gradients from the last linear layer of each local model across two clients. Figure 3 shows histograms of these gradients, highlighting variations across clients with heterogeneous features (3b) compared to more consistent distributions in clients with homogeneous features (3a). In 3a, a Spearman correlation [53] of 0.46 indicates strong correlation and uniformity among clients with IID features; conversely, in 3b, clients with non-IID features show a correlation of −0.14, suggesting dissimilarity.

Our non-IID study underscores the challenges in conventional FedAvg schemes, revealing consistently unfair model behavior across distinct applications and datasets. This problem emphasizes the need for bias mitigation methods to alleviate adverse outcomes, including performance degradation in critical applications such as medical contexts and the inability to adapt to dynamic heterogeneous environments.
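The gradient-similarity check can be sketched as follows (a simplified rank correlation on histogram bin counts; the binning scheme and the tie handling are our assumptions, not the paper's exact procedure):

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation via Pearson correlation on ranks.

    Ties are not specially handled, which is adequate for continuous
    gradient values in this sketch.
    """
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx * ry).sum() / np.sqrt((rx ** 2).sum() * (ry ** 2).sum()))

def gradient_agreement(grad_a, grad_b, bins=50):
    """Histogram two clients' last-layer gradients over a shared range and
    rank-correlate the bin counts; low or negative values flag non-IID
    feature distributions."""
    lo = min(grad_a.min(), grad_b.min())
    hi = max(grad_a.max(), grad_b.max())
    hist_a, _ = np.histogram(grad_a, bins=bins, range=(lo, hi))
    hist_b, _ = np.histogram(grad_b, bins=bins, range=(lo, hi))
    return spearman(hist_a, hist_b)
```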

Methodology
The primary objective of our work is to address group bias resulting from feature heterogeneity across clients, all while preventing the leakage of sensitive data.In this section, we formally define our problem and then present our approach to mitigate group bias without substantially degrading the best group performance.
Each data point $(\mathbf{x}_i, y_i)$ is drawn from a distribution $\mathcal{D}(X, Y, G, C)$. Here, $\mathbf{x}_i \in X$ represents a training image from a total of $|X|$ images, $y_i \in Y$ corresponds to one of $|Y|$ targets, $g_i \in G$ denotes the group membership (from $|G|$ groups) of $\mathbf{x}_i$, and $c_i$ is the client on which $(\mathbf{x}_i, y_i)$ resides, out of $|C|$ clients. Our primary goal is to derive a global model $h_\theta$ (with parameters $\theta$) that mitigates group bias for each client, with the following objective of minimizing the importance-weighted empirical risk of the worst-performing group:

$$\min_{\theta} \max_{g \in G} \; w_g \, \hat{R}_g(h_\theta), \quad \text{where } \hat{R}_g(h_\theta) = \frac{1}{n_g} \sum_{i : g_i = g} \ell(h_\theta(\mathbf{x}_i), y_i) \qquad (1)$$

On each client, the method of [27] is used to train a local model $h_{\theta_c}$, minimizing the empirical risk of the worst-performing group. On the server side, the $h_{\theta_c}$ from all clients are received and aggregated into a global model $h_\theta$.

Workflow. We illustrate the end-to-end workflow for training with the proposed approach in Figure 4.
❶ In our setup, the server selects all the available clients in each round to avoid the effect of client sampling bias [10,70,77]. Then, the server distributes copies of the global model to the clients.
❷-❹ Each client computes the mixture of group likelihoods, denoted as $p(g \mid \mathbf{x})$ (specifically, $p(g \mid \mathbf{x}_{c,i})$). In § 4.2, we outline the privacy-preserving computation of this denominator, which occurs once at the beginning of FL. After each round, clients communicate the local model and the local $p(g \mid \mathbf{x}_{c,i})$ for all groups (only in the first round) to the server.
❺ After clients submit their local models and local $p(g \mid \mathbf{x}_{c,i})$ values, the server uses FedAvg to aggregate the local models into an updated global model. Additionally, the server computes the mixture of group likelihoods for all groups from the local likelihoods (we emphasize that this computation occurs only once, at the beginning of FL).
❻ Each client performs local training after the server distributes the updated global model copies and the mixture of likelihoods for all groups. The training uses our MWR approach to adjust group importance weights based on the mixture of likelihoods for all groups (§ 4.3).
❼ Each client computes the performance threshold for the best group and compares it with the best group's performance to evaluate MWR's effectiveness in mitigating group bias without compromising the best group's performance (§ 4.5).
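Step ❺'s server-side mixture computation can be sketched as follows (taking client priors proportional to local dataset sizes is our assumption; the paper only specifies the law of total probability):

```python
import numpy as np

def global_group_likelihoods(client_likelihoods, client_sizes):
    """Server-side mixture p(G=g|X) via the law of total probability.

    client_likelihoods: {client: per-group average likelihood vector}
    client_sizes: {client: number of local samples}, used as p(client).
    """
    total = sum(client_sizes.values())
    mixture = sum(
        (client_sizes[c] / total) * np.asarray(p, dtype=float)
        for c, p in client_likelihoods.items()
    )
    return mixture / mixture.sum()   # renormalize against numerical drift

mix = global_group_likelihoods(
    {"c1": [0.6, 0.4], "c2": [0.2, 0.8]}, {"c1": 100, "c2": 100}
)
```

With equal-sized clients this reduces to a simple average of the local likelihood vectors.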

Enabling Privacy-preserving Group Fairness
Our approach centers on weighting empirical risks with group importance weights, $w_g$, as shown in Equation 1. Calculating these weights is straightforward in centralized learning [20], where a global view of the data is available. In FL, however, this global view is lacking, so the calculation is not trivial: we must estimate $w_g$ while safeguarding client data privacy. Our solution addresses this by approximating the denominator of $w_g$, namely $p(G = g \mid X)$, through a mixture of group likelihoods across clients. Suppose $G = 1, \ldots, |G|$ represents the groups across clients in FL. Each client $c$ employs a multiclass logistic regression probabilistic model [2] to predict the likelihood of an input sample $\mathbf{x}_{c,i}$ belonging to a specific group $g$, where the model output $p_c(g \mid \mathbf{x}_{c,i})$ is a multinomial probability mass function [39]. Each client applies the softmax function to obtain group membership probabilities, ensuring that these probabilities are positive and sum to one. Clients share their group likelihood estimates with the server. The server then computes each group's global average likelihood from the per-client group average likelihood estimates and the law of total probability over the event space $\{c_1, c_2, \ldots, c_{|C|}\}$ of clients. To solve the group bias problem, we modify the MW algorithm and transform it into a constrained optimization problem that improves the performance of the worst-performing group. Algorithm 1 details the workings of the MW algorithm. During the local learning process, we assign to each client its groups and a set of $|G|$ classes for the underlying application. The optimization constraints comprise decisions made by both the local and global models for the groups assigned to clients, ensuring fairness in group classification. Using image features in the training dataset, we validate constraint satisfaction in each local training iteration and identify suitable groups. We then associate decisions made by each local model with a group empirical risk that quantifies
how well a decision made by the local model satisfies the constraints. Over time, we minimize the overall risk of the global model by ensuring that each local model incurs a low per-group risk. This involves tracking the global weight for each group and randomly selecting groups with probability proportional to their importance weights $w_g$. In each iteration, we update $w_g$ using the MW algorithm, multiplying the numerator $p(G = g)$ by a factor dependent on the risk of the associated group decision, while keeping the denominator $p(G = g \mid X)$ fixed, so that costly group decisions are penalized.
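A minimal sketch of one such update step (the exponential update factor and the step size `eta` are assumptions for illustration; Algorithm 1 may use a different risk-dependent factor):

```python
import numpy as np

def mw_update(numerators, denominators, risks, eta=0.01):
    """One multiplicative-weights step on the group importance weights.

    Only the numerator p(G=g) is rescaled, here by exp(eta * risk); the
    mixture denominator p(G=g|X) stays fixed, so groups with costly
    (high-risk) decisions end up with larger importance weights w_g.
    """
    numerators = np.asarray(numerators) * np.exp(eta * np.asarray(risks))
    numerators = numerators / numerators.sum()   # keep a valid distribution
    weights = numerators / np.asarray(denominators)
    return numerators, weights

num = np.full(3, 1 / 3)               # uniform prior over 3 groups
denom = np.array([0.5, 0.3, 0.2])     # fixed global mixture p(G=g|X)
num, w = mw_update(num, denom, risks=np.array([0.1, 0.9, 0.4]))
```

After the step, the group with the highest risk (group 2 here) holds the largest share of the numerator mass.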

Ensuring Optimality through Regularization
The MW algorithm maximizes worst-group performance by scaling the empirical risk and deep neural network weights. However, the weight magnitude does not ensure optimal risk-function convergence [6]. In our setup, model parameters $\theta$ are trained with cross-entropy loss and stochastic gradient descent (SGD) [5] optimization, converging toward the solution of the hard-margin support vector machine in the direction $\theta / \|\theta\|$ [65]. Introducing weights into the loss function may therefore introduce inconsistencies in the margin. Instead of directly applying importance weighting to the empirical risk, we aim to minimize the following objective for each client $c$:

$$\min_{\theta} \; \sum_{g \in G} w_g \, \hat{R}_{g,c}(h_\theta) + \lambda \|\mathbf{w}\|_1 \qquad (2)$$

Since the optimization problem with importance weighting is vulnerable to scaling of weights and biases, we introduce regularization on the norm of $w_g$ to increase the margin and mitigate the risk of its enlargement due to scaling, forming the basis of our Multiplicative Weight update with Regularization (MWR) algorithm.
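A sketch of the resulting client objective (the function name and plain-NumPy form are illustrative; in practice the weighted risk would be a differentiable loss over the model parameters):

```python
import numpy as np

def mwr_objective(per_group_risks, group_weights, lam=1e-5):
    """Client-side MWR objective: importance-weighted empirical risk plus
    an l1 penalty on the importance weights, keeping their norm (and hence
    the effective margin) from blowing up under scaling."""
    weighted_risk = float(np.dot(group_weights, per_group_risks))
    penalty = lam * float(np.abs(np.asarray(group_weights)).sum())
    return weighted_risk + penalty
```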

Bias Mitigation without Degrading High-Performing Groups

While MWR ensures group fairness, importance weighting approaches may exhibit unfairness by disproportionately focusing on the worst-performing groups, potentially degrading the performance of the best-performing groups in an attempt to reduce the variance in estimating their contributions to overall performance [13]. Practically, a bias-mitigation algorithm should achieve fairness without significantly degrading the performance of the best-performing groups. To address this, we propose a heuristic approach to reweighing the likelihood (group importance weight) associated with each data point belonging to group $G = g$ in the dataset. Suppose we have a set of unnormalized importance weights $w_1, w_2, \ldots, w_n$ corresponding to $n$ data points in a dataset, where each data point has an associated importance weight. We normalize these weights for each group, computing $\hat{w}_1, \hat{w}_2, \ldots, \hat{w}_{|G|}$ using:

$$\hat{w}_g = \frac{w_g}{\sum_{j=1}^{|G|} w_j} \qquad (3)$$

The rationale behind Equation 3 is to distribute emphasis evenly among the groups, preventing a scenario where a single group dominates the estimation due to an excessively high importance weight. Through weight normalization, we ensure that each group's contribution aligns more closely with its true importance or representation within the dataset.
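The Equation-3-style normalization is a one-liner; this sketch assumes the weights have already been aggregated per group:

```python
import numpy as np

def normalize_group_weights(raw_weights):
    """Rescale unnormalized per-group importance weights so they sum to
    one, preventing any single group from dominating the weighted risk
    estimate."""
    w = np.asarray(raw_weights, dtype=float)
    return w / w.sum()

w_hat = normalize_group_weights([4.0, 1.0, 1.0, 2.0])
```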

Satisfying Performance Thresholds
Finally, we establish a performance threshold for the best true positive rate (BTPR) to mitigate group bias without significantly compromising the BTPR. We denote the BTPR for a client $c$ as $B_c$ and the WTPR as $W_c$, and define the threshold on the best TPR as $T_{thresh}$. Our fairness-enforcement objective aims to minimize the gap between the best- and worst-performing groups while maintaining a specified level of TPR performance, as follows:

$$\alpha \, (B_c - W_c) \ge B_c - T_{thresh} \qquad (4)$$

Here $\alpha$ is a parameter governing the trade-off between group fairness and performance. Inequality 4 scales the difference between the BTPR and WTPR by $\alpha$ and compares it to the difference between the BTPR and the threshold. For each client, we rearrange Inequality 4 to obtain the minimum BTPR threshold, as expressed in Equation 5:

$$T_{thresh} \ge B_c \, (1 - \alpha) + \alpha \, W_c \qquad (5)$$
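A sketch of the threshold check, assuming the constraint takes the form α(BTPR − WTPR) ≥ BTPR − T (the helper names are ours):

```python
def min_btpr_threshold(btpr, wtpr, alpha):
    """Minimum admissible best-group TPR, rearranged from the constraint
    alpha * (BTPR - WTPR) >= BTPR - T, i.e. T >= BTPR*(1 - alpha) + alpha*WTPR.
    alpha near 0 protects the best group; alpha near 1 favors fairness."""
    return btpr * (1.0 - alpha) + alpha * wtpr

def within_threshold(current_btpr, baseline_btpr, wtpr, alpha):
    """Check that bias mitigation has not pushed the best group's TPR
    below the admissible minimum derived from the pre-mitigation BTPR."""
    return current_btpr >= min_btpr_threshold(baseline_btpr, wtpr, alpha)
```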

Evaluation
This section evaluates our MWR group-bias mitigation technique on four image classification datasets (CIFAR10, DIGITS, MNIST, and FashionMNIST).We benchmark our approach against standard bias mitigation techniques in FL.

Experiment Testbed
Our evaluation setup uses the same number of clients, data partitioning scheme, and other learning components (such as learning rate, train/test split, batch size, epochs, rounds) described in §3.1 unless stated otherwise.
Baseline. We evaluate our approach across four key categories, scrutinizing both bias reduction and overall model performance. The FL baseline category (FedAvg) represents a conventional learning scheme in FL. In the FL bias-reduction category, we include methods such as AFL [52], TERM [43], and GIFAIR-FL [73]. These methods employ empirical risk reweighting to mitigate bias and adapt the global model to diverse local data distributions. The FL heterogeneity category (FedNTD [41]) specifically addresses performance loss in FL models arising from data heterogeneity by managing global model memory loss. In the FL robustness category (SCAFFOLD [34]), the focus is on enhancing the resilience of FL models against outliers and noisy data, thereby mitigating the impact of irregularities in specific devices' local datasets. To ensure a fair evaluation across all baselines, we meticulously calibrate hyperparameters across datasets, guaranteeing convergence of the global model.
Hyperparameter Tuning for MWR. We use the same experimental setup as FedAvg, AFL, FedNTD, TERM, GIFAIR-FL, and SCAFFOLD. However, to apply the MWR update algorithm to the per-group loss, we set the update parameter in Algorithm 1 to different values in the set {0.01, 0.02, 0.001, 0.009, 0.0001}, based on the level of Gaussian noise in the data partitions. Finally, MWR uses an $\ell_1$ regularization parameter of 0.00001 for all datasets.

Efficacy and Robustness Analysis
We now assess the efficacy and robustness of our MWR group-bias mitigation technique against the baselines.

Effect on Group Bias.
We assess the efficacy of MWR's group-bias mitigation by: (i) evaluating the best- and worst-group performance (TPR), (ii) analyzing the TPR group variance per client, and (iii) examining the TPR discrepancy per client. This evaluation is conducted on four datasets, incorporating low-grade distortion to simulate prevalent real-world heterogeneity [30]. Table 1 presents the TPR, TPRSD, WTPR, and BTPR performance scores across the various bias mitigation techniques and datasets. Notably, among these techniques, MWR stands out by achieving significantly fairer outcomes for groups. Our algorithm substantially decreases TPRSD across most clients while maintaining a consistently high TPR. Importance weighting, especially when derived from feature characteristics, is powerful in mitigating biases caused by feature noise: if the bias is primarily driven by certain features, assigning appropriate weights to these features can help the model focus on relevant information and reduce the impact of noisy features, resulting in more consistent and equitable predictions.
Although AFL and FedNTD occasionally outperform MWR on the TPRSD metric, as seen for the DIGITS dataset's client 4 and the MNIST dataset's clients 4 and 5, the differences are marginal. Importance weighting is sensitive to distribution shifts in the feature space; where the distribution shifts significantly, the importance weights may be less effective. On the other hand, techniques such as FedNTD, through knowledge distillation, appear more robust to feature noise, as distillation transfers knowledge from a more complex model (teacher) to a simpler one (student), potentially leading to better generalization and a lower standard deviation in true positive rates across groups. Additionally, it is evident from Table 1 that MWR increases the WTPR for the group with the smallest TPR, accompanied by the smallest TPRD among the evaluated bias mitigation techniques.
Importance weights derived from image features capture the distinctive characteristics of different groups more effectively than other methods. This adaptability is crucial in mitigating bias, since it tailors the mitigation strategy to the specific features and challenges present in each group. Although TERM appears to outperform our proposed method on the minimax group fairness metric (WTPR) for the CIFAR10 dataset's clients 1, 2, and 3, this can be understood as a consequence of the reduction in TPR among privileged clients lacking local data with distortions, which elevates the lower TPR among disadvantaged clients affected by distortions. Importantly, the differences between the results are marginal, indicating closely competitive performance between the methods despite this disparity, while group fairness among clients is elevated.

Takeaway: MWR ensures fairness across groups and maintains predictive accuracy by using importance weights that prioritize the worst-performing groups. Its key strength lies in maintaining fairness without sacrificing performance, achieved through even distribution of importance weights among different groups.

Robustness of Bias Mitigation.
In our previous analysis, we added low-grade Gaussian noise to mimic noise in edge device images [47]. To further test MWR's resilience against increased feature heterogeneity, we raised noise levels in the segmented CIFAR10, MNIST, DIGITS, and FashionMNIST datasets to variances of 0.11, 1.10, 1.00, and 0.4, respectively. Model performance evaluation used the same fairness metrics as before. Table 2 displays the TPR, TPRSD, WTPR, and BTPR scores across the various bias mitigation techniques and datasets, exploring high-grade distortion scenarios in local data. Consistent with our earlier findings, MWR delivers significantly fairer outcomes across diverse groups. The table shows that MWR reduces TPRSD across most devices while maintaining high TPR. Compared with Table 1, MWR increases the WTPR for the lowest-TPR group, resulting in the minimal TPRD among bias mitigation techniques. This enhancement in WTPR for disadvantaged groups minimally affects high-performing groups' performance.
Although some bias mitigation techniques may slightly outperform MWR on the TPRSD and WTPR fairness metrics, this often occurs at the expense of decreased TPR for privileged clients not affected by distortions; that decrease is offset by an increase in the lower TPR among disadvantaged clients. Despite these differences, the results remain closely competitive among methods, indicating similar performance despite the disparity, while simultaneously improving group fairness among clients.
Takeaway. Our robustness analysis suggests that MWR stands out as a robust and fair approach even in scenarios with high-grade heterogeneity, showcasing its effectiveness in mitigating bias across diverse datasets and client groups.

Privacy Analysis
This section explores how differential privacy affects group fairness and performance in MWR, particularly in the scenario where local group probability distributions $p(G = g \mid \mathbf{x}_{c,i})$ are shared with the server to compute importance weights. Differential privacy is crucial for preserving the privacy of client metadata, preventing disclosure of sensitive details such as group selection probabilities. We use the MNIST and FashionMNIST datasets for our privacy budget analysis, maintaining consistency with the experimental setups and learning components detailed in §3.1. We introduce different levels of Laplace noise, controlled by the privacy budget $\epsilon$, to the local probability distributions. An $\epsilon$ value of 0.00 represents perfect differential privacy in our implementation of MWR.
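A sketch of how a client might privatize its likelihood vector before sharing it (the unit sensitivity, the clipping, and the handling of $\epsilon = 0$ as a degenerate "perfect privacy" case are our assumptions, not the paper's exact mechanism):

```python
import numpy as np

rng = np.random.default_rng(0)

def privatize_likelihoods(probs, epsilon):
    """Add Laplace noise to a client's local group-likelihood vector.

    Smaller epsilon means more noise (Laplace scale b = sensitivity /
    epsilon, assuming unit sensitivity). For epsilon == 0 we share an
    uninformative uniform vector instead of adding unbounded noise.
    """
    probs = np.asarray(probs, dtype=float)
    if epsilon == 0.0:
        return np.full_like(probs, 1.0 / len(probs))
    noisy = probs + rng.laplace(0.0, 1.0 / epsilon, size=len(probs))
    noisy = np.clip(noisy, 1e-8, None)
    return noisy / noisy.sum()   # project back onto the probability simplex
```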
Figures 5 to 8 show the impact of varying levels of Laplace noise on the group-fairness metrics (WTPR, TPRSD, and TPRD) and group performance (TPR) in MWR, addressing bias in local data with different levels of feature noise. In Figures 5a to 7b, we see that using a privacy budget ($\epsilon \in \{0.0, 0.4, 0.8\}$) for metadata exchange maintains fairness metrics similar to deploying MWR without privacy ($\epsilon \to \infty$) on MNIST and FashionMNIST. This is evident from the minimal variations in WTPR, TPRSD, and TPRD across all clients (with high and low feature heterogeneity) under all privacy budgets. Moreover, the privacy budget ensures fairness while preserving the best and worst TPR performance. This aligns with the fairness guarantee of MWR, as the privacy budget values ($\epsilon \in \{0.0, 0.4, 0.8\}$) fall within a range that provides algorithmic fairness, as noted in [1]. Our privacy analysis underscores that our method ensures client privacy through differential privacy on shared metadata without significantly affecting bias or accuracy.

Takeaway. MWR demonstrates the feasibility of preserving sensitive information while effectively reducing group bias.

Fairness Budget Analysis
MWR incorporates a fairness budget to regulate importance weight adjustments for fairness. This control mechanism adjusts importance weights based on past group performance (group loss). We assess the impact of the fairness budget on the group-fairness metrics (WTPR, TPRSD, TPRD) using the MNIST and FashionMNIST datasets, setting it to different values (−0.009, −0.003, −0.001, −0.0002). Tables 3 and 4 show how the fairness budget affects both group fairness and group performance (TPR) with MWR.
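The role the fairness budget plays can be sketched as the step size of a multiplicative weights update over group importance weights. This is our own minimal illustration, not the paper's exact update rule: the function name and normalization are assumptions, and only the budget values come from the experiments above.

```python
import numpy as np

def mw_update(weights, group_losses, budget):
    """One multiplicative-weights step over group importance weights
    (illustrative sketch; not the paper's exact rule).

    budget plays the role of the fairness budget: values closer to zero
    (e.g. -0.0002) give gradual adjustments, while larger magnitudes
    (e.g. -0.009) adapt faster to the worst-performing groups.
    """
    weights = np.asarray(weights, dtype=float)
    losses = np.asarray(group_losses, dtype=float)
    # With budget < 0, groups with higher loss receive larger weights.
    updated = weights * np.exp(-budget * losses)
    return updated / updated.sum()  # keep the weights a distribution

w = np.array([0.25, 0.25, 0.25, 0.25])
losses = np.array([0.2, 0.9, 0.4, 1.5])   # group 3 performs worst
w_fast = mw_update(w, losses, budget=-0.009)
w_slow = mw_update(w, losses, budget=-0.0002)
```

Under this reading, the larger-magnitude budget shifts weight toward the worst-performing group in fewer rounds, matching the faster convergence reported below.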
Increasing the magnitude of the fairness budget improves fairness guarantees, leading to better WTPR, TPRSD, and TPRD due to faster convergence and adaptation to fairness issues. Conversely, smaller values result in more gradual adjustments, slowing the algorithm's fairness improvements. This experiment is

Figure 2. Varied noise levels in the CIFAR10 and DIGITS datasets. The notation "noise = " denotes the introduction of Gaussian noise with the stated variance, applied to selected clients in CIFAR10 and in DIGITS.
of 0.46 indicates strong correlation and uniformity among clients with IID features. Conversely, in Figure 3b, clients with non-IID features show a correlation of −0.14, suggesting dissimilarity.
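The correlation measure behind these numbers can be sketched as a Pearson correlation between two clients' flattened local gradients. The synthetic data below is our own illustration of the IID vs. non-IID contrast, not the paper's experiment; the function name and the noise model are assumptions.

```python
import numpy as np

def gradient_correlation(grads_a, grads_b):
    """Pearson correlation between two clients' flattened local gradients,
    a rough proxy for the IID vs. non-IID behaviour shown in Figure 3."""
    a, b = np.ravel(grads_a), np.ravel(grads_b)
    return float(np.corrcoef(a, b)[0, 1])

rng = np.random.default_rng(42)
base = rng.normal(size=1000)
# IID-like clients: both gradients track the same underlying signal.
iid_corr = gradient_correlation(base + 0.5 * rng.normal(size=1000),
                                base + 0.5 * rng.normal(size=1000))
# Non-IID-like clients: heavy feature noise decorrelates the gradients.
noniid_corr = gradient_correlation(base, rng.normal(size=1000))
```

A strongly positive correlation indicates aligned client updates, while a near-zero or negative value signals the gradient conflict that feature heterogeneity introduces.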

Figure 3. Gradient distribution in a fully connected layer on the CIFAR10 dataset. The red and blue bars depict the local gradient distributions on client 1 and client 2, respectively. In (a), the distribution of local gradients across the two clients is shown in IID settings. In (b), the distribution is shown in non-IID settings, with Gaussian noise introduced on the non-IID clients.

Algorithm 1. MW group-fairness in Federated Learning

Figure 4. Overview of the proposed approach.

Figure 5. Examining the performance trade-off in MWR between privacy and accuracy across various levels of differential privacy (DP) noise on FashionMNIST. In (a), a base Gaussian noise with a variance of 0.3 is introduced to all methods, while in (b), Gaussian noise with a variance of 0.4 is applied to all methods.

Figure 7. Analyzing the privacy-bias trade-off in MWR across differential privacy (DP) noise levels on FashionMNIST. (a) introduces a base Gaussian noise with a variance of 0.3, and (b) applies Gaussian noise with a variance of 0.4. Shaded areas represent the deviation (TPRSD).

Table 1. Performance evaluation of bias mitigation techniques across various datasets and benchmark models under low-grade noise. Symbols used: ↑ indicates that higher values are more desirable, while ↓ indicates that lower values are more desirable. For each client across each benchmark in a particular dataset, * signifies the best TPRD; ⊙ designates the best TPRSD; • represents the best WTPR; and ▷ indicates the best BTPR. (Note: on the DIGITS dataset, training involves only 4 clients, reflecting its composition of merely 4 heterogeneous datasets.)

Table 2. Performance evaluation of bias mitigation techniques across various datasets and benchmark models under low-grade noise. Symbols used: ↑ indicates that higher values are more desirable, while ↓ indicates that lower values are more desirable. For each client across each benchmark in a particular dataset, * signifies the best TPRD; ⊙ designates the best TPRSD; • represents the best WTPR; and ▷ indicates the best BTPR. (Note: on the DIGITS dataset, training involves only 4 clients, reflecting its composition of merely 4 heterogeneous datasets.)

6 Conclusion and Future Work
This study explores FL group bias in decentralized, heterogeneous edge deployments, where devices capture data with diverse features often influenced by noise. Our framework, MWR, uses importance weighting and average conditional probabilities based on data features to improve group fairness in FL across varied local datasets. Heterogeneous features in local group data can bias FL models against minority clients, impacting specific groups on those clients. MWR addresses this bias by optimizing worst-performing groups without compromising the best-performing ones, compared to other FL methods. While effective, MWR relies on group information to mitigate bias across clients, which can lead to persistent loss discrepancies under severe feature heterogeneity. Future work aims to incorporate methods for estimating and denoising data features to reduce noise without compromising data quality. MWR is highly adaptable and can be extended to complex applications beyond image classification. It can optimize diagnostic outcomes in healthcare datasets, handle multimodal and text-based applications such as next-character prediction and image captioning, and mitigate bias in emotion prediction applications within FL settings, ensuring equitable outcomes across diverse groups.

Table 3.
Impact of the fairness budget (−0.003, −0.009) under base Gaussian noise with variances 0.3 and 0.4.

Table 4.
Impact of the fairness budget on the TPR, TPRD, and WTPR on MNIST. A base Gaussian noise with a variance of 0.8 and 1.1 is introduced to MWR in (a) and (b), respectively. ↑: higher is best, ↓: lower is best.