FLIPS: Federated Learning using Intelligent Participant Selection

This paper presents the design and implementation of FLIPS, a middleware system to manage data and participant heterogeneity in federated learning (FL) training workloads. In particular, we examine the benefits of label distribution clustering on participant selection in federated learning. FLIPS clusters parties involved in an FL training job based on the label distribution of their data a priori, and during FL training, ensures that each cluster is equitably represented in the participants selected. FLIPS can support the most common FL algorithms, including FedAvg, FedProx, FedDyn, FedOpt and FedYogi. To manage platform heterogeneity and dynamic resource availability, FLIPS incorporates a straggler management mechanism to handle changing capacities in distributed, smart community applications. Privacy of label distributions, clustering and participant selection is ensured through a trusted execution environment (TEE). Our comprehensive empirical evaluation compares FLIPS with random participant selection, as well as three other "smart" selection mechanisms (Oort, TiFL and gradient clustering), using two real-world datasets, two benchmark datasets, two different non-IID distributions and three common FL algorithms (FedYogi, FedProx and FedAvg). We demonstrate that FLIPS significantly improves convergence, achieving higher accuracy by 17-20% with 20-60% lower communication costs, and these benefits endure in the presence of straggler participants.


Introduction
Federated Learning (FL) [45] is the process by which multiple participants (parties) collaborate to train a common machine learning (ML) model, without sharing data among themselves or with a centralized cloud-hosted machine learning service. FL allows parties to retain private data within their controlled domains. Only model updates are typically shared with a central aggregation server hosted by one of the parties or a cloud service provider. This, along with transformations applied to model updates (e.g., homomorphic encryption [40] and addition of noise for differential privacy [5]), makes FL privacy preserving.
Why FL? From a participant perspective, a key goal of FL is to access diverse training data to enhance the robustness of machine learning models by promoting generalization, outlier detection and bias mitigation. FL allows the training of machine learning models using data distributed across multiple devices or edge nodes from different demographics and geographical locations. This distributed nature allows for diverse datasets, as each device may have unique data reflecting various user behaviors, preferences, or contexts. The privacy-preserving aspect encourages a wider range of participants to contribute their data, including those who may be concerned about sharing personal information or those restricted by regulations like HIPAA and GDPR, further leading to a more representative and inclusive dataset.
Benefits of Diverse Datasets: A diverse training dataset provides a broader representation of the real-world scenarios and variations that the model may encounter during inference. By exposing the model to a wide range of data samples, including different classes, variations, edge cases and outliers, the model can learn more generalized patterns and make better predictions on unseen data. Outliers can be valuable in identifying rare events or unusual patterns that may not be present in a homogeneous dataset. Also, by including diverse samples that represent different demographics, backgrounds, and perspectives, models can learn to be more equitable and avoid perpetuating bias. This enables the model to make predictions that are more representative and fair for a broader range of individuals.
Non-IID Data in FL. The diversity of real-world entities and their private data also implies that FL techniques and algorithms have to be designed to handle non-IID (non-independent and identically distributed) data. Parties not only have different data items, but also often have wildly different types of data items, corresponding to different labels. Several leading FL researchers [84,45] have noted that the presence of IID data is the exception rather than the norm.
Intermittent Participation. FL, in general, is characterized by intermittent participation, which means that for every FL round, each party trains at its convenience, or when feasible. This may be when devices are connected to power in the case of mobile phones, tablets and laptops (FL over edge devices), or when (local) resource utilization from other computations is low and there are no pending jobs with higher priority (both in edge and datacenter use cases). The aggregator expects to hear from the parties eventually, typically over several minutes or hours. Parties in FL jobs are often highly unreliable and are expected to drop and rejoin frequently.
Participant Selection: Due to the intermittent nature of parties, existing FL algorithms like FedAvg, FedProx, FedMA, FedDyn, etc. only select a subset of parties in each round, often employing randomization to select parties [84,45]. Random selection eventually offers each party an opportunity to participate in the FL job, but does not take into account the type of data present at each party, and does not ensure that parties with diverse datasets are selected in each round. There is also significant empirical evidence that non-IID data combined with random selection significantly increases the time taken for models to converge [62,33,53] (we reproduce some of these results in Section 5).
There is increasing recognition that random selection is suboptimal for other reasons as well. The selected participant could have compute or communication constraints and might actually not be able to participate in the round. There is existing research on selecting participants for each round based on the ability to participate, amount of data present at each participant, history of reliable participation, and communication constraints [44,51,28]. But said research has not considered the label distribution among participants. This is unfortunate because it is vital for model generalization in FL to equitably consider all participants, including outliers.
This paper makes the following contributions:
• FLIPS, a middleware for effective management of data and platform heterogeneity in FL using clustering techniques based on label distributions.
• An algorithm to use the generated clusters to select diverse participants at each round of an FL training job, ensuring that parties are equitably represented while offering each party a fair opportunity to participate.
• A private mechanism using trusted execution environments (TEEs) to cluster participants in FL jobs based on the label distributions of their data and identify the diversity in their data in a secure manner.
• A thorough empirical evaluation comparing FLIPS with the predominant random participant selection, as well as two other "smart" selection mechanisms, Oort [51] and gradient clustering [29], using two real-world datasets, two benchmark datasets, two different non-IID distributions and three common FL algorithms (FedYogi, FedProx and FedAvg). We demonstrate that FLIPS significantly improves convergence, achieving superior accuracy with much lower communication costs, and these benefits endure in the presence of stragglers.

Federated Learning and Heterogeneity
A typical FL setting consists of parties with local data and an aggregator server (when the number of parties is small) or service (e.g., a microservice using Apache Spark to aggregate model updates when the number of parties is large) to orchestrate FL. An FL job proceeds over several rounds, also called synchronization rounds. An aggregator typically coordinates the entire FL job, in addition to aggregating model updates and distributing the updated model back to the parties. Coordination includes agreeing on the following FL job parameters before the job starts: (i) model architecture (ResNet, EfficientNet, etc.), (ii) FL algorithm (FedAvg, FedYogi, etc.) and any algorithm-specific parameters like the minimum number of parties required for each round, (iii) how to initialize the global model (whether randomly, or from existing pre-trained models), (iv) hyperparameters to be used (batch size, learning rate, etc.), and (v) termination criteria, i.e., whether the FL job ends after a specific number of rounds, or when a majority of parties decide that the model is satisfactory (e.g., has reached a desired level of accuracy).

Common FL Algorithms
FL algorithms [84] primarily differ in how model updates are aggregated and the mathematical formula (called the OPTIMIZER in machine learning literature) used to apply the aggregated model update to the global model.
For FedAvg [64], the aggregator selects a random subset $S^{(r)} \subset S$ of parties for every round $r$. The aggregator in FedAvg computes the weighted average of all participant updates (gradients), $x^{(r)} = \frac{1}{N} \sum_{i \in S^{(r)}} n_i \, x^{(r,i)}$, to obtain the global model update $x^{(r)}$, and updates the global model (for the next round) $m^{(r+1)}$ using SGD as the OPTIMIZER. This process proceeds for a set number $R$ of rounds or (less typically) until a majority of the parties vote to terminate. The term $n_i$ in the weighted average is the number of training samples at party $i$, and $N$ is the total number of training samples involved in the round, i.e., $N = \sum_{i \in S^{(r)}} n_i$. FedAvg was the first widely deployed FL algorithm, but it does not result in an optimized global model when the data distribution is not IID.
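The weighted average used by FedAvg can be illustrated with a minimal sketch. The function name `fedavg_aggregate` and the flat-list representation of a model update are assumptions made for illustration only; a real aggregator would operate on tensors and then hand the result to the server OPTIMIZER.

```python
from typing import List, Sequence


def fedavg_aggregate(updates: Sequence[Sequence[float]],
                     num_samples: Sequence[int]) -> List[float]:
    """Weighted average of party updates, with weights n_i / N.

    `updates[i]` is the (flattened) update from party i and `num_samples[i]`
    is n_i, the number of training samples held by that party.
    """
    if not updates or len(updates) != len(num_samples):
        raise ValueError("need at least one update and one sample count per update")
    total = sum(num_samples)  # N: samples across the parties in this round
    dim = len(updates[0])
    aggregated = [0.0] * dim
    for update, n_i in zip(updates, num_samples):
        for j in range(dim):
            aggregated[j] += (n_i / total) * update[j]
    return aggregated


# Two parties; the second holds three times as much data as the first.
x_r = fedavg_aggregate([[1.0, 0.0], [0.0, 1.0]], [100, 300])
print(x_r)  # [0.25, 0.75]
```

Because the weights depend only on sample counts, a party with many samples of a single class dominates the average, which is one way to see why non-IID data is problematic for FedAvg.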
When the entire set of parties $S$ is used in every round, and SGD is the OPTIMIZER, we get FedSGD. FedAdam and FedAdagrad are the same as FedAvg, with Adam [49] and AdaGrad [27] as the OPTIMIZER, respectively. FedProx [55] is a variation of FedAvg that aims to produce better global models in the presence of non-IID data. FedProx includes a proximal term in the OPTIMIZER with penalty parameter $\mu$. If $F_k(x^{(r,k)})$ is the local loss function at a party $k$ at round $r$, then the local loss function in FedProx becomes $F_k(x^{(r,k)}) + \frac{\mu}{2} \| m^{(r)} - x^{(r,k)} \|^2$. This brings the model $x^{(r,k)}$ closer to $m^{(r)}$ at each party $k$. By requiring the updates to be close to the starting point, a large $\mu$ could potentially hinder convergence, but a small $\mu$ might have no effect.
FedYogi [51,73], which has been shown to outperform FedAvg and FedProx [51] with non-IID data, uses an adaptive optimizer to update the global model. The server OPTIMIZER maintains a per-parameter learning rate, updated based on the history of gradients $gr^{(r,k)} = x^{(r,k)} - m^{(r)}$. This allows the server OPTIMIZER to adapt to the local data distributions of the clients. FedYogi introduces a moving average of gradients, $m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, gr^{(r,k)}$, and a moving average of squared gradients, $v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, (gr^{(r,k)})^2$, with two momentum hyperparameters $\beta_1$ and $\beta_2$. The global model $m^{(r)}$ is updated to $m^{(r)} - lr \cdot \frac{m_t}{\sqrt{v_t} + eps}$, where $lr$ is the learning rate and $eps$ is a small constant to prevent division by 0.

Data Heterogeneity : Dealing with Non-IID Data
Despite advances in FL, the ability of current techniques to deal with variabilities and diversity in real-world data is still limited. The term non-IID (non-independent and identically distributed) data distributions in the context of FL refers to the situation in which the data on each device or node taking part in the FL process is not independently and identically distributed across all devices. This can happen when data is collected from several sources or under various circumstances, resulting in variances in the data and/or label distribution on each device.
In FedAvg, FedProx and FedYogi, at each round $r$, the parties $S^{(r)}$ are sampled randomly from the given pool. $|S^{(r)}|$ is typically small compared to $|S|$, typically less than 20% in real deployments [11,37]. As a result, some rounds select parties with similar data distributions, which leads to class imbalance, where certain classes of data are underrepresented. This leads to model divergence in rounds when parties with diverse data are chosen, and causes the model to be biased towards the overrepresented classes. To understand why this happens, it is helpful to consider the centralized learning setting, where training typically makes a pass over the entire dataset in every training round (epoch), so outlier and diverse data are considered in every training epoch. With random selection, outlier data may be omitted continuously for several rounds in FL, especially with typical values of $|S^{(r)}|$.
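To make the omission of outlier data concrete, the following sketch computes the probability that uniform random selection (without replacement) picks no party from a minority group in a round. The numbers used (200 parties, 10 minority parties, 20 parties per round) are hypothetical values chosen for illustration, and treating rounds as independent draws is an assumption.

```python
from math import comb


def prob_no_minority_selected(total_parties: int,
                              minority_parties: int,
                              per_round: int) -> float:
    """Probability that a uniformly random round contains no minority party.

    This is the hypergeometric probability of drawing zero minority parties
    when sampling `per_round` parties without replacement.
    """
    majority = total_parties - minority_parties
    return comb(majority, per_round) / comb(total_parties, per_round)


# Hypothetical setting: 200 parties, 10 of which hold the rare labels,
# and 20 parties (10% of the pool) selected per round.
p_round = prob_no_minority_selected(200, 10, 20)
# Probability that the minority is skipped for 3 consecutive rounds,
# assuming each round is sampled independently.
p_three_rounds = p_round ** 3
print(f"one round: {p_round:.3f}, three rounds in a row: {p_three_rounds:.3f}")
```

Under these assumed numbers, roughly one round in three contains no minority party at all, which is the situation a selection scheme that guarantees per-cluster representation is meant to avoid.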
To further illustrate the significance, we consider a real-world use case from Senior Care focusing on smartspaces and assisted living (name hidden for anonymity). One application investigated for FL is arrhythmia detection using ECG signals [65] from wearables, where data exhibits non-IIDness, as more data points are recorded for normal heartbeats. Abnormal heartbeats are recorded in devices worn by people with heart ailments, a small fraction of all the parties in FL [65]. With random selection in any FL algorithm, there is always a higher chance of selecting a party with majority Normal beats at any round, biasing the model towards classifying most heartbeats as Normal.

Platform Heterogeneity : Stragglers
In federated learning (FL) deployments in the real world, platform heterogeneity plays a significant role in the convergence of the global model. Platform heterogeneity across different parties causes some of them to be stragglers, which exhibit intermittent failure. Stragglers are devices that take longer than expected in an FL environment to complete their local training. These devices have the potential to stall the FL process and perhaps fail. Stragglers can appear in real-world FL deployments for a variety of reasons. Among the most common explanations are:
• Data transfer between devices may experience delays due to network congestion. As a result, some devices cannot get the information they require to finish a task as quickly as they should.
• Stragglers can also be caused by device faults. A device might crash or run out of battery life, preventing it from finishing a task.
• Devices deployed in challenged settings may be more likely to have restricted resources, such as memory or computing power. The workload of an FL task may be too much for these devices to handle quickly.
One of the major roadblocks in FL is straggler parties among the selected parties $S^{(r)}$. These parties stall the overall FL training, as they do not communicate their models within the given time threshold for local training. This results in the stragglers' data being underrepresented when the global model is aggregated.

Security and Privacy of FL
Recent research [34,26,96,31,90,97]  HE (e.g., [68,14]) for FL involves parties sharing a common public/private keypair. Parties encrypt the model updates before transmitting them to the aggregator, which performs the aggregation computation on the encrypted data. Aggregated model updates can then be decrypted by the parties. HE does not change model utility, but is computationally expensive, by two or three orders of magnitude even with the use of specialized hardware [92,81,82], and also results in a significant increase in the size of the model update (e.g., 64× for Paillier HE [92], which is sufficient for many FL algorithms). While HE is practical for FL in cross-silo datacenter/cloud settings where its latency and bandwidth requirements can be accommodated, the need for all participants to share a common keypair makes it impractical for large-scale settings.
Differential privacy [5,77,79,89] is a statistical technique that adds controlled noise to the model updates before aggregation. This noise ensures that individual updates do not reveal sensitive information about the data used for training, defeating data reconstruction and membership inference attacks. Some techniques also clip the model updates (gradients) to a predefined range before the addition of noise. This further limits the information that can be extracted from the updates. By carefully controlling the amount of noise, differential privacy provides a trade-off between privacy and utility. However, this is non-trivial, and model utility is very sensitive to the choice of hyperparameters, which is difficult to optimize in large-scale FL settings.
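The clip-then-add-noise step can be sketched as follows. This is an illustrative example only: clipping by L2 norm and using Gaussian noise are assumptions (one common variant), the function name `privatize_update` is hypothetical, and calibrating `noise_std` to a formal privacy budget is deliberately out of scope here.

```python
import math
import random
from typing import List, Optional, Sequence


def privatize_update(update: Sequence[float],
                     clip_norm: float,
                     noise_std: float,
                     rng: Optional[random.Random] = None) -> List[float]:
    """Clip an update to an L2 norm of at most `clip_norm`, then add noise.

    Assumption: L2-norm clipping with zero-mean Gaussian noise of standard
    deviation `noise_std` on every coordinate.
    """
    rng = rng or random.Random()
    norm = math.sqrt(sum(v * v for v in update))
    # Scale down only when the update exceeds the clipping bound.
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [v * scale + rng.gauss(0.0, noise_std) for v in update]


# A party clips its update to norm 1.0 and adds noise before sending it.
noisy = privatize_update([3.0, 4.0], clip_norm=1.0, noise_std=0.1,
                         rng=random.Random(0))
print(noisy)
```

A larger `noise_std` hides more about an individual update but degrades the aggregated model, which is the privacy/utility trade-off described above.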
Secure Multi-Party Computation (SMPC) protocols [48,22,23,40,13] allow multiple parties (devices in this context) to perform computations on their inputs while keeping those inputs private. In FL, SMPC is used to aggregate model updates securely. Each party encrypts its update, and multiple parties perform computations on the encrypted updates without revealing the raw data. The result is an aggregated update that can be used to update the global model. Many SMPC protocols also suffer from drawbacks similar to those of HE: increased communication and computation time (lower in magnitude than HE, but still significant) and the need for effective key distribution.
A trusted execution environment (TEE) [1,2] is a secure area of a main processor that provides security features for isolated execution and guarantees the integrity of applications executing within, along with the confidentiality of their data assets. ARM TrustZone [4] and Intel SGX [3] are examples of TEEs. For aggregation in FL, they are attractive because their computational overhead is low, and their computations and software can be audited by participants with the help of attestation services. At least one FL system deployed at scale, Papaya [37], uses TEEs for aggregation.

FLIPS: Design
Participant selection is a key challenge in FL. It is the process of choosing which devices will participate in each round of training, and is predominantly random [55,83,11,84]. There is existing research on participant selection to optimize communication costs and computation limitations, which does not consider parties' data distribution and data diversity. For example, [36] models the client selection process as a Lyapunov optimization problem. The authors propose a C2MAB-based method to estimate the model exchange time between each client and the server. They then design an algorithm called RBCS-F to solve the problem. VF-PS [43] is a framework for selecting important participants in vertical FL (VFL) efficiently and securely. It works by estimating the mutual information between each participant's data and the target variable and then selects the most important participants based on their scores. To ensure efficiency, VF-PS uses a group testing-based search procedure. To ensure security, it uses a secure aggregation protocol. VF-PS achieves the target accuracy faster than training a naive VFL model. FedMCCS [6] addresses challenges in using FL with IoT devices. FedMCCS considers the CPU, memory, energy, and time of the client devices to predict whether they are able to perform the FL task. In each round, FedMCCS maximizes the number of clients while considering their resources and capability to train and send the needed updates successfully. FedMCCS outperforms other approaches by reducing the number of communication rounds to reach the intended accuracy, maximizing the number of clients, ensuring the least number of discarded rounds and optimizing the network traffic.
[67] takes a similar approach to FedMCCS, prioritizing resource availability. Another approach considers data valuation for compensating valuable data owners [87]. They use the Shapley value as a fair allocation mechanism that assigns a value to each data source based on its contribution to the model's performance to enhance system robustness, security, and efficiency; such mechanisms could be used for aiding participant selection. [18] proposes Power-of-Choice, a communication- and computation-efficient client selection framework, which randomly selects a fraction of participants in each round, but biases selection towards those with higher local losses. [18] proves theoretically and empirically that this biased selection leads to faster convergence. Oort [51] takes a similar approach.
In this section, we introduce FLIPS, our approach to intelligent participant selection that improves model convergence, addresses diversity in datasets and incurs low communication overheads. First, FLIPS mitigates the above-mentioned class imbalance issue (Algorithm 1) and improves feature and participant representation in FL by selecting parties that are likely to have dissimilar data at each FL round. The techniques implemented within FLIPS are based on one of federated learning's core goals: to increase the diversity of data and ensure that the global model in each FL round does not overfit local data.

Finding Similar Parties using Clustering
The objective of FLIPS is to identify sets of similar parties by measuring the label distribution of each party and using it as a semantic representation of a party's local dataset. There are $N$ parties, $p_1, p_2, \ldots, p_N$, with datasets $d_1, d_2, \ldots, d_N$ and label distributions $ld_1, ld_2, \ldots, ld_N$, respectively. The label distribution vector at party $p_i$ with dataset $d_i$ is $ld_i = \{l_1, l_2, \ldots, l_g\}$, where $l_j$ is the number of data points with the $j$-th label present at the party and $g$ is the number of labels in the dataset. The set of label distributions for all $N$ parties is denoted as $LD = \{ld_1, ld_2, \ldots, ld_N\}$.
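As a minimal illustration of the label distribution vector defined above, the following sketch counts labels in a party's local dataset. The helper name `label_distribution` and the toy labels are assumptions for illustration; the AAMI classes (N, S, V, F, Q) are used here only as an example label space.

```python
from collections import Counter
from typing import List, Sequence


def label_distribution(labels: Sequence[str],
                       label_space: Sequence[str]) -> List[int]:
    """Return ld_i: the count of local data points for each of the g labels.

    `label_space` fixes the order of the g labels so that vectors from
    different parties are comparable coordinate by coordinate.
    """
    counts = Counter(labels)
    return [counts[label] for label in label_space]


# A toy party with five heartbeats, mostly normal (N).
ld_i = label_distribution(["N", "N", "V", "N", "S"], ["N", "S", "V", "F", "Q"])
print(ld_i)  # [3, 1, 1, 0, 0]
```

Only this vector of counts, not the data itself, would need to be shared for clustering.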
Our next step is to group the label distributions from various parties into non-intersecting subsets that are similar.
Here, we define a similarity metric between subsets that is based on the average distance between all objects in a given subset and the average distance between subsets. Let $S_m$ be the set of all possible subsets of $LD$ of size $m$, where $m \in [1, N]$. Hence, there are $\binom{N}{m}$ subsets within $S_m$. Let $L_i = \{ld_a, ld_b, \ldots\}$ be a subset in $S_m$; $\Delta(L_i)$ is the average Euclidean distance between all objects in the set $L_i$. Given two subsets $L_i$ and $L_j$, $\delta(L_i, L_j)$ is the average Euclidean distance between objects in $L_i$ and those in $L_j$.
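The two distances can be sketched directly from their definitions. Two details are assumptions made for this illustration: the intra-subset average is taken over unordered pairs of distinct objects, and a subset with a single object is given an intra-subset distance of 0. The function names are hypothetical.

```python
from itertools import combinations, product
from math import dist
from typing import Sequence

Vector = Sequence[float]


def intra_distance(subset: Sequence[Vector]) -> float:
    """Delta(L_i): average Euclidean distance over pairs within one subset."""
    pairs = list(combinations(subset, 2))
    if not pairs:  # a single object has no pairs; treated as 0 by assumption
        return 0.0
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


def inter_distance(first: Sequence[Vector], second: Sequence[Vector]) -> float:
    """delta(L_i, L_j): average Euclidean distance across two subsets."""
    pairs = list(product(first, second))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


L_i = [(0, 0), (0, 2)]
L_j = [(3, 4)]
print(intra_distance(L_i))              # 2.0
print(inter_distance([(0, 0)], L_j))    # 5.0
```

A good grouping of parties has small intra-subset distances and large inter-subset distances.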
The idea behind finding similar parties is to find $k$ disjoint subsets across all $S_m$, where $m \in [1, N]$, such that condition (1) holds. Note that this is a subset enumeration problem, where we have to find $k$ subsets across all $S_m$ subject to condition (1). This problem is known to be NP-complete [80]. There are several heuristics to solve this problem; we use K-Means [59] clustering, which obtains a $k$-partition, where $k$ is unknown, across all $S_m$, denoted by clusters $C_1, C_2, \ldots, C_k$. Here, in the K-Means objective, $\omega_{xy}$ is a binary variable that indicates whether the $x$-th data point is assigned to the cluster $C_y$, whose centroid is $c_y$. We opted for K-Means due to its simplicity and lower time complexity. K-Means clustering has a time complexity of $O(N k I d)$, where $N$ is the number of data points, $k$ is the number of clusters, $I$ is the number of iterations, and $d$ is the number of dimensions. This makes it a suitable choice for resource-limited settings. Furthermore, we use k-means++ [9] to initialize the centroids in K-Means clustering. This has been demonstrated to scale to millions of data points, i.e., parties [10]. However, K-Means does require the number of clusters $k$ to be known beforehand, which is a problem in FL. The number of unique label distributions in the parties' datasets is unknown while performing clustering, as each party's data is kept private. To find the optimal number of clusters $k$, which is analogous to the number of unique label distributions, we use a purity metric called the Davies-Bouldin index (dbi) [24], which is the ratio of the intra-cluster distance to the inter-cluster distance and is similar to (1). The optimal $k$ is determined using dbi, the Davies-Bouldin index.
When $k$ is small, the clusters cannot accurately represent the unique label distributions, impacting FL performance and cost because there is no equitable representation at each round. When $k$ is large, clustering leads to overfitting; the clusters generated are sparse, again impacting FL performance and cost. To determine the optimal cluster size in FLIPS empirically, we experiment with different cluster sizes in succession. Each experiment is repeated $T = 20$ times (because K-Means is sensitive to the centroid initialization), which gives us $T$ different dbi values for each cluster size $k \in \{2, \ldots, K\}$, where $K = N$; we average these values for each $k$. The cluster size $k$ at which there is a (first) sharp change in the slope of the curve (elbow point) is chosen as the optimal cluster size. As illustrated in Figure 2, the cluster size at which there is a sharp change in slope for $k$ vs. dbi is 10. Hence, we choose 10 as the cluster size for K-Means.
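For readers who want to see how a dbi value is obtained for one clustering, the sketch below computes the Davies-Bouldin index from a set of label distribution vectors and their cluster assignments. It uses the standard textbook definition (for each cluster, the worst-case ratio of summed within-cluster scatter to centroid distance, averaged over clusters), which is an assumption about the exact variant used here. It also assumes at least two non-empty clusters with distinct centroids. The function names are hypothetical.

```python
from math import dist
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def centroid(members: List[Vector]) -> Tuple[float, ...]:
    """Coordinate-wise mean of the vectors in one cluster."""
    return tuple(sum(coords) / len(members) for coords in zip(*members))


def davies_bouldin_index(points: Sequence[Vector],
                         assignment: Sequence[int]) -> float:
    """Standard Davies-Bouldin index; lower means tighter, better-separated clusters."""
    clusters: Dict[int, List[Vector]] = {}
    for point, cluster_id in zip(points, assignment):
        clusters.setdefault(cluster_id, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the index needs at least two non-empty clusters")
    centers = {c: centroid(m) for c, m in clusters.items()}
    # Scatter: average distance of a cluster's members to its centroid.
    scatter = {c: sum(dist(p, centers[c]) for p in m) / len(m)
               for c, m in clusters.items()}
    worst_ratios = [
        max((scatter[i] + scatter[j]) / dist(centers[i], centers[j])
            for j in clusters if j != i)
        for i in clusters
    ]
    return sum(worst_ratios) / len(worst_ratios)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(davies_bouldin_index(points, [0, 0, 1, 1]))  # well-separated grouping
print(davies_bouldin_index(points, [0, 1, 0, 1]))  # poor grouping, much larger value
```

In the procedure described above, this value would be computed for each of the $T$ K-Means runs at a given $k$ and averaged before looking for the elbow point.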

Intelligent Participant selection
Given a set of clusters of parties, $C$, from the clustering technique described above, FLIPS (Algorithm 1) implements participant selection for a round by choosing one party at a time from each cluster in a round-robin manner until the number of parties required for the round, $N_r$, is reached. Typically in FL training, the number of parties per round $N_r$ is fixed across all the rounds. Round-robin selection ensures that $N_r$ is spread among as many clusters as possible, increasing the diversity of data in the FL training process. It is recommended that $N_r$ be a multiple of the number of clusters $|C|$, since $N_r$ can then be split evenly among the available clusters, ensuring equal representation from each cluster ($N_r / |C|$ parties per cluster). FLIPS also keeps track of the number of times a party was chosen to ensure that each party within a cluster is given an equal opportunity to participate. In the case where the number of parties per round is less than the number of clusters, not every cluster can participate in every round. So FLIPS additionally tracks the number of times a cluster is selected to participate.
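The selection procedure described above can be sketched as follows. This is an illustrative approximation and not a reproduction of Algorithm 1: the class name `RoundRobinSelector`, the tie-breaking by insertion order, and the choice of the least-selected party within a cluster are assumptions. Party identifiers are assumed to be unique across clusters.

```python
from typing import Dict, List


class RoundRobinSelector:
    """Round-robin selection over clusters with per-party and per-cluster counters."""

    def __init__(self, clusters: Dict[str, List[str]]) -> None:
        self.clusters = {cid: list(m) for cid, m in clusters.items() if m}
        self.party_picks = {p: 0 for m in self.clusters.values() for p in m}
        self.cluster_picks = {cid: 0 for cid in self.clusters}

    def select(self, parties_per_round: int) -> List[str]:
        chosen: List[str] = []
        while len(chosen) < parties_per_round:
            picked_in_pass = False
            # Least-selected clusters go first, so that when N_r < |C| the
            # clusters skipped in earlier rounds are served in later ones.
            for cid in sorted(self.clusters, key=self.cluster_picks.get):
                if len(chosen) == parties_per_round:
                    break
                candidates = [p for p in self.clusters[cid] if p not in chosen]
                if not candidates:
                    continue
                # Within a cluster, prefer the party selected least often so far.
                party = min(candidates, key=self.party_picks.get)
                chosen.append(party)
                self.party_picks[party] += 1
                self.cluster_picks[cid] += 1
                picked_in_pass = True
            if not picked_in_pass:
                break  # every party has already been selected for this round
        return chosen


selector = RoundRobinSelector({"A": ["a1", "a2", "a3"], "B": ["b1", "b2"], "C": ["c1"]})
print(selector.select(3))  # ['a1', 'b1', 'c1']
print(selector.select(3))  # ['a2', 'b2', 'c1']
```

With three clusters and three parties per round, every cluster is represented in every round, and parties within a cluster take turns across rounds.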
Consider the example of using ECG signals from wearables for arrhythmia detection. FLIPS will improve label representation by picking parties with normal and abnormal heartbeats at each round, improving the detection rate for arrhythmia and preventing class/label imbalance. To improve participant representation, FLIPS will try to incorporate parties that were not used in the previous training rounds. This will help bring knowledge from a diverse set of participants and will make the global model more robust. This helps solve the data heterogeneity issues in FL.
To mitigate the effect of stragglers, FLIPS uses the widely used over-provisioning technique [12]. Once we identify the average straggler rate $strg$, FLIPS overprovisions $strg \cdot |S^{(r)}|$ parties in the subsequent training rounds. The overprovisioned parties in round $r + 1$ are selected from the clusters $H^{r}_{sc}$ that the straggler parties in round $r$ were a part of. In this manner, we maintain the representation of all the unique label distributions in FL. This is illustrated in Algorithm 1.
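The over-provisioning step can be sketched as below. Several details are assumptions made for illustration and are not taken from Algorithm 1: the number of extra parties is rounded up, the extra parties exclude both the parties already selected for the next round and the stragglers themselves, and extras are drawn in turn from each straggler cluster. The function name `overprovision` is hypothetical.

```python
import math
from typing import Dict, List


def overprovision(clusters: Dict[str, List[str]],
                  selected: List[str],
                  stragglers: List[str],
                  straggler_rate: float,
                  parties_per_round: int) -> List[str]:
    """Pick extra parties for round r+1 from the clusters of round-r stragglers."""
    party_cluster = {p: cid for cid, members in clusters.items() for p in members}
    # Clusters that contained a straggler in round r, in order of appearance.
    straggler_clusters: List[str] = []
    for party in stragglers:
        cid = party_cluster[party]
        if cid not in straggler_clusters:
            straggler_clusters.append(cid)
    unavailable = set(selected) | set(stragglers)  # assumption: skip both
    pools = {cid: [p for p in clusters[cid] if p not in unavailable]
             for cid in straggler_clusters}
    extra_needed = math.ceil(straggler_rate * parties_per_round)
    extras: List[str] = []
    while len(extras) < extra_needed and any(pools.values()):
        for cid in straggler_clusters:
            if len(extras) == extra_needed:
                break
            if pools[cid]:
                extras.append(pools[cid].pop(0))
    return extras


clusters = {"A": ["a1", "a2", "a3"], "B": ["b1", "b2"], "C": ["c1", "c2"]}
# Party a1 straggled in round r; half of a 4-party round is over-provisioned.
print(overprovision(clusters, ["a1", "b1", "c1"], ["a1"], 0.5, 4))  # ['a2', 'a3']
```

Because the extras come from the stragglers' own clusters, the label distributions those stragglers represented remain covered even if they fail to report again.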

Privacy in FLIPS
There are two additional pieces of private information that need to be secured in FLIPS, when compared to traditional FL (see Section 2.4). One is the parties' label distribution used for clustering. This information gives away the types of data present at a party, and can be used by other parties to gain a competitive advantage. The other is cluster membership: should a party know which cluster it belongs to? Attacks on FL algorithms based on cluster membership are currently unknown, given that the concept and its use in FL are new. But, akin to data and model poisoning, parties could lie about their label distributions to work themselves into smaller clusters. In FLIPS, we treat cluster membership as private information because parties do not need to know this. A party simply needs to know whether it is selected for a round.
Hence, FLIPS needs a mechanism to establish trust and execute clustering privately and securely. For this, we have chosen to use secure enclaves (Section 2.4) over secure multiparty computation (SMPC) and homomorphic encryption (HE). Figure 3 illustrates the end-to-end system design of FLIPS and the associated flow of information. The key components include the FL parties, which hold local data and generate local models; an aggregator, which acts as the centralized coordinator; the TEE within the centralized aggregator; and an attestation server that verifies the aggregator's TEE. The clustering code is loaded on the TEE and all parties share an attestation server to verify the remote TEE's hardware. Together, this establishes the TEE as a secure enclave.

FLIPS in Context
FLIPS differs from other FL systems in that it targets participant selection using label distribution clustering. k-means++ clustering based on label distributions has the advantage of being fast, with a cost that is minuscule relative to FL training time, and has been demonstrated to scale [10,66] to millions of data points (participants). Clustering has to be performed only once, as long as the set of participants or the data at participants does not change significantly. Centralizing this clustering and participant selection in a TEE also has the advantage of fitting nicely into the predominant mold of FL, as it is deployed today, based on a single aggregator [45]. However, FLIPS can also be used with a distributed aggregator (e.g., [11,37,39,41]) by separating the clustering and participant selection module from the aggregation module. This applies to cloud-hosted aggregation or aggregation using homomorphic encryption or secure multiparty computation. In fact, clustering, participant selection and aggregation are logically separate with clear interfaces, and can be individually hosted and secured. This also implies that FLIPS is as scalable as the underlying aggregation algorithm and the method used to secure it.

Validating FLIPS
Our primary goal is to examine the impact of FLIPS's participant selection on model convergence and accuracy at a reasonable scale. Hence, we implemented and evaluated FLIPS in a distributed cluster environment. All the parties, i.e., participants, were executed as nodes on a cluster for local training, as seen in Figure 3. The cluster consists of 13 nodes, each with four Nvidia V100 (16GB) GPUs, an Intel(R) Xeon(R) Silver 4114 CPU, and ∼16 GB RAM per node. We train local models using the GPUs available on this cluster. The aggregator node is executed on a machine with a 2.9 GHz 6-core Ryzen 5 processor with 16GB RAM and a 512GB SSD. To enable trusted execution for label distribution clustering, we use AMD Secure Encrypted Virtualization (SEV) [1] on the aggregator.

Techniques for Comparison
We compare FLIPS with four popular participant selection strategies across different datasets and study the impact on convergence in the presence of stragglers. The first is the widely used random selection [63,55,72] method. This selects each party with the same probability and can lead to class imbalance, where certain classes are underrepresented, as explained in Section 2.2. This may cause the model to be biased towards the overrepresented classes, leading to poor performance. This harms convergence, as it takes more training rounds to reach the desired accuracy.
Second, we compare FLIPS with OORT [51]. OORT uses the idea that parties with a higher local loss can contribute more to an FL job [42] and introduces a statistical and systemic utility metric for participant selection. The system sorts the parties according to the party utility metric, which is a combination of its statistical and systemic utility, selects parties with a higher utility, and explores new parties at each training round.
The third strategy is GradClus [29], which uses the idea of clustering gradients from parties to identify parties with similar data. It performs hierarchical clustering over a similarity matrix constructed across gradients from all the parties in the FL job. The gradients assigned in the beginning are random numbers and get iteratively updated as the party gets picked. At the aggregator, GradClus [29] performs hierarchical clustering, and $|S^{(r)}|$ clusters are formed. GradClus chooses one party from each cluster randomly.
The fourth strategy is TiFL [16], which groups parties into tiers based on their training performance and selects parties from the same tier in each training round to mitigate the straggler problem.To further solve the non-IID problem, TiFL employs an adaptive tier selection approach to update the tiering on the fly based on the observed training performance and accuracy.
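To make the GradClus selection step described above concrete, the following is a minimal sketch, not the implementation from [29]. It assumes cosine distance between per-party gradient vectors and average-linkage agglomerative clustering; both choices, and all function names, are illustrative assumptions.

```python
import math
import random


def cosine_distance(u, v):
    """1 - cosine similarity; zero vectors are treated as maximally distant."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def agglomerative_clusters(vectors, num_clusters):
    """Average-linkage clustering (an assumption); returns lists of party indices."""
    dist = [[cosine_distance(u, v) for v in vectors] for u in vectors]
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > num_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]  # merge the closest pair
        del clusters[y]
    return clusters


def select_one_per_cluster(vectors, num_clusters, seed=0):
    """Pick one party uniformly at random from each gradient cluster."""
    rng = random.Random(seed)
    return [rng.choice(c) for c in agglomerative_clusters(vectors, num_clusters)]


# Hypothetical gradients for four parties: two pairs with similar directions.
gradients = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(agglomerative_clusters(gradients, 2))
print(select_one_per_cluster(gradients, 2))
```

In this toy input, parties 0 and 1 share a gradient direction, as do parties 2 and 3, so each selected set contains one party from each pair.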

Datasets and models used
We focus on two real-world datasets from the healthcare/senior care domain and two benchmark datasets:
• MIT-BIH-ECG-Dataset [65] partitioned across 200 parties trained using a 1-D CNN [56].
The MIT-BIH ECG dataset [65] comprises digitized electrocardiogram (ECG) recordings used for arrhythmia identification. Collected by the MIT Biomedical Engineering Department and Beth Israel Hospital, it includes both normal and aberrant rhythms. The dataset is annotated with AAMI labels [85], a widely accepted standard for ECG rhythm classification. These labels define performance criteria, improve algorithm generalization, and include N (normal beats), S (supra-ventricular ectopic beats), V (ventricular ectopic beats), F (fusion beats), and Q (unclassifiable beats).
The dataset consists predominantly of N beats, necessitating federated learning (FL) to enhance label and participant representation. The dataset is distributed across 200 parties in a non-IID manner. Training involves a 1-D CNN with a learning rate of 0.001 and a decay applied every 20 rounds. FL training is limited to a maximum of 400 rounds.
The HAM10000 dataset [19] contains diverse dermatoscopic images of pigmented skin lesions. It includes 10015 images representing important skin cancer diagnostic categories: akiec, bcc, bkl, df, mel, nv, and vasc. The dataset is suitable for training and evaluating machine learning models for automated diagnosis. The nv images are the most abundant, potentially dominating other categories due to their prevalence caused by UV radiation. This non-IID behavior highlights the need for federated learning. The dataset is distributed across 200 parties, and training involves DenseNet121 with a learning rate of 0.001. A decay is applied every 30 FL rounds, and the maximum number of FL rounds is 400.
The EMNIST (Extended MNIST) dataset [20] contains handwritten letters and numbers and is an expansion of the MNIST dataset. The EMNIST dataset contains characters from numerous alphabets, including numerals and letters from the English alphabet. It also has a collection of symbols from different languages. This dataset is frequently used for testing and training machine learning models for tasks like text classification and handwriting recognition. We subsample 10 lowercase characters ('a'-'j') from EMNIST. This federated variant of EMNIST is known as FEMNIST [55]. We train a LeNet-5 [52] model at each party, which outputs a class label between 0 (a) and 9 (j).
Fashion-MNIST [88] is a dataset of images of clothing items that is commonly used as a benchmark for machine learning models. It consists of 60,000 training and 10,000 test images, each 28x28 pixels in size and labeled with one of 10 different clothing categories, such as t-shirts, trousers, bags, etc. FL applied to the Fashion-MNIST dataset mimics a personalized customer recommendation system. A model trained on the customer's device or organization using FL will understand the customer's preferences, which can then be used to suggest personalized clothing recommendations. The dataset is distributed across 100 parties and training involves LeNet-5 [52].

Emulating Non-IID data distributions
As in the TensorFlow Federated [12] and LEAF [15] FL benchmarks, we emulate a non-IID setting in our experiments by using different data partitioning strategies. We use Dirichlet allocation [91], a widely used technique [12,15], to partition a dataset among several parties in a non-IID manner. It samples $p \sim \mathrm{Dir}_N(\alpha)$, where α is the control parameter and $p_{l,i}$ is the proportion of the data points of label l allocated to party i. An α approaching 0 corresponds to each party having data corresponding to only one label (which is non-IID at its extreme) and an α ≥ 1 corresponds to an IID distribution. As recommended in other federated learning research [30,71,15], we evaluate FLIPS using two different values of α: 0.3 and 0.6. At each round, we sample 20% or 15% of the parties, which is more than the number of clusters of similar parties formed in FLIPS.
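The Dirichlet-based partitioning described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the benchmark code from [12] or [15]: the function names are hypothetical, the Dirichlet draw is obtained by normalizing independent Gamma(α, 1) samples, and proportions are turned into per-party slices by rounding cumulative shares.

```python
import random


def dirichlet_proportions(alpha, n, rng):
    """One draw from a symmetric Dirichlet(alpha) over n parties.

    Normalizing n independent Gamma(alpha, 1) draws yields a Dirichlet sample.
    For extremely small alpha the draws may underflow to zero; not handled here.
    """
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def partition_by_label(labels, num_parties, alpha, seed=0):
    """Return, for each party, the list of datapoint indices assigned to it."""
    rng = random.Random(seed)
    by_label = {}
    for idx, label in enumerate(labels):
        by_label.setdefault(label, []).append(idx)

    parties = [[] for _ in range(num_parties)]
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)
        # p_{l,i}: share of label l that goes to party i
        shares = dirichlet_proportions(alpha, num_parties, rng)
        start, cumulative = 0, 0.0
        for party, share in enumerate(shares):
            cumulative += share
            if party == num_parties - 1:
                end = len(idxs)  # last party takes the remainder
            else:
                end = min(len(idxs), round(cumulative * len(idxs)))
            end = max(end, start)
            parties[party].extend(idxs[start:end])
            start = end
    return parties


# Toy example: 1000 datapoints, 5 labels, 10 parties, alpha = 0.3.
toy_labels = [i % 5 for i in range(1000)]
toy_parts = partition_by_label(toy_labels, num_parties=10, alpha=0.3)
print([len(p) for p in toy_parts])
```

Smaller values of α concentrate each label on fewer parties, which is what makes the resulting partition more strongly non-IID.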

Metrics
We report the test accuracy of the global model against the test dataset after each round. This test accuracy is computed as:

\[
\text{Accuracy} = \frac{1}{L} \sum_{i=1}^{L} \frac{\text{Correct predictions for label } i}{\text{Total number of datapoints for label } i}
\]
This is done to mitigate the effect of label imbalance while computing the accuracy for each test set, as each label may have a different number of datapoints. The accuracy numbers reported are an average of 6 runs for each experiment.
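As an illustration of why a per-label accuracy mitigates label imbalance, the sketch below computes the ratio of correct predictions to datapoints for each label and then averages those ratios with equal weight. The equal-weight average over labels and the function name are assumptions made for this example.

```python
def per_label_mean_accuracy(y_true, y_pred):
    """Average, over labels, of (correct predictions / datapoints) for that label."""
    totals, correct = {}, {}
    for truth, pred in zip(y_true, y_pred):
        totals[truth] = totals.get(truth, 0) + 1
        if truth == pred:
            correct[truth] = correct.get(truth, 0) + 1
    ratios = [correct.get(label, 0) / count for label, count in totals.items()]
    return sum(ratios) / len(ratios)


# A model that always predicts the majority label 0:
y_true = [0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0]
print(per_label_mean_accuracy(y_true, y_pred))  # prints 0.5
```

Plain accuracy on this toy input would be 0.8, even though the minority label is never predicted correctly; the per-label average exposes this.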
We report both (i) the highest accuracy obtained using a specific FL technique within the FL rounds threshold and (ii) the number of communication rounds needed for the global model to reach a target/desired accuracy. The latter depicts how fast any participant selection technique converges.
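Both reported quantities can be read off a per-round accuracy curve. The following sketch uses a hypothetical helper name and a made-up curve purely for illustration.

```python
def summarize_run(accuracy_per_round, target, rounds_threshold):
    """Return (peak accuracy, first round reaching target or None) within the threshold."""
    curve = accuracy_per_round[:rounds_threshold]
    peak = max(curve)
    rounds_to_target = next(
        (r for r, acc in enumerate(curve, start=1) if acc >= target), None
    )
    return peak, rounds_to_target


# Made-up accuracy curve, one entry per FL round.
curve = [0.20, 0.50, 0.61, 0.58, 0.70]
print(summarize_run(curve, target=0.60, rounds_threshold=400))  # prints (0.7, 3)
```

A result of `None` for the second value corresponds to a technique that does not reach the target accuracy within the rounds threshold.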

TEE Clustering Overhead
Clustering label distributions, by itself, is fairly efficient and takes less than one second for all our datasets (≈100 ms for the HAM10000 dataset with 200 parties, and lower for other datasets, on a 2.3 GHz 4-core Intel Core i9 equipped server with 16 GB RAM and a 512 GB SSD). The overhead of using TEEs to perform clustering is approximately 5% (105.4 ms vs. 100.5 ms) in the case of AMD SEV on a server running the aggregator. Hence, using TEEs gives us a reasonable way to implement private label distribution clustering for FLIPS.
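The quoted overhead follows directly from the two timings; a short check, with a hypothetical helper name:

```python
def relative_overhead(baseline_ms, tee_ms):
    """Extra time spent inside the TEE, as a fraction of the baseline time."""
    return (tee_ms - baseline_ms) / baseline_ms


print(f"{relative_overhead(100.5, 105.4):.1%}")  # prints 4.9%
```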

Data Heterogeneity
Tables 1-8 summarize our results for FedYogi. Tables 9-16 summarize our results for FedProx. Tables 17-24 summarize our results for FedAvg. At a high level, we observe that FLIPS can reach target accuracies much faster and achieve much higher peak accuracy. When considering the number of rounds needed to reach targeted accuracies for the MIT-ECG (60%) and HAM10000 (60%) datasets, we observe that Random selection, TiFL and Gradient Clustering take more than 400 rounds in the case of FedAvg, FedYogi and FedProx. While the performance of Random selection is not surprising considering all the reasons outlined so far in this paper, GradClus's performance is surprising, given that it also clusters parties, albeit using gradients. TiFL's adaptive tiering approach is unable to group the parties with under-represented labels into a single tier, which explains its performance. OORT's statistical utility function does enable it to perform better than Random, TiFL and GradClus. This trend can also be observed from the example convergence plots in Figures 5 and 7. From Table 1, we observe that FLIPS reaches the target accuracy 1.15-1.86× faster, i.e., in fewer rounds, when compared to OORT when α = 0.6, and 1.08-2.37× faster when α = 0.3. Hence, when the degree of "non-IIDness" of the data increases (corresponding to decreasing α as explained in Section 4.3), FLIPS performs better. This reduction in the number of rounds not only saves time but also results in much lower communication costs, since parties participate in far fewer rounds. In the case of the HAM10000 dataset in Table 3, the performance benefits of FLIPS are more pronounced. The speedup (fewer rounds) is 1.32-1.52× for α = 0.6 and 1.56-2.10× for α = 0.3. We see a similar trend in the case of FedProx in Tables 9 and 11, with 1.12-2× speedup for α = 0.6 and 1.35-2.14× speedup for α = 0.3 for the MIT-ECG and HAM10000 datasets.
Also, in a world where training jobs are rerun for 1-2 percentage point improvements in accuracy (in absolute terms), Tables 2 and 4 for FedYogi illustrate that FLIPS improves peak model accuracy by > 30 percentage points for MIT-ECG [...] Even for a "less non-IID" distribution corresponding to α = 0.6, FLIPS improves accuracy by more than 12 and 15 percentage points for MIT-ECG and HAM10000, respectively. These benefits endure when compared to GradClus (≈ 8-30 percentage point improvements in accuracy) and TiFL (≈ 6-30 percentage point improvements in accuracy). They are lower when compared to OORT, but still significant: ≈ 3-15 and 2-5 percentage points corresponding to α of 0.3 and 0.6, respectively. A similar trend is seen for FedProx and FedAvg in Tables 10, 12, 18 and 20, where the accuracy of FLIPS increases by tens of percentage points vis-a-vis random selection, TiFL and GradClus. Unlike in FedYogi and FedProx, where the peak accuracy of FLIPS is higher than that of OORT, peak accuracies of OORT are closer to those of FLIPS in the case of FedAvg, although the difference remains significant in many cases, as illustrated in Tables 17-20.
Next, we move on to the FEMNIST dataset. This dataset is more IID in its centralized version, and in Table 5 we observe that for all the party selection techniques, the target accuracy is reached within the threshold of 200 rounds.
The highest accuracy obtained is 86.86% for FLIPS. In Table 5, we can see that FLIPS achieves the target accuracy 1.3x faster than the most comparable technique, OORT, for α = 0.3. For α = 0.6, however, OORT reaches the target accuracy in almost the same number of rounds as FLIPS. FLIPS is 1.5x-2.9x faster than Random selection, 1.3x-2.7x faster than GradClus and 1.4x-2x faster than TiFL. In Figure 11, we can see that for α = 0.3, FLIPS performs better than all the other techniques, while for α = 0.6, OORT performs similarly to FLIPS.
For the Fashion MNIST dataset, FLIPS performs better than all the other techniques, as can be seen in Figure 7.
All the techniques attain the target accuracy within the given rounds threshold. FLIPS is 1.2x-1.5x faster than Random selection, 1.45x-2x faster than OORT, 1.13x-1.66x faster than GradClus and 1.73x-2.2x faster than TiFL. FLIPS attains the highest accuracy, 85.14%, which is higher than all the other techniques. This is consistent across all the other FL algorithms.
We also observe that FLIPS achieves the highest model accuracy in all scenarios. This improvement in accuracy can be attributed to the fact that FLIPS improves the accuracy of the underrepresented labels. This can be seen in Figure 13, where FLIPS brings the largest accuracy improvement for the underrepresented labels.

Platform Heterogeneity
We take the best performing techniques, FLIPS and OORT, and evaluate them under different platform heterogeneity constraints. OORT selects 1.3x the parties in FL at each round to overprovision for straggler parties.
In Tables 1 and 2, we can observe that under a 10% straggler rate OORT is unable to attain the target accuracy even after 400 rounds of training, while FLIPS attains the target accuracy 3 out of 4 times, missing the target accuracy by a mere 2.8% when α = 0.6 and party % = 20. Under the 20% straggler rate, the results are similar: OORT cannot reach the target accuracy in all 4 settings, while FLIPS attains the target accuracy 3 out of 4 times. FLIPS attains an accuracy of 74.66%, which is the highest in the presence of 10% stragglers, and 74.20% with 20% stragglers. This is approximately 4 percentage points less than the 0% straggler rate, i.e., the ideal scenario. In Table 1 and Table 2, we can see that for the 10% straggler rate, FLIPS is faster by 1.3x-2x than OORT except for the α = 0.6 and party % = 20 condition, and attains 11-38 percentage points higher accuracy than OORT. For the 20% straggler rate, FLIPS is faster by 1.2x-2x than OORT except for the α = 0.6 and party % = 15 condition, and attains 2-30% higher accuracy than OORT. Figure 6 also shows that FLIPS is more robust to stragglers than OORT.
For the HAM10000 (Skin Lesion) dataset, FLIPS outperforms OORT across all straggler rates. This can be seen in Tables 3 and 4. OORT cannot reach the target accuracy in all cases, while FLIPS does in all. FLIPS is 1.2x-1.9x faster than OORT under the 10% straggler rate, while it is 1.1x-2.1x faster than OORT under the 20% straggler rate. FLIPS attains 13-17% higher accuracy in the classification of skin lesions under the 10% straggler rate and 14-26% higher accuracy under the 20% straggler rate. Figure 8 depicts the better performance of FLIPS under stragglers via convergence curves.
In the case of the FEMNIST dataset in Table 5, FLIPS outperforms OORT and TiFL; when α = 0.6, its performance is comparable to OORT's. For α = 0.3, FLIPS is 1.3x-1.4x faster than OORT and 1.15x-1.85x faster than TiFL, while for α = 0.6, OORT's performance is similar to that of FLIPS. This is because the data distribution is more IID for the 10% straggler rate. For the 20% straggler rate, FLIPS is 1x-1.3x faster than OORT and 1.15x-1.25x faster than TiFL. Additionally, FLIPS achieves 1.5-3.5% higher accuracy than OORT for the 10% straggler rate, while for the 20% straggler rate FLIPS attains higher accuracy in all the cases by 0.7-2%, as seen in Figure 5.
For Fashion MNIST, we observe that FLIPS outperforms OORT and TiFL in all settings. It achieves 3-4 percentage point improvements in accuracy, as seen in Table 8, and is 1.3x-2.2x faster than OORT and TiFL. These improvements are consistent across FedAvg and FedProx too.
FLIPS is robust against data heterogeneity and platform heterogeneity (presence of stragglers) where data distributions are non-IID. Using intelligent party selection, we improve the terminal accuracy and reduce the communication overheads in FL by using fewer rounds than other techniques.
To summarize, FLIPS provides a middleware support system for FL by:
• Improving the terminal accuracy when the communication overheads (FL rounds) are fixed.
• Reducing communication overheads required to attain the desired/target accuracy.
• Reducing the time required for FL to attain a target accuracy by lowering the number of FL rounds required.

Related Work
The Intelligent Participant Selection in FLIPS deals with both platform and data heterogeneity using a data-driven approach to cluster similar parties and outperforms existing techniques like OORT and GradClus.
Data Heterogeneity: Non-IID datasets in FL introduce client drift in the global model, which hampers its convergence. Many solutions exist to reduce client drift and address data heterogeneity. [47,46] propose a client-drift correction update $(m - m_p)$ between the server model $m$ and each party's model $m_p$ to mitigate client drift. This improves convergence. [30] introduces a penalty and gradient correction term in the local loss function to account for the local drift and bridge the gap between $m$ and $m_p$ for faster convergence in non-IID settings. [7] performs more optimizations at the local/party level by introducing dynamic regularization terms to bring the global and local models closer, reducing client drift.
Several studies have examined clustering techniques to personalize models in FL by grouping similar parties based on local model similarity [78]. [70] discuss clustering local model updates using cosine similarity to identify parties with similar data distributions and address data heterogeneity. They create personalized models for each group of parties after a fixed interval, which involves periodic re-clustering. IFCA [32] assigns each party a cluster identity based on its data and tailors model parameters for each cluster using gradient descent to improve model convergence for similar parties.
FedLabCluster [93] performs label presence clustering, aggregating models within each cluster to enhance convergence for models specific to each cluster. Another study [54] investigates sharing encoded data among parties and clustering datasets using K-Means to achieve high clustering accuracy in FL. CMFL [86] dynamically identifies irrelevant local model updates by comparing them to the global model, and discards updates irrelevant to the global model, addressing data heterogeneity in non-IID cases. In a hierarchical federated learning (HFL) setting, [25] improves convergence by selecting parties with the lowest KL divergence between the local and the edge aggregator's data. FedCBS [94] computes the Quadratic Class-Imbalance Degree from label distributions to choose parties with more balanced grouped datasets, addressing data heterogeneity. However, none of these studies focus on intelligent participant selection or discuss confidentiality approaches in FL.
Platform Heterogeneity: Solutions like Aergia [21] deal with platform heterogeneity by offloading the training of stragglers to other parties with similar datasets. This reduces the time required for federated learning while maintaining accuracy similar to the baseline techniques. FLIPS leverages the knowledge of similar parties to perform participant selection across parties to make a two-fold improvement across accuracy and training time, using fewer communication rounds in intermittent FL scenarios. [50] addresses the straggler issue by using Locality Sensitive Hashing to cluster models/parties and drops duplicate and slow model updates. Further, it requires parties to train locally without knowing whether their models will be used, often wasting local computing resources, which may be undesirable in edge settings. FedLesScan [28], a clustering framework, clusters parties into three groups (rookies, participants and stragglers) based on the device variations. Parties are selected from these clusters to mitigate the effect of stragglers and reduce training time and cost. FLIPS uses a more data-driven approach to pick parties that compensate for the stragglers. [74] uses a mechanism where it selects faster parties in the beginning to attain a target accuracy and then incorporates the slower parties to improve the global model. [69] deals with platform heterogeneity by using model grouping and weighting based on arrival delay to identify stragglers and entropy-based approaches to mitigate adversaries. [75] uses gradient coding to introduce redundancy in model training to mitigate the effect of stragglers. FedAT [17] uses a straggler-aware, weighted aggregation heuristic to address the platform heterogeneity issue. This heuristic assigns higher weights to faster devices, which helps to compensate for the slower devices. FedCS [58] is a communication-efficient federated learning framework inspired by quantized compressed sensing. It compresses gradients at client devices using block sparsification, dimensional reduction, and quantization. Then, it reconstructs gradients at the parameter server and achieves almost identical performance to no compression, while significantly reducing communication overhead.

Towards FLIPS in real deployments
We implement FLIPS in smartspaces and assisted living for older adults, analyzing real-time data from multiple facilities and individuals. We aim to monitor residents, identify health and safety incidents, and detect changes in daily activities, falls, illnesses, and wandering events. By utilizing federated learning (FL) and FLIPS, we ensure robust and timely analysis of personal health records and device data while maintaining data privacy.
A specific area of interest is using ECG data from portable devices to detect arrhythmias and heart irregularities. Machine learning models trained on ECG datasets have shown promise in early detection and treatment of cardiac issues.
With FLIPS, we can train heart rhythm and fall models without compromising sensitive personal data, thus enhancing both privacy and model robustness. We have partnered with a senior-care community consisting of approximately 50 facilities, which serves as a trusted entity for storing confidential resident information. These communities act as aggregators and trusted parties, facilitating label distribution in a FLIPS deployment.
Additionally, we focus on the detection and localization of falls in assisted living facilities using data from cameras, acoustic sensors, and wearable tags with accelerometers, gyroscopes, and location-based sensors. Our efforts involve developing robust event detection models to handle device heterogeneity and variations in fall risk [8]. Privacy concerns are addressed by ensuring secure collection, communication, and storage of remotely obtained data to prevent unauthorized access and misuse.
While FLIPS employs federated and decentralized learning, the clustering aspect is centralized and demonstrated using a centralized aggregator in Section 3. We chose this approach because it aligns with the most commonly used architecture in real-world federated learning systems, such as Google FL [13,11], IBM FL [60,38], FATE [57], and Facebook FL [37]. The centralized aggregator offers simplicity, statelessness, fault tolerance, and ease of storing FL job data and accounting information in fault-tolerant cloud object stores or key-value stores. In case of aggregator failure, data can be recovered, and aggregation can be resumed from the last round. Communication and aggregator failures are easily recovered by requesting retransmission of lost model updates from the parties, assuming each party securely stores its local model.

Conclusions and Future Work
In this article, we present FLIPS: Federated Learning using Intelligent Participant Selection, which improves label and participant representation in FL to improve convergence, attain higher accuracy and reduce communication overheads.
Our empirical evaluation indicates that FLIPS on average speeds up FL algorithms by 1.2×-2.9× and improves terminal accuracy by 5-30 percentage points when compared to the existing techniques.
We are currently exploring the following research directions: (1) personalization in Federated Learning (FL), (2) handling changing data distributions, (3) decentralized clustering of label distributions and (4) FLIPS and adversarial robustness. Personalization in FL involves adapting to individual users' or devices' unique requirements and characteristics. Instead of a centralized dataset, we plan to train the model using data from similar parties or devices separately, allowing for personalized models that account for specific patterns and differences in each party's or device's data. Additionally, we plan to address the issue of changing data distributions, which is relevant in IoT settings with streaming data that can introduce shifts in data distributions. FLIPS will capture these changes and train robust models while optimizing resource usage. We anticipate that as datasets and AI-based analytics techniques continue to expand, the ability to handle multiple requirements like privacy, performance, and latency will become crucial, and FLIPS represents a step in that direction.
Decentralized FL architectures rely on secure multi-party computation (SMPC) with or without homomorphic encryption to ensure model update privacy. However, their higher computational requirements make them less popular.
To implement FLIPS using SMPC, the clustering in Algorithm 1 must be computed using an SMPC protocol. Participant selection can be achieved through leader election, with the leader implementing the FLIPS selection protocol and other parties auditing the process. Finally, we are interested in investigating the interplay between diverse datasets in FL and techniques used for adversarial robustness, in particular whether such techniques [61,95] exclude valid underrepresented data classes. We also plan to investigate changes needed to make label distribution clustering work with adversarially robust FL.

Figure 1: Overview of Federated Learning

At each FL job round, each participant trains a local model using its local data. The local model is initialized with the global model parameters received from the aggregator. The local training process typically involves several iterations or epochs (this is part of the hyperparameters agreed at the start) to improve the model's performance on the local dataset. After local training, the participants generate model updates, which typically consist of the updated model parameters or gradients. These updates capture the local knowledge learned from the device's data. The aggregator combines these updates, and applies the aggregated update to construct the global model using the optimizer. The new global model parameters are sent to the participants selected for the next round. The FL process typically involves multiple rounds of local training, model updates, aggregation, and distribution. This iterative process allows the model to be refined and improved over time by leveraging the collective intelligence of the participating devices.

Figure 2: Elbow point determination for optimal k

Figure 3: End-to-end integrated system design for Private Clustering in FLIPS

Figure 12: Convergence plots on Fashion MNIST dataset in the presence of stragglers, FL Algorithm: FedYogi

Figure 13: Convergence curves on underrepresented labels for ECG and HAM10000 datasets

// parties used per cluster
H_c ← MIN-HEAP()   // clusters used
H_s^r ← {}, H_sc^r ← MAX-HEAP(), count_strg ← 0, Stragglers = False   // track straggler parties, their clusters and counts
Initial model m_1, Parties per round N_r
for c ∈ {1, 2, ..., C} do
    c.picks ← 0, INSERT(H_c, c), h ← MIN-HEAP()
    for p ∈ {1, 2, ..., c} do
        p.picks ← 0, INSERT(h, p)
    H[c.id] ← h
for r ∈ {1, 2, ..., R} do
    S^(r) ← {}
    for i ∈ {1, 2, ..., N_r} do
        c ← EXTRACT-MIN(H_c), H ← H[c.id], p ← EXTRACT-MIN(H)
        INCREMENT(p.picks, 1), INSERT(H, p)
        INCREMENT(c.picks, 1), INSERT(H_c, c)
        Select unique parties: S^(r) ← S^(r) ∪ {p}
    if Stragglers then
        for i ∈ {1, 2, ..., int(strg * N_r)} do
            c ← EXTRACT-MAX(H_sc^r), H ← H[c.id]   // choose cluster with most stragglers
            c: p ← EXTRACT-MIN(H) not in H_s^r   // pick a non-straggler party in c
            S^(r) ← S^(r) ∪ {p}
    SEND m^(r) to each i ∈ S^(r)
    for RECV model update x

In FL, datasets are local to the parties, which partition them into local training and test sets. To compare FLIPS and other techniques for this study, we use a global test set consisting of data corresponding to all the labels in the distributed dataset. The datapoints in this dataset are unknown to any party in our emulation. Hence, it is a valid test set. This test set also helps us evaluate the techniques at each training round to get a closer look at convergence. Generally, global test sets are not used in FL practice, only while designing FL algorithms. We maintain the global test set inside the aggregator's TEE in our implementation.
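The non-straggler path of the heap-based selection loop in the algorithm fragment above can be sketched in Python with the standard `heapq` module. This is a minimal illustration under stated assumptions, not the FLIPS implementation: the class name is hypothetical, ties are broken by cluster and party id, duplicate picks within a round are simply skipped (while still counting as a pick), and the straggler branch is omitted because the fragment does not show how the straggler heaps are maintained.

```python
import heapq


class BalancedSelector:
    """Pick parties so that clusters, and parties within clusters, are used evenly."""

    def __init__(self, clusters):
        # clusters: dict mapping cluster id -> list of party ids (assumed disjoint)
        self.cluster_heap = [(0, c) for c in sorted(clusters)]  # (picks, cluster id)
        heapq.heapify(self.cluster_heap)
        self.party_heaps = {}
        for c, parties in clusters.items():
            heap = [(0, p) for p in sorted(parties)]  # (picks, party id)
            heapq.heapify(heap)
            self.party_heaps[c] = heap
        self.num_parties = sum(len(p) for p in clusters.values())

    def select(self, parties_per_round):
        if parties_per_round > self.num_parties:
            raise ValueError("cannot select more unique parties than exist")
        selected = set()
        while len(selected) < parties_per_round:
            # Least-used cluster first, then the least-used party inside it.
            c_picks, c = heapq.heappop(self.cluster_heap)
            p_picks, p = heapq.heappop(self.party_heaps[c])
            heapq.heappush(self.party_heaps[c], (p_picks + 1, p))
            heapq.heappush(self.cluster_heap, (c_picks + 1, c))
            selected.add(p)  # a repeated party is not added twice
        return sorted(selected)


# Three label-distribution clusters of very different sizes.
selector = BalancedSelector({0: [0, 1, 2, 3, 4, 5], 1: [6, 7], 2: [8]})
print(selector.select(3))  # prints [0, 6, 8]
print(selector.select(3))  # prints [1, 7, 8]
```

Even though cluster 0 holds most of the parties, each round draws one party from every cluster, which is the equitable representation that the selection aims for.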

Table 2: MIT ECG dataset: highest accuracy attained within the rounds threshold

Table 4: HAM10000 (Skin Lesion) dataset: highest accuracy attained within the rounds threshold

Table 6: FEMNIST dataset: highest accuracy attained within the rounds threshold

Table 8: Fashion MNIST dataset: highest accuracy attained within the rounds threshold

Table 10: MIT ECG dataset: highest accuracy attained within the rounds threshold

Table 12: HAM10000 (Skin Lesion) dataset: highest accuracy attained within the rounds threshold

Table 14: FEMNIST dataset: highest accuracy attained within the rounds threshold

Table 16: Fashion MNIST dataset: highest accuracy attained within the rounds threshold

Table 18: MIT BIH ECG dataset: highest accuracy attained within the rounds threshold

Table 22: FEMNIST dataset: highest accuracy attained within the rounds threshold

Table 24: Fashion MNIST dataset: highest accuracy attained within the rounds threshold