MOHAWK: Mobility and Heterogeneity-Aware Dynamic Community Selection for Hierarchical Federated Learning

The recent developments in Federated Learning (FL) focus on optimizing the learning process for data, hardware, and model heterogeneity. However, most approaches assume all devices are stationary, charging, and always connected to Wi-Fi when training on local data. We argue that when real devices move around, the FL process is negatively impacted and the device energy spent for communication is increased. To mitigate such effects, we propose a dynamic community selection algorithm which improves the communication energy efficiency, and two new aggregation strategies that boost the learning performance in Hierarchical FL (HFL). For real mobility traces, we show that compared to state-of-the-art HFL solutions, our approach is scalable, achieves better accuracy on multiple datasets, converges up to 3.88× faster, and is significantly more energy efficient in both IID and non-IID scenarios.


INTRODUCTION
Federated learning (FL) trains machine learning (ML) models using the local data available on edge devices and then aggregates the updated local models in the cloud to obtain a global model. For instance, Federated Averaging (FedAvg) [23] simply averages the updated local model parameters to obtain a new global model. Many improvements to the original FedAvg algorithm [23] have been proposed to boost the learning in FL systems, e.g., FedProx [17], FedMax [5], and FedNova [28], to mention a few. Challenges like hardware, model, and data heterogeneity can negatively impact the learning and communication in FL [16], since in real-world scenarios all types of heterogeneity appear naturally. In particular, the impact on communication is the most important one [16, 20], especially since the edge devices may struggle to communicate the trained model back and forth with the cloud. Additionally, hardware heterogeneity affects the devices' communication through different connectivity technologies (e.g., Wi-Fi, 4G, or 5G). Finally, real data distributions can make the learning process harder to converge, hence requiring more communication rounds until a certain accuracy threshold is reached. This induces an even bigger impact on communication, as each participating device needs to download and upload the local model multiple times.
Another line of work brings the computation even closer to the edge by using hierarchical FL to reduce the communication overhead of FL. Hierarchical FL (HFL), first pioneered in [18], uses edge Access Points (APs) as intermediary aggregation points before transmitting the edge models to the cloud for global aggregation. This enables edge devices to use higher communication speeds with local APs (instead of communicating directly with the cloud).
Previous HFL solutions [1, 18] make every device communicate with the same pre-assigned AP during all communication rounds. This limiting assumption does not consider the physical distance between devices and their assigned AP, hence it directly impacts the capacity and energy consumption spent on communication. Other HFL solutions consider the distance between the device and the AP to select the AP [22], but ignore the real mobility patterns of the various devices (i.e., devices are assumed to be stationary and uniformly distributed in a given area). Yet other HFL solutions consider devices that can randomly change their current AP with one of its neighbors [7]. Such state-of-the-art approaches assume that any AP has only two neighboring APs to which devices can randomly connect. This assumption is not realistic, since APs can have any number of neighbors and device mobility is not dictated by fixed probabilities, but by human behavior. Works such as [1] consider an area that is too small to be realistic (i.e., 750m×750m) for the deployment of edge devices and APs. Even though the authors of [25] consider a larger area of 2km×2km, their focus is on simulated mobility and fully cooperative learning. Even with such solutions, there are still important problems associated with real device mobility that remain unaddressed. As mentioned in [20], the assumption that all data owners are willing to participate in the FL process anytime and anywhere is not realistic. We illustrate in Fig. 1 how some device C starts training at 1PM, has its battery depleted at 2PM, and then becomes available again at 3PM. Due to the aforementioned limitations, current methods would not consider device C for aggregation; however, in order for devices to be able to learn anytime and anywhere, this is one of the first research questions that needs to be addressed.
Recent works consider a uniform distribution of the locations the devices can be at, i.e., Points of Interest (PoIs), and continuous availability of all participating devices in the FL process. However, this is not realistic since, as can be seen in Fig. 2, the real-world distribution of PoIs looks nothing like a uniform distribution. The presence of hubs can be seen especially in cities, where Downtown and specific areas are more frequented by people. Having a realistic experimental validation is thus crucial to get us closer to learning anytime and anywhere.
To address these limiting factors, we propose a Mobility and Heterogeneity-Aware Dynamic Community Selection (MOHAWK) framework for Hierarchical Federated Learning that combines a dynamic community selection algorithm with real device mobility under heterogeneous environments, using two new aggregation techniques which are essential for energy efficiency and scalability. As shown in Fig. 1, we provide a solution to adapt the learning process to missing and reappearing devices (such as device C at 2PM), which we shall refer to as dynamic edge aggregation. We also propose a selective global aggregation technique which only aggregates the model weights from the APs that aggregated any devices since the last global aggregation. Our dynamic edge aggregation enables more devices to participate with their local updates in the learning process, while the selective global aggregation results in faster global convergence and scales to any number of APs. Our solution enables devices to learn continuously during the day, whenever they are available, spending less energy for communication, converging faster, and achieving better accuracy than other state-of-the-art HFL solutions.
The contributions of the paper are as follows:
• Mobility and Heterogeneity-Aware Dynamic Community Selection (MOHAWK): A framework that combines a dynamic community selection algorithm for energy-efficient communication in mobile FL systems with two new aggregation strategies (i.e., dynamic edge aggregation and selective global aggregation) that boost the learning performance under heterogeneous scenarios.
• Federated Learning using Real Device Mobility: MOHAWK aims to include as many devices in the learning process as possible, thus providing a more inclusive (i.e., fair) environment which guarantees that less energy is wasted on training models that are ultimately not aggregated. To the best of our knowledge, this paper is the first to account for real device mobility in FL.
• Empirical Validation: We show that MOHAWK converges up to 3.88× faster and is more scalable than state-of-the-art HFL solutions on MNIST, EMNIST, CIFAR10, and CIFAR100, while being more energy efficient.
• A Hardware Prototype: We provide real energy measurements on a hardware prototype with 36 Raspberry Pi devices that show up to 2.24× less average energy wasted per device for training local models compared to state-of-the-art HFL solutions.
To summarize, we provide a new energy-efficient solution for HFL which considers the real mobility and availability of edge devices. The remainder of the paper is organized as follows: Section 2 discusses relevant prior work. In Section 3, we present our proposed approach. Section 4 shows our experimental results (both simulation and hardware prototype), while Section 5 summarizes our main contributions and outlines directions for future work.

RELATED WORK

Hierarchical Federated Learning
HierFAVG [18] considers a scenario with 50 devices and 5 APs, optimistically assuming that at every communication round every AP will handle exactly 5 devices. Looking at Fig. 2, this is a strong limitation. Abad et al. [1] consider 28 users uniformly distributed across a circular area with radius 750m; they fix 7 APs, each having 4 devices during each communication round. No device mobility is considered, and results are provided only for CIFAR10 under the IID scenario with FedAvg used for aggregation. Hier-local-QSGD [19] is proposed for HFL with quantization, using 20 clients and 4 edge servers, each server having 5 devices during each communication round. Hierarchical Federated Edge Learning (HFEL) [22] addresses resource allocation optimization and the edge association problem. The authors consider up to 60 devices distributed randomly over a 500m×500m area with up to 25 edge servers. However, HFEL does not consider any device mobility, and thus their solution is purely a resource allocation optimization. We call the methods above stationary, since all devices have a pre-assigned AP to communicate with. As mentioned in [2], the methods proposed in [1, 22] are appealing for computation offloading, but have limited applicability in the context of real mobile devices, where there is almost no possibility to organize them.
Mobility-Aware Cluster FL (MACFL) [7] tries to relax the stationary scenario by allowing devices to move to a neighboring AP based on a fixed probability, assuming each AP has only two neighbors. MACFL also uses 50 devices and 5 APs for its experiments on MNIST, just like HierFAVG [18]. We call MACFL pseudo-mobile because it allows some devices to change their APs, but makes the same assumption as HierFAVG, i.e., that all devices are available at every communication round. In Federated Attentive Message Passing (FedAMP) [12], the authors propose a heuristic version of FedAMP called HeurFedAMP, which considers a self-attention hyperparameter to control the weight of each message sent from a client to the cloud. This self-attention hyperparameter uses the cosine similarity between the local model and the global model, and adjusts the weight of the local model in the aggregation based on how different the local model is from the global model. Inspired by HeurFedAMP, MACFL [7] uses the same attention scheme for every device and edge aggregation to help the learning process with clients randomly changing their APs.
To address the limitations of stationary and pseudo-mobile state-of-the-art solutions, we are the first to consider real mobility data for HFL, together with realistic setups of devices and APs. Unlike [1, 7, 18, 19, 22], we do not assume that all devices are available at every communication round, as they get disconnected due to issues like battery depletion or temporary loss of signal.

Mobility Models and Scalability
Ochiai et al. [25] propose a fully distributed cooperative FL system organized by nodes that are physically nearby, i.e., nodes communicate with other nodes if they are within radio range, thus producing opportunistic contacts for learning. The authors of [25] simulate mobility using Random Waypoint (RWP) mobility [3] and Community Structured Environment (CSE) mobility [24]. In RWP, the nodes are devices that walk around a given area, spending some time at different locations, while in CSE the nodes are devices assumed to be part of a few communities, moving from one community to another. For RWP, the authors in [25] use an area as large as 2km×2km and assume 100m as the radio range for communication. Ochiai et al. also show in [25] that the large area for RWP requires a longer time for more contacts among the nodes to obtain enough accuracy. Another realistic approach in [27] takes into account a real-world vehicle trace dataset to create an FL system that treats the delay as a learning parameter, due to the high mobility of vehicles. The authors use the Mobile Century dataset [9], which contains 77 vehicles and traces approximately 20 miles of Interstate 880, but only 10 agents are selected during every communication round.

Figure 3: Device availability and normalized device availability during the month of May 2020. We observe a clear increase in the number of devices available for aggregation since our solution (i.e., the green line) is always on top, while other (state-of-the-art) methods consider fewer devices for aggregation. Also, we note cases with no devices available (e.g., May 3, 4, 5, etc.).
In contrast to [25], instead of using simulated mobility, we use the Foursquare real-mobility dataset [8]. We consider the top 1,000 devices that appear most often during the month of May 2020 within the metropolitan area of interest, based on Foursquare data. The devices are smartphones associated with people moving around, e.g., walking or driving. From these 1,000 devices, at any given time step there are about 76 devices present on the map, almost the same as the total number of devices used by [27]; this shows the scalability of our work.

Hardware Validation
Real hardware experiments are rarely reported in the FL literature. For instance, Luo et al. [21] use 20 Raspberry Pi 4 and 10 NVIDIA Jetson Nano devices and measure the average computation and communication time, without power or energy measurements. ClusterFL [26] uses a prototype built with 7 NVIDIA Jetson TX2 and 3 NVIDIA Jetson AGX devices to evaluate the impact of dynamic network conditions, concluding that real-world 4G LTE has substantially lower and more unstable bandwidth compared to Wi-Fi and Ethernet. Our proposed hardware prototype is equipped with 36 Raspberry Pi 3B+ devices, each having its own dedicated Smart Power 2 device to measure energy consumption.

METHODOLOGY

Device Availability and Distribution
Previous works [1, 7, 18, 19, 22] consider a uniform (random) distribution of devices that are always available for all their experiments. However, as shown in Fig. 2, it is unrealistic to assume a uniform distribution of PoIs over the entire area of interest. Besides considering unrealistically small deployment areas (e.g., 500m×500m [22]), the uniform distribution of devices also limits the usefulness of hierarchical approaches. We observe, in all cities, a higher concentration of clients and their devices in some areas (e.g., Downtown) and a sparser distribution of devices on the outskirts. In Fig. 2, we show 37,994 PoIs where users may go, e.g., McDonald's, Starbucks, parks, and shopping malls, to name a few. We consider all PoIs to be APs, since all of them have at least one private Wi-Fi network available. So, it is natural for a client who goes to such a PoI not to rely on their own cellular data, but to connect to the Wi-Fi network available at that location. Thus, for HFL we use a dynamic connection between every device and its closest AP every time the device is available.
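The closest-AP association described above can be sketched as follows; the function names and the use of a haversine distance over (latitude, longitude) pairs are our own illustrative choices, not the paper's implementation:

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    r = 6_371_000.0  # Earth radius [m]
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def closest_ap(device_pos, aps):
    """Dynamic community selection: return the id of the AP (PoI)
    nearest to the device's current position."""
    return min(aps, key=lambda ap_id: haversine_m(*device_pos, *aps[ap_id]))
```

A device re-runs this selection at every time step it is available, so its community follows its movement rather than a fixed pre-assignment.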
Device availability is also a big issue for FL systems, yet it is even less discussed in the literature. In Fig. 3, we show the availability of the top 1,000 devices that appear most often during the month of May 2020, out of a total of 12,866 devices. As can be seen, we have the lowest number of active devices at night and peak numbers of active devices during the afternoon. For both PoIs and device availability, we use data collected from Foursquare [8] over an area of 17.5km×17.5km. Previous HFL solutions consider that, at any given time, all APs have a fixed number of devices connected, i.e., all devices are available at any time. As seen in Fig. 3, from 1,000 devices, we may end up having at most 76 devices present at the same time; sometimes, we may not have any devices available at all. To enable HFL anytime and anywhere, solutions should be robust to such dramatic variations in availability.
Since current FL solutions are oblivious to device availability issues, they typically aggregate, at time t + 1, all devices that were trained at t, assuming they are all still available. However, not only may some devices be absent at t + 1, they may also come back at a later time (e.g., t + 2), in which case the device will not be considered for aggregation. This implies such a device wastes energy training a model that is ultimately not used for aggregation. We show in Section 4.4 how much energy is wasted on such devices that start training at time t and do not aggregate because they are not present at t + 1, but reappear at some future time. We alleviate this issue by considering for edge aggregation all devices that started training since the last global aggregation. In Fig. 3, we show how our solution considers up to 39.78% more devices than state-of-the-art solutions. For the normalized device availability, we consider 1.0 as the maximum number of devices available for aggregation at every time. We note that quite a few times all the plot lines go to zero, denoting times when no device out of the 1,000 considered is actually available.
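The eligibility rule above (every device that trained since the last global aggregation is considered whenever it is reachable again) can be sketched with plain set operations; all names here are hypothetical:

```python
def devices_to_aggregate(trained_since_global, available_now):
    """Dynamic edge aggregation eligibility: intersect the set of devices
    that trained since the last global aggregation with the devices
    currently reachable by some AP. A device that trained at t, vanished
    at t+1, and reappeared at t+2 is still counted, unlike schemes that
    only aggregate at t+1."""
    return trained_since_global & available_now

trained = {"A", "B", "C"}                   # all three trained at 1PM
at_2pm = devices_to_aggregate(trained, {"A", "B"})       # C is offline at 2PM
at_3pm = devices_to_aggregate(trained, {"A", "B", "C"})  # C reappears, still eligible
```

The `trained_since_global` set is cleared only after a global aggregation, which is what keeps device C's 1PM effort from being wasted.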

Communication Model
Recent findings in [29] show that even in 2022, traditional technologies like 4G LTE and Wi-Fi 4/5 are still used by the majority of mobile users due to their more mature deployment and stable performance. This is why we model the power characteristics of LTE and Wi-Fi using the measurements from [11]. The data transfer power model (best fit) parameters from [11] are summarized in Table 1. Assuming the upload throughput is t_u [Mbps] and the download throughput is t_d [Mbps], the power level [mW] is P_u = α_u t_u + β for upload and P_d = α_d t_d + β for download (as shown in [11]), where α_u and α_d are the power model parameters and β is the base power when the throughput is 0 (see Table 1). Considering µ the model size [Mb] and comm [Mb] the extra bits required for communication by the FL framework, we compute the total communication energy [mJ] for a device as in Eq. 1 for a complete communication round, i.e., when the device receives the model and uploads the updated model back to the AP:

E_total = P_d (µ + comm)/t_d + P_u (µ + comm)/t_u    (1)

For realistic LTE connection throughputs, we consider the popular mobile carriers for their upload and download speeds, as summarized in Table 2. To set the minimum and maximum t_u and t_d, we consider the distance δ_{i,j} between an AP i and a device j. Given a random selection of 100 APs, we compute, for the entire month of May 2020, the mean δ_mean and standard deviation δ_std of the distance between all devices and their selected APs. If a device is further than 100m from its AP, we select t_d and t_u based on Eq. 2, where t_min and t_max are taken from Table 2 and map(·) is a linear distance-to-throughput mapping function.
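Under the linear power model and Eq. 1 above, the per-round communication energy can be sketched as below. The parameter values used in the comments are placeholders, not the actual Table 1 / Table 2 entries:

```python
def power_mw(throughput_mbps, alpha, beta):
    """Linear data-transfer power model: P = alpha * t + beta [mW],
    where alpha is the per-Mbps slope and beta the base power at t = 0."""
    return alpha * throughput_mbps + beta

def comm_energy_mj(model_mb, comm_mb, t_d, t_u, alpha_d, alpha_u, beta):
    """Eq. 1: total energy [mJ] for one full round (download + upload).
    Energy = power [mW] * time [s], and time = payload [Mb] / throughput [Mbps]."""
    payload = model_mb + comm_mb
    e_down = power_mw(t_d, alpha_d, beta) * payload / t_d
    e_up = power_mw(t_u, alpha_u, beta) * payload / t_u
    return e_down + e_up

# Placeholder parameters (NOT the paper's Table 1/2 values):
# 8 Mb model, no framework overhead, 10/5 Mbps down/up, beta = 100 mW.
energy = comm_energy_mj(8.0, 0.0, t_d=10.0, t_u=5.0,
                        alpha_d=1.0, alpha_u=2.0, beta=100.0)
```

Note that the units work out directly: Mb divided by Mbps gives seconds, and mW times seconds gives mJ.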

MOHAWK Framework
In FL, we want to solve an optimization problem of the form:

min_ω F(ω) = Σ_{i ∈ D} (|D_i| / Σ_{j ∈ D} |D_j|) f_i(ω)    (3)

where f_i is the loss function of device i evaluated on the local dataset D_i, D is the set that contains all devices, and |·| denotes the number of elements of a set. We solve the optimization problem in Eq. 3 over time. We show in Fig. 4 the main steps of MOHAWK at different time steps. In Alg. 1, each available device i selects its AP α_i using Eq. 4 (dynamic community selection), and after an edge aggregation all aggregated devices are removed from the trained set: Tr = Tr \ S_α. The edge aggregation weights each local model according to Eq. 5, where σ is a hyperparameter and cos(x, y) = ⟨x, y⟩/(‖x‖‖y‖) is the cosine similarity function. Inspired by FedAMP [12], we use a weighting based on the cosine similarity function to better address the mobile nature of the devices, which may lead them to change the AP they connect to at every communication round they are available; see [7]. After a device i gets aggregated, it is removed from Tr (Line 22 in Alg. 1), and Tr gets reset at every global aggregation (Line 29 in Alg. 1).

Selective Global Aggregation. Finally, the device availability issue propagates to the AP level, since some APs, at certain time steps, may not have any devices to aggregate; hence, we ask how the global aggregation can be performed in such a scenario. Since some APs will not have any update for the cloud, it makes sense not to aggregate them, preventing any communication with the cloud and thus saving energy and communication capacity. We name this aggregation strategy selective global aggregation and implement it using the subset of APs Â ⊆ A that performed at least one edge aggregation since the last global aggregation. We perform a selective global aggregation after k_2 dynamic edge aggregations. In Line 17 of Alg. 1, we save the APs that will be aggregated in the cloud, while in Line 28 we perform the selective global aggregation as in Eq. 6, where ω are the global model weights, σ is the same hyperparameter from Eq. 5, and cos is the cosine similarity function. Some APs (e.g., from a dense area such as Downtown) may have aggregated many devices, while others may have aggregated very few. In order to not diverge too far from the global model, we weight the contribution of each AP based on the cosine similarity with the current global model weights. The cloud requests updates from all APs but, in the end, only the APs that performed at least one dynamic edge aggregation actually send their updated models to the cloud for aggregation.
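The exact form of Eq. 6 is not reproduced here; one plausible reading of the cosine-similarity weighting described above is a softmax over each AP's similarity to the current global model, scaled by σ. This is a hedged sketch of that reading, not the paper's exact rule:

```python
import numpy as np

def cosine(a, b):
    """cos(x, y) = <x, y> / (||x|| ||y||)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def selective_global_aggregation(global_w, ap_weights, sigma=0.1):
    """Aggregate only APs that performed at least one edge aggregation
    (ap_weights holds their flattened model vectors). Each AP is weighted
    by a softmax over its cosine similarity to the current global model,
    so APs that drifted far from it contribute less."""
    if not ap_weights:
        return global_w  # no AP updated since the last round: keep the global model
    sims = np.array([cosine(w, global_w) for w in ap_weights])
    coef = np.exp(sims / sigma)
    coef /= coef.sum()
    return sum(c * w for c, w in zip(coef, ap_weights))
```

Because APs with no edge aggregations are simply absent from `ap_weights`, they never communicate with the cloud, which is the source of the energy savings discussed above.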
To summarize, we start with dynamic community selection and local training for k_1 local epochs. Then, we move to the next time step t + 1 and, since all devices may have changed their position and/or availability, we perform dynamic community selection again for the available devices. On their newly selected APs, we run dynamic edge aggregation. We repeat this process k_2 times. Finally, we perform selective global aggregation with the APs that did at least one dynamic edge aggregation since the last selective global aggregation, and send the updated global model to all APs.
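The schedule summarized above (community selection and training at every time step, one selective global aggregation every k_2 edge aggregations) can be sketched as an event timeline; the event names are our own labels:

```python
def mohawk_schedule(num_steps, k2):
    """Skeleton of the MOHAWK timeline: every time step runs dynamic
    community selection, k1 local epochs of training, and one dynamic
    edge aggregation; every k2 edge aggregations trigger one selective
    global aggregation."""
    events = []
    for t in range(1, num_steps + 1):
        events.append((t, "select_community_and_train"))
        events.append((t, "dynamic_edge_aggregation"))
        if t % k2 == 0:
            events.append((t, "selective_global_aggregation"))
    return events
```

For instance, with k2 = 2, global aggregations fire at t = 2, 4, 6, ..., matching the k_2 = 2 sequence in Fig. 4.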

PERFORMANCE EVALUATION

Experimental Setup
We perform experiments using the Foursquare dataset [8] for the entire month of May 2020. Time t starts at May 1st, 2020, 12:00AM UTC and ends at May 31st, 2020, 11:00PM UTC; we consider every hour of the month of May, summing up to 744 time steps, thus having 744 communication rounds in total (i.e., 31 days, each with 24 hours). We consider the difference between two consecutive time steps t and t + 1 to be 1 hour. From 12,866 available devices, we select the top 1,000 that appear most often, and from 37,994 APs we randomly select only 100. We run each experiment three times with different seeds and report average values. All experiments are run using two GPU servers, each with 4×A6000 GPUs, a 64-core AMD Threadripper PRO 3995WX CPU, and 512GB RAM. On each device, we use a simple convolutional model composed of one convolutional layer with 32 filters and MaxPooling, two convolutional blocks, each with two convolutional layers with 64 filters followed by MaxPooling, and, finally, a fully connected layer with 512 neurons.
We use both independent and identically distributed (IID) and non-IID settings, similar to [10, 15], by controlling the α parameter of the Dirichlet distribution. We set α = 100 for the IID scenario and α = 0.1 for the non-IID scenario. We randomly sample from each class the number of images dictated by the Dirichlet distribution. Similar to [10, 15], we use 500 images per device for MNIST [14], CIFAR10 [13], and EMNIST [6], and 2,500 images per device for CIFAR100 [13]. This yields, in the IID case, around 50 images per class for MNIST and CIFAR10, around 8 images per class for EMNIST, and approximately 25 images per class for CIFAR100.
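The Dirichlet-based partitioning can be sketched as below; the rounding and seeding details are our own assumptions:

```python
import numpy as np

def dirichlet_counts(num_classes, images_per_device, alpha, rng=None):
    """Per-class image counts for one device: class proportions are drawn
    from Dirichlet(alpha, ..., alpha). A large alpha (e.g., 100) gives a
    near-uniform split (IID); a small alpha (e.g., 0.1) concentrates the
    mass on a few classes (non-IID)."""
    rng = rng or np.random.default_rng(0)
    p = rng.dirichlet([alpha] * num_classes)
    counts = (p * images_per_device).astype(int)
    counts[0] += images_per_device - counts.sum()  # assign the rounding remainder
    return counts
```

Each device then samples that many images per class from the dataset, so the per-device label skew is controlled entirely by α.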
For HierFAVG, we consider (for all time steps) the same device-AP configuration, i.e., we fix all devices with a pre-assigned AP. This forces every device to connect only to its pre-assigned AP, irrespective of the distance between them (just as in [18]). For MACFL, at every time step, a device has a 50% probability of moving to one of the "neighboring" APs. We create a neighborhood of APs to match the setup from [7]: we connect P_i to P_{i±1} such that every AP has two neighbors. Since APs are randomly selected for three different runs from approximately 37,994 possibilities, P_i could end up very far from P_{i±1}. Since MACFL does not consider the real distance between the APs, this "virtual" assignment of APs indeed matches the experimental setup provided in [7]. We consider the following hyperparameter values: σ = 0.1 (from Eq. 5 and Eq. 6) and learning rate 0.01. Following the ablation studies discussed in Section 4.3, we choose batch size 8 and k_1 = 5 for all experiments.

Empirical Results
Learning performance. The design choices of MOHAWK make our method better than other state-of-the-art HFL approaches. As shown in Table 3, for all four datasets running HFL (every hour for an entire month), we achieve very similar accuracy for MNIST and better accuracy for all other datasets. We observe in some cases that the drop between IID and non-IID scenarios is smaller for MOHAWK than for HierFAVG or MACFL. For CIFAR10 and k_2 = 2, the drop in accuracy between IID and non-IID is 7.89% for HierFAVG and 7.73% for MACFL, while for MOHAWK it is only 4.77%. This shows that MOHAWK is more robust against data heterogeneity. In our experiments, we observe that, for the same hyperparameters, when we increase k_2, the performance degrades for both the baselines and MOHAWK. This confirms that, even for real mobility and availability of devices, the findings from [7, 18] still hold: frequent edge aggregations (e.g., k_2 = 2) are beneficial to the learning performance in HFL.
In Table 4, we show how much faster MOHAWK converges to a certain accuracy threshold compared to the baselines. As highlighted in Table 4, for any given k_2 value, MOHAWK speeds up convergence by at least 1.43× and up to 3.88×. The reason for such good convergence rates against state-of-the-art HFL solutions is the adaptation to real mobility and availability. By using dynamic edge aggregation and selective global aggregation, MOHAWK has the upper hand in every scenario. This also shows how the real-world mobility and availability of devices impact the current state-of-the-art in HFL. Thus, we show the importance of and need for mobility-aware HFL solutions like MOHAWK.

Figure 6: Global test loss for MOHAWK and various baselines. We observe that for very small k_2 values, i.e., k_2 = 2, the global test loss may have a bigger variability since the number of aggregated APs is very low. The baselines aggregate all 100 APs, even those that did not train at all, at the expense of a slower but more stable convergence.

In Fig. 5, we show the global test accuracy for both IID and non-IID scenarios (k_2 = 5 for all datasets). Overall, we can see higher accuracy levels and faster convergence over all datasets and data heterogeneity scenarios. In Fig. 6, for all k_2 values and data heterogeneity scenarios, we show the global test loss for MNIST and CIFAR10. We observe that for very small k_2 values there is more variability in the global test loss for MOHAWK, due to the selective global aggregation. Since at night there are very few to no devices available, we aggregate just a few devices and then update the global model based on a small number of APs (since only a few of them perform edge aggregations). This issue disappears for higher k_2 values, since we allow more time for APs to perform edge aggregations (and, hence, be considered for the selective global aggregation). The faster convergence due to the selective aggregation is also clearly visible for all k_2 values in Fig. 6.
Communication efficiency. For simulation, we estimate the energy consumption for communication using the models described in Section 3.2. In Table 5, we observe the increase in device availability compared to existing methods. When real-world mobility and device availability are present, our dynamic edge aggregation considers up to 39.78% more devices for aggregation (k_2 = 10). We can also see that the dynamic community selection reduces the average distance between APs and the devices that connect to them by 7.74×. This forms communities of tightly grouped devices around their AP, thus allowing faster communication between them. As can be seen in Table 6, compared to state-of-the-art HFL solutions, MOHAWK improves the communication efficiency of HFL under mobile and heterogeneous environments by up to 3.87×. In Fig. 7, we show that, to achieve a given accuracy threshold, the amount of energy spent for communication is drastically reduced (up to 3.87×) compared to current state-of-the-art approaches.

Ablation Studies
Batch size variation. We perform an ablation study with k_2 = 5 to determine the best batch size to run MOHAWK with. For this, we fix k_1 = 1 to run the ablation experiments faster. We evaluate four different batch sizes on all datasets. As can be seen in Table 7, the best batch size by far, for all datasets, is 8.
Variation of k_1 and k_2. Using k_1 = 1 results in lower accuracies for CIFAR10 and CIFAR100, so we followed up with another ablation study. We explore, for MNIST and CIFAR10, different values of k_1 local epochs and k_2 dynamic edge aggregations to see which work better. The jump in accuracy from k_1 = 1 to k_1 = 5 proves to be much larger than the jump between k_1 = 5 and k_1 = 10 (see Table 8). Since we consider real edge devices and the additional computation cost of more local epochs, we choose k_1 = 5.

Table 7: Ablation study for MOHAWK accuracy [%] using different batch sizes for k_1 = 1 and k_2 = 5 (IID: α = 100; non-IID: α = 0.1). We observe that the best results over all datasets are for batch size 8.

Scalability of MOHAWK. We show in Table 9, using larger numbers of APs and devices, how scalable MOHAWK is. We observe similar accuracy on CIFAR10 for both IID and non-IID settings, while all other HFL solutions show a decrease in accuracy as the number of APs increases. This is because MOHAWK uses selective global aggregation, which makes it robust to variability in the number of APs considered.

Hardware Prototyping and Validation
As can be seen in Fig. 8, we designed and built a custom testbed with 36 Raspberry Pi 3B+ devices. Each Raspberry Pi is connected to a Smart Power 2 device for real-time power and energy measurements. We run FL using a GPU server and communicate wirelessly through a local router. We measure the total amount of energy spent in a 36-device experiment using 20 APs. We use the GPU server to run the global server and the 20 APs.
We consider that a device is wasting energy if it is training a model which is ultimately not aggregated. Such devices do not improve the learning performance of the FL system, but waste their already limited resources (e.g., battery, memory). To the best of our knowledge, no current state-of-the-art HFL method accounts for this kind of wasted energy. This is why, on our hardware prototype, we run only HierFAVG as a baseline. We use k_1 = 1 for MNIST and k_1 = 5 for CIFAR10, running only the IID setting on both datasets. As seen in Table 10, MOHAWK achieves up to 4.25× less energy wasted (on average) per communication round and up to 2.24× less energy wasted (on average) per device. The total energy wasted over the entire experiment is reduced by up to 2.56×; this shows how much more energy-efficient MOHAWK is in real scenarios. The energy measurements represent the energy spent on both training and communication while performing FL.

CONCLUSION
We have proposed a Mobility and Heterogeneity-Aware Dynamic Community Selection algorithm (MOHAWK) for mobile federated learning systems. Our approach takes into consideration real device mobility and selects the closest access point for every device to connect to; this leads to a significant reduction in the energy consumption for communication. To improve the learning performance, we have proposed two new aggregation strategies, namely dynamic edge aggregation and selective global aggregation, that increase the number of devices aggregated at every time step by up to 39.78%; this also helps the global model learn up to 3.88× faster, while also achieving a higher accuracy, on average.
Limitations and future work. The current communication model can be improved in several ways, e.g., by considering channel scheduling and multi-hop networks of APs. For dynamic edge aggregation, we currently consider all available devices, regardless of the quality and security vulnerability of their local models. A more robust and secure selection of devices could improve the overall performance. All these ideas are left for future work.

Figure 1: Our proposed MOHAWK framework selects at every time step dynamic communities of devices and their closest Access Point (AP). Within 100m range of an AP, the devices have a Wi-Fi connection, otherwise LTE. Dynamic edge aggregation allows devices like device C (at 2:00PM) that may disappear due to battery depletion to be reconsidered for learning the next time they appear, as long as the global model from the cloud was not updated in the meantime. Selective global aggregation only aggregates in the cloud the APs that aggregated at least one device since the last global aggregation (e.g., AP₂ and AP₃).
1) Due to availability variations, a device may only download the model at time t and upload it at a later time t + λ when it becomes available again. The total energy spent for communication at time t is E_total = P_d · μ_comm / t_d when the device only downloads the global model, while it becomes E_total = P_u · μ_comm / t_u at time t + λ when the device is only uploading the local model.
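The energy model above follows the standard power × (payload / throughput) pattern. A minimal sketch, where all numeric values (powers, model size) are illustrative stand-ins rather than the paper's measured parameters:

```python
# Sketch of a power x (payload / throughput) communication-energy model.
# Values are illustrative; only the functional form follows the text.

def comm_energy(power_w, payload_mbit, throughput_mbps):
    """Energy [J] spent transferring payload_mbit at throughput_mbps:
    power [W] times transfer time [s]."""
    return power_w * (payload_mbit / throughput_mbps)

P_d, P_u = 1.2, 1.8      # download/upload power [W] (illustrative)
mu_comm = 40.0           # communicated model size [Mbit] (illustrative)
t_d = t_u = 1000.0       # Wi-Fi throughput [Mbps], as in the text

e_down = comm_energy(P_d, mu_comm, t_d)  # device downloads at time t
e_up   = comm_energy(P_u, mu_comm, t_u)  # device uploads at time t + lambda
print(e_down + e_up)
```

Splitting the download and upload terms is what lets the model account for a device that downloads at t but only uploads at t + λ.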

Figure 4: Sequence diagram showing the device-AP-cloud communication, with k₁ local epochs, k₂ = 2 dynamic edge aggregations, and one selective global aggregation at different time steps.
is a function that linearly maps the distances within the range δ_mean ± δ_std to the throughput speeds [t_max, t_min]. If a device is within 100m of its selected AP, we use a Wi-Fi speed of t_u = t_d = 1000 Mbps. We use this communication model during simulation to compute the energy consumption for communication (see Section 4.2).
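Such a linear distance-to-throughput mapping can be sketched as follows; the function name and the clamping of out-of-range distances are our assumptions, and all parameter values in the example are illustrative:

```python
def distance_to_throughput(d, d_mean, d_std, t_max, t_min, wifi_mbps=1000.0):
    """Map distance d [m] linearly from [d_mean - d_std, d_mean + d_std]
    onto [t_max, t_min]: the farther the device, the lower the throughput.
    Within 100 m of the selected AP we assume Wi-Fi at wifi_mbps."""
    if d <= 100.0:
        return wifi_mbps
    lo, hi = d_mean - d_std, d_mean + d_std
    frac = min(max((d - lo) / (hi - lo), 0.0), 1.0)  # clamp to [0, 1]
    return t_max + frac * (t_min - t_max)

# Illustrative parameters: mean distance 500 m, std 200 m, LTE range 50-5 Mbps.
print(distance_to_throughput(50, 500, 200, 50, 5))   # 1000.0 (Wi-Fi)
print(distance_to_throughput(300, 500, 200, 50, 5))  # 50.0 (closest LTE range)
print(distance_to_throughput(700, 500, 200, 50, 5))  # 5.0 (farthest LTE range)
```

Note the interval [t_max, t_min] is deliberately ordered high-to-low so that larger distances yield lower throughput.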

Figure 4 shows how the main steps of MOHAWK are performed at different time steps. We assume that device i selects AP α at time t = 0. The first step is to download the global model weights ω from the cloud to AP α. Then the AP model weights at time t = 0, denoted by ω_α(t = 0), are downloaded by device i connected to AP α. Device i performs k₁ local epochs of stochastic gradient descent (SGD) on its local dataset. We assume that training a model on an edge device takes much longer than communicating the model between the devices, the APs, and the cloud. At t = 1, we assume device i selects the same AP α. Device i then sends the updated local model weights ω_i to AP α. The AP performs one dynamic edge aggregation to update its own weights and sends the updated weights ω_α(t = 1) back to device i. Device i again performs k₁ local epochs of SGD. At the next time step t = 2, the AP receives the updated local weights ω_i and performs another dynamic edge aggregation, i.e., k₂ = 2.

Algorithm 1: Mobility and Hardware-Aware Dynamic Community Selection (MOHAWK)
1: Initialize global weights ω with random weights and download them on all APs: ω_α ← ω, ∀α ∈ A
2: Initialize time t = 0, the set of APs for aggregation Â = ∅, and the set of devices that trained T = ∅
3: for communication round c = 1, 2, ..., C do
4:   for each device i ∈ D(t) in parallel do  ▷ For every device i available at time t
5:     if c == 1 then
6:       select AP α_i for device i  ▷ dynamic community selection

Since k₂ = 2, AP α sends the updated model to the cloud for a selective global aggregation. The cloud sends the updated global model weights ω back to AP α, which sends the updated AP model weights ω_α(t = 2) to device i, and then the process repeats itself. Dynamic Community Selection. At every time t, only a subset of devices D(t) ⊂ D is available; due to real device mobility and availability, we have |D(t)| ≪ |D|. Thus, we need to adapt the AP selection process to work dynamically for all APs α ∈ A, where A is the set containing all APs. As seen in Line 6 of Alg. 1, the first communication round begins with the dynamic community selection for each available device i ∈ D(t) by solving the following optimization problem:
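In essence, the selection chooses, for every available device, the AP at minimum Euclidean distance. A minimal sketch of this step; the function name and the device/AP coordinates are illustrative, not from the paper's code:

```python
import math

# Sketch of dynamic community selection: each available device picks the AP
# with the smallest Euclidean distance to its current position.

def select_communities(devices, aps):
    """devices, aps: dicts mapping id -> (x, y) position.
    Returns a dict mapping each device id to its chosen AP id."""
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])
    return {dev_id: min(aps, key=lambda a: dist(pos, aps[a]))
            for dev_id, pos in devices.items()}

devices = {"i": (10.0, 0.0), "j": (90.0, 90.0)}
aps = {"alpha": (0.0, 0.0), "beta": (100.0, 100.0)}
print(select_communities(devices, aps))  # {'i': 'alpha', 'j': 'beta'}
```

Because D(t) changes at every time step, this selection is re-run each round only over the currently available devices.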

Figure 5: Accuracy results using IID and non-IID settings for k₂ = 5. MOHAWK converges faster and obtains a higher accuracy in both IID and non-IID scenarios, for all datasets.

Figure 6: Global test loss for MOHAWK and various baselines. We observe that for very small k₂ values, i.e., k₂ = 2, the global test loss may show higher variability since the number of aggregated APs is very low. The baselines aggregate all 100 APs, even those that did not train at all, at the expense of a slower, but more stable, convergence.

Figure 7: Average energy spent for communication [Joules] until reaching an accuracy threshold of 97% on MNIST and 60% on CIFAR10, with different k₂ values under data heterogeneity constraints. For MNIST IID with k₂ = 2 (A), MOHAWK achieves the required accuracy threshold using up to 3.87× less energy for communication compared to the baselines (B and C). Similar considerations hold for CIFAR10 IID (D) and non-IID (E).

Figure 8: Hardware prototype with 36 Raspberry Pi 3B+ devices (outer semicircle) and 36 Smart Power 2 devices (inner semicircle) used for real-time power and energy measurements.

Table 2: LTE per-user throughput ranges for different carriers [4]. We consider a mixed range from all carriers.
d(α, i) is the Euclidean distance between AP α and device i, and α_i is the selected AP for device i. We denote by S_α the set of all devices connected to AP α.
Dynamic edge aggregation allows devices that started training their model at time t to aggregate their updates when they become available again, at t + λ, provided only edge aggregations were performed during the time λ that has passed.
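This acceptance rule can be sketched with a version counter standing in for the global-model timestamp; the counter and the function name are our simplification, not the paper's implementation:

```python
# Sketch of the late-update acceptance rule: a device that downloaded the AP
# model at time t may still contribute at t + lambda, as long as the cloud
# has not pushed a new global model in the meantime (only edge aggregations
# happened during lambda).

def accept_late_update(downloaded_global_version, current_global_version):
    """Accept the late local model only if the global model is unchanged."""
    return downloaded_global_version == current_global_version

print(accept_late_update(3, 3))  # True: only edge aggregations since t
print(accept_late_update(3, 4))  # False: a global aggregation happened
```

Rejecting updates trained against a superseded global model is precisely what prevents stale gradients from polluting the aggregation, at the cost of the rejected device's wasted energy.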

Table 4: Number of communication rounds required to achieve a threshold accuracy (Acc. thresh.) for both IID and non-IID settings. Overall, MOHAWK uses up to 3.88× fewer communication rounds.

Table 5: Number of devices available at every time step and the average distance between devices and their selected APs. MOHAWK improves the device-AP distance by 7.74×, while including up to 39.78% more devices in the learning process.

Table 6: Average energy spent for communication per client to achieve a certain accuracy threshold on MNIST and CIFAR10 (in simulation) for both IID and non-IID settings. Overall, MOHAWK uses up to 3.87× less energy for communication to reach a given accuracy threshold, i.e., 97% for MNIST and 60% for CIFAR10.

Table 8: Ablation study of MOHAWK accuracy [%] using different k₁ and k₂ values with batch size 64. We observe a large jump in accuracy from k₁ = 1 to k₁ = 5, hence we use k₁ = 5 in the main experiments for both MNIST and CIFAR10.

Table 9: Ablation study on MOHAWK scalability for larger, more realistic numbers of devices and APs on CIFAR10 with k₁ = 5 and k₂ = 5. Overall, we observe that MOHAWK provides similar performance in all scenarios, while other approaches suffer large performance reductions.
Some low-budget devices may take up to one hour to train 5 local epochs, hence this setup also accounts for realistic training times.

Table 10: Hardware prototype experiment for k₂ = 5 in the IID setting, using k₁ = 1 for MNIST and k₁ = 5 for CIFAR10. The wasted energy is the energy a device spends training and communicating with the cloud without the cloud actually considering its local model for aggregation. We observe that MOHAWK not only uses far less energy overall, but also wastes less of it.