Personalized Federated Hypernetworks for Multi-Task Reinforcement Learning in Microgrid Energy Demand Response

As sensors pervade the built environment, they have fueled the advance of data-driven models that promise greater efficiency for microgrid management. However, this has raised concerns over data privacy and data ownership. The paradigm of federated learning has emerged in supervised learning to address these issues, but work on federated RL is relatively rare and focuses on training global models that do not account for the heterogeneity of data from different microgrids. We develop the first application of Personalized Federated Hypernetworks (PFH) to Reinforcement Learning (RL). We then present a novel application of PFH to few-shot transfer, and demonstrate significant initial gains in learning speed. PFH has never been demonstrated beyond supervised learning benchmarks, so we apply PFH to an important domain: RL price-setting for energy demand response. We consider a general case in which agents are split across multiple microgrids, wherein energy consumption data must be kept private within each microgrid. Together, our work explores how the fields of personalized federated learning and RL can come together to make learning efficient across multiple tasks while avoiding the need for centralized data storage.


INTRODUCTION
As Reinforcement Learning (RL) is brought to bear on pressing societal issues such as the green energy transition, the types of environments that RL must perform well in may display characteristics exotic to classical RL environments. Real applications at scale may require privacy guarantees that modern multi-agent RL algorithms do not provide, as they may train on privileged or corporate data [22,27,36]; any application that personalizes an RL agent to individual users must take care to protect their privacy by not storing all their data in a central server. Real-world applications will also likely feature heterogeneous tasks: every user, robot, or energy system will have different traits that cannot be accounted for by "one size fits all" algorithms.
One approach toward privacy preservation through decentralized data is federated learning [34]. Federated learning algorithms train a global model from gradient updates sent by individual clients training on their own data, which is never sent to the central server. One extension of this technique is personalized federated learning using hypernetworks (PFH, Shamsian et al. [32]), which tailors behavior to individual heterogeneous tasks by splitting the model into a global common component (i.e., the hypernetwork) and a local individual component (a local network generated by the hypernetwork) tailored to each client. This task specialization allows common features to be learned together in the global component while client-specific knowledge is learned in the local component.
As previous work in federated RL [5,25,28,42] does not extend to personalized models, federated RL currently comes at a significant performance cost. We present a novel application of PFH to RL in a realistic power systems setting that requires both decentralized data and heterogeneous agents to accommodate diverse, sensitive environments. An RL controller optimizing hourly transactive energy pricing has been shown to optimize energy usage [3,19,35,38] by incentivizing consumers, at the scale of groups of buildings (microgrids) or office workers within buildings, to shift demand to different times of day. By guiding consumers to defer energy demands to hours when solar is especially active, RL price-setting can drop a building's carbon impact to 48% of normal operation [16], which could have massive implications for grid sustainability. However, RL can be extremely data hungry; prior transactive control attempts required about 80 years of training data [3].
Controllers that learn pricing have minimal room for error, as deployment costs can mount quickly. Learning from a distributed base of test cases is therefore useful for faster learning.
To increase the amount of data available, we consider multiple RL agents, each managing its own (slightly different) microgrid through energy prices and collecting data in parallel. This microgrid environment is a multi-task, multi-agent setup in which the management of each microgrid, through prices, constitutes a task. We characterize our problem as multi-agent because we have multiple RL agents optimizing a shared reward (total profit), and multi-task because optimizing profit in each of the different microgrids presents tasks that are related but also independent due to differences in size, number of batteries in each building, etc. We hypothesize that we can accelerate training by incorporating data from multiple microgrids with different characteristics. Learning to set prices using data from multiple microgrids (source tasks) also opens the door to few-shot learning in new microgrids (target tasks), wherein we learn to generate near-optimal prices for a microgrid very quickly.
However, privacy concerns are paramount for energy data. We hope to contribute to privacy protection by aggregating learning, not data, at one central source. Keeping buildings' energy consumption data in one central location would present a major privacy concern if that machine were compromised, and message passing of the raw data would present an additional source of vulnerability. Although each microgrid might have access to the data of only a few buildings at a time, the scale of damage would be much larger if data from multiple microgrids were stored in a central server.
We now present a hypothetical setting in which our architecture would be useful. One could imagine a hacker learning when the hypothetical company CovertAI trains its new 80-quintillion-parameter language model CPT-5 from the energy consumption of CovertAI's compute warehouses. The hacker could sabotage power lines at the right moment to erase learning gains. They may then turn their attention to residential neighborhoods. Here, they could figure out when people are not home from the energy consumption of domestic buildings, timing a theft; they could also disaggregate energy signals to learn which appliances a homeowner has, or glean sensitive health information if medical devices produce noticeable patterns in energy consumption.
Applying PFH to the energy application remedies both of these competing issues. PFH allows for decentralized learning such that data is never aggregated to a central source, and accounts for heterogeneous tasks by generating RL agents individualized to each microgrid's size, number of solar panels, batteries, etc. We demonstrate that PFH learns the underlying factors that define an environment by applying it to the microgrid price-setting problem, where we observe increases on the scale of millions of dollars in total microgrid profit (reward) over federated and local learning. We also demonstrate how PFH can be used for few-shot transfer learning for new local agents entering the system, reporting drastic training speed-ups (>100x) when transferring from source tasks to target tasks. PFH thus drastically increases the feasibility of RL for energy price-setting.
In previous work in multi-agent RL (MARL), hypernetworks have been used to combine local agent networks into a global model (i.e., Q-Mix); however, they were used to generate mixing networks, not policies [27]. Other works have used federated learning in RL [5,25,28,42], but focus on learning global models, not personalized models for heterogeneous tasks.

Novelty
Methodologically, our paper is novel in its adaptation of a state-of-the-art federated learning algorithm to RL. To our knowledge, we are the first to explicitly apply personalized federated learning to multi-task, multi-agent RL when centralized learning and joint action-values are unavailable. Application-wise, our paper is novel in its improvement of energy demand response across heterogeneous microgrids. We hope our work highlights an important microgrid environment to the RL community, helps establish the use of PFH within RL, and allows RL to address problems where learning speed and decentralized data are fundamental.

DEFINITIONS
Energy demand response is a technique used by grid operators to incentivize consumers to shift demand to times when it is better for grid stability or climate emissions, such as when solar energy peaks. Demand response serves the same function as grid-level batteries in easing the volatility of wind and solar energy, and is seen as an important tool in the energy transition [4].
A microgrid is defined as a small group of buildings that transact energy with each other through some energy aggregator, governed by an hourly energy pricing scheme. One may imagine they are situated close together with respect to not only geography but also the wiring topology of the grid, making trading within the microgrid preferable to trading with the grid. We will refer to groups of microgrids as "microgrid clusters." A prosumer is an entity that both consumes and produces energy, like a building with rooftop solar.
We wish to disambiguate between multi-task and multi-agent for the reader's convenience. We use them in the conventional sense: multi-task relates to multiple related settings (in our case, slightly different MDPs in each microgrid), whereas multi-agent refers to multiple different policies.
A hypernetwork is a neural network that outputs the weights of another neural network.
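For illustration, a minimal PyTorch sketch of this definition (all sizes invented): a linear hypernetwork maps an 8-dimensional embedding to the flattened weights of a small 4-to-10 target layer.

```python
# Minimal hypernetwork sketch: one network emits the weights of another.
# All sizes here are invented for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

hyper = nn.Linear(8, 10 * 4 + 10)   # embedding -> flattened (weight, bias) of a 4->10 layer
z = torch.randn(8)                  # conditioning embedding
flat = hyper(z)
W, b = flat[:40].view(10, 4), flat[40:]
x = torch.randn(4)
y = F.linear(x, W, b)               # forward pass through the *generated* layer
print(y.shape)                      # torch.Size([10])
```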
A federated algorithm is one that does not require communication or storage of raw data samples at a central server. Although some data can still theoretically be recovered from non-data communication such as models [9], we view robustness against this type of privacy breach as out of the scope of our work.

RELATED WORK
We position our work within an ecosystem of literature on transactive pricing in microgrids. A price-setting RL agent was first shown to help an energy aggregator improve demand response and generate a profit [3]. Since then, a number of works have explored the issue [12,29,33,43], with some exploring different configurations of RL.
As an example of federated learning, "Distributed Selective Stochastic Gradient Descent" (DSSGD) [34] is an interesting method that deserves further exploration by the interested reader. DSSGD has each local model exchange select parameters and gradient updates with the central server. In contrast, FedAvg [23] simply averages all local model gradient updates and syncs all local model parameters. Within personalized federated learning, there have been techniques other than PFH, such as Moreau envelopes [37], multi-task learning [18], personalization layers [6], and local representations [20]. However, [32] showed PFH performs better than several of these algorithms in the supervised learning setting.
Existing multi-agent environments are often solved through multi-agent RL algorithms like MADDPG [22], VDN [36], Q-Mix [27], and MAAC [15], but these all require a central machine that can access data from all the agents during training. Works using multi-agent RL for energy trading generally use some variant of these centralized methods [10,13,26]. Other works use federated hypernetworks for multi-task setups, but not in RL.
The combination of the two fields, federated multi-agent reinforcement learning, has focused mainly on learning global models, not personalized models for heterogeneous tasks [5,17,25,28,40,42,48]. Decentralized multi-agent reinforcement learning does learn personalized models [45,46], but it may be difficult to scale up a decentralized system such that each agent can benefit from the experiences of all the others without large communication costs. This is less of an issue for federated learning, as communication only needs to occur between clients and a server rather than between clients and all their peers. Although decentralized systems have their benefits, we focus on federated systems in this work.
We note that while federated learning takes significant steps toward privacy preservation, it does not completely guarantee privacy, as works have shown that the transmission of gradients can allow one to recreate some private data [14,24,44]. While our work guarantees privacy to the extent of other works within the field of federated learning [2,41,47], one should apply the term privacy-preservation to our work with the same caveats as to the rest of the field.

METHODS

Learning Environment
The MicrogridLearn [3] environment is an OpenAI Gym [8] environment used to study RL-set pricing in prosumer aggregations. Specifically, an RL agent and an energy utility both broadcast a day's worth of hourly buy and sell prices, $\vec{p}_b^{\,RL}, \vec{p}_s^{\,RL} \in \mathbb{R}^{24}$ and $\vec{p}_b^{\,U}, \vec{p}_s^{\,U} \in \mathbb{R}^{24}$ respectively, to a microgrid's simulated prosumers, who choose at the beginning of the day which hours they will transact with the RL agent and which hours they will transact with the energy utility. We chose this action space for the RL agent according to the description of MicrogridLearn in [3]. Each prosumer is an office building characterized by a year's worth of historical data and user-defined, non-negative battery and photovoltaic capacities. At every step, i.e., one day where all 24 hours are considered, every prosumer solves a convex optimization problem over its battery charging/discharging, $\vec{d}^{\,+}, \vec{d}^{\,-}$, to maximize its individual profit:

$$\max_{\vec{d}^{\,+},\,\vec{d}^{\,-}} \; \big\langle \vec{p}_s,\, \max(\vec{g} - \vec{c} + \vec{d}^{\,-} - \vec{d}^{\,+},\, 0) \big\rangle + \big\langle \vec{p}_b,\, \min(\vec{g} - \vec{c} + \vec{d}^{\,-} - \vec{d}^{\,+},\, 0) \big\rangle \quad (1)$$

where $\vec{g}$ and $\vec{c}$ are the prosumer's inflexible energy generation and consumption, respectively, $\vec{p}_b$ and $\vec{p}_s$ are the hourly buy and sell prices of whichever counterparty (RL agent or utility) the prosumer selected for each hour, $\langle \cdot, \cdot \rangle$ is a dot product, and the min and max are taken elementwise. The first term, i.e., the elementwise maximum, is thus the gross profit from energy the prosumer sells, and the second term, i.e., the elementwise minimum, is the gross expenditure on energy the prosumer buys. Every vector here is a 24-hour vector, and opposing actions are exclusive (sell vs. buy, charge vs. discharge), so individual prosumers cannot both buy and sell energy during the same hour; entries of the sell vector for hours in which the prosumer is buying are 0. Note that different prosumers may make different transactive choices depending on their battery and photovoltaic capacities. By ensuring that each prosumer can transact with either the utility or the microgrid, we incentivize the microgrid to output prices that beat the utility's, guaranteeing a better experience for prosumers under this microgrid structure. One important simplification we have made is that we model human behavior in $\vec{c}$ as fixed relative to the price signal; we do not expect humans to change behaviors (e.g., eating lunch at a different time to take advantage of cheaper energy prices). We only model how distributed batteries could be automatically controlled to maximize the prosumer's profit.
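As a concrete (hypothetical) sketch of Eq. 1, the program below solves one prosumer's day with cvxpy, using the standard reformulation that splits net energy into nonnegative sell/buy parts; this recovers the elementwise max/min terms whenever sell prices do not exceed buy prices. All names, numbers, and battery constraints are illustrative, not MicrogridLearn's actual interface.

```python
# Hypothetical sketch of one prosumer's daily optimization (Eq. 1) in cvxpy.
import cvxpy as cp
import numpy as np

H = 24                                      # hours in one environment step
p_sell = np.random.uniform(0.02, 0.08, H)   # price the prosumer receives per kWh sold
p_buy  = np.random.uniform(0.10, 0.20, H)   # price the prosumer pays per kWh bought
gen    = np.random.uniform(0.0, 5.0, H)     # inflexible generation (e.g., solar)
cons   = np.random.uniform(0.0, 5.0, H)     # inflexible consumption
batt_capacity, batt_power = 10.0, 2.0       # assumed storage and charge-rate limits

d_plus  = cp.Variable(H, nonneg=True)       # battery charging
d_minus = cp.Variable(H, nonneg=True)       # battery discharging
e_sell  = cp.Variable(H, nonneg=True)       # max(net, 0): energy sold
e_buy   = cp.Variable(H, nonneg=True)       # -min(net, 0): energy bought

net = gen - cons + d_minus - d_plus         # net energy available each hour
soc = cp.cumsum(d_plus - d_minus)           # battery state of charge (starts empty)

constraints = [
    e_sell - e_buy == net,                  # split net into sell/buy parts
    d_plus <= batt_power,
    d_minus <= batt_power,
    soc >= 0,
    soc <= batt_capacity,
]
# Since p_sell <= p_buy elementwise, the optimum never buys and sells in the
# same hour, so e_sell and e_buy recover the elementwise max/min of Eq. 1.
profit = p_sell @ e_sell - p_buy @ e_buy
cp.Problem(cp.Maximize(profit), constraints).solve()
print(f"optimal daily profit: {profit.value:.2f}")
```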
In an environment with this collection of prosumers, the RL agent solves an MDP with state space $\mathcal{S} := (\vec{s}, \vec{E}_{b,-1}, \vec{E}_{s,-1})$, where $\vec{s}$ is the day's solar prediction and $\vec{E}_{b/s,-1}$ are the prosumers' energy bought and sold from the previous day. (We abuse the $\vec{E}$ notation to define $\vec{E}_{b/s}$ as the total amount of energy bought from or sold to the RL agent hour by hour.) The RL controller emits actions $\mathcal{A} := (\vec{p}_b^{\,RL}, \vec{p}_s^{\,RL}) \in \mathbb{R}^{24+24}$. The agent seeks to maximize a long-term discounted reward defined by its individual profit:

$$r := \langle \vec{p}_s^{\,RL}, \vec{E}_b \rangle - \langle \vec{p}_b^{\,RL}, \vec{E}_s \rangle$$
Because the agent is an aggregator that does not generate its own energy, its profit comes from the difference between the price of the energy it buys from prosumers and the price of the energy it sells to prosumers at each timestep. Any excess supply or demand is transacted with the energy utility. In this way, the environment neatly models a realistic transactive system.
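To make the MDP interface concrete, here is a minimal Gym-style skeleton with a placeholder prosumer response; the shapes follow the state/action definitions above, but the dynamics are stand-ins rather than MicrogridLearn's actual implementation.

```python
# Illustrative skeleton of the price-setting MDP; not MicrogridLearn itself.
import numpy as np
import gym
from gym import spaces

class PriceSettingEnv(gym.Env):
    """One step = one day. State: solar forecast plus the previous day's
    bought/sold energy; action: 24 hourly buy prices and 24 sell prices."""

    def __init__(self):
        self.observation_space = spaces.Box(0.0, np.inf, shape=(72,), dtype=np.float32)
        self.action_space = spaces.Box(0.0, 1.0, shape=(48,), dtype=np.float32)
        self._prev_bought = np.zeros(24, dtype=np.float32)
        self._prev_sold = np.zeros(24, dtype=np.float32)

    def _obs(self):
        solar_forecast = np.random.rand(24).astype(np.float32)  # placeholder forecast
        return np.concatenate([solar_forecast, self._prev_bought, self._prev_sold])

    def reset(self):
        return self._obs()

    def step(self, action):
        p_buy, p_sell = action[:24], action[24:]
        # Placeholder response: the real environment solves Eq. 1 for every
        # prosumer to obtain the energy bought from / sold to the agent.
        e_bought = np.random.rand(24).astype(np.float32)
        e_sold = np.random.rand(24).astype(np.float32)
        reward = float(p_sell @ e_bought - p_buy @ e_sold)  # aggregator profit
        self._prev_bought, self._prev_sold = e_bought, e_sold
        return self._obs(), reward, False, {}
```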

Reinforcement Learning
We use Proximal Policy Optimization (PPO) [30], a popular actor-critic algorithm, to train all of our RL agents to solve the MDP introduced in 4.1, because PPO is reliable and highly performant. Note that the algorithms introduced in 4.3 and 4.4 are both agnostic to the local policy architecture; one could use any gradient-based model.
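A local training loop might look like the following, using Stable-Baselines3's PPO for concreteness (the paper does not name a specific implementation, and `PriceSettingEnv` is the hypothetical skeleton sketched in 4.1):

```python
# Hypothetical local PPO training on one microgrid's environment.
# Stable-Baselines3 is one possible PPO implementation; any gradient-based
# policy optimizer would fit the algorithms in 4.3 and 4.4.
from stable_baselines3 import PPO

env = PriceSettingEnv()                 # skeleton env from Section 4.1
model = PPO("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=10_000)     # one step = one simulated day
```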

Federated Learning
In order to learn a shared model between multiple microgrids with decentralized data, we turn to federated learning. McMahan et al. [23] presented what is now the most popular federated learning scheme: Federated Averaging (FedAvg). FedAvg is simple to implement. Denote the parameters of the policy for microgrid $i$ at timestep $t$ as $\theta_{i,t}$. All the $\theta_{i,0}$ are initialized with the same weights, so $\theta_{1,0} = \theta_{2,0} = \dots = \theta_0$. Each policy then trains on its own microgrid for $k$ local steps, producing a new $\theta'_i$ for each microgrid that has adapted to be better at price-setting in microgrid $i$ than the original $\theta_0$. All the $\theta'_i$ are transmitted back to a central server, which computes the shared model for the next iteration by averaging all the $\theta'_i$.

The local models then train on their own, send their trained models back to the central server, and the process repeats. Only the parameters $\theta'_i$ are communicated with the central server, never any data. Note that in our setup, every client participates in the weight exchange process, not just a sampled subset of the clients. While FedAvg is a simple algorithm that performs well in supervised learning, it learns a single global policy for all the price-setting agents. In our case, a global model is not ideal, as microgrids may have different energy consumption/supply behaviors.
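One FedAvg synchronization round can be sketched as follows, assuming a hypothetical client interface that exposes its flattened policy parameters and a local training routine:

```python
# Hypothetical sketch of one FedAvg round; each client exposes get/set of a
# flat parameter vector and a local PPO training routine on its own microgrid.
import numpy as np

def fedavg_round(shared_params, clients, k_local_steps):
    updated = []
    for client in clients:
        client.set_params(shared_params)     # broadcast current shared model
        client.train(k_local_steps)          # k local steps on the client's own data
        updated.append(client.get_params())  # only parameters leave the client
    return np.mean(updated, axis=0)          # average into the next shared model
```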

Hypernetworks for Personalized Federated Learning (PFH)
To learn a shared model that can still personalize to individual microgrids, we turn to hypernetworks for personalized federated learning (PFH) [32]. PFH has found great success in supervised learning, beating FedAvg and personalized federated learning approaches based on meta-learning [11], Moreau envelopes [37], and personalization layers [6]. However, personalized federated learning has never been used before for RL. We now describe PFH more formally; please refer to Fig. 1 for a visual depiction of the algorithm, and to Algorithm 1 for pseudocode. Consider again $\theta_{i,t} \in \mathbb{R}^d$ as the $d$-dimensional parameter vector of the policy for microgrid $i$ at timestep $t$. A hypernetwork is a neural network that outputs the parameters of another neural network. We have one global hypernetwork $h: \mathbb{R}^e \to \mathbb{R}^d$ parameterized by $\varphi$. $h$ takes as input an environment embedding vector $\vec{v}_i \in \mathbb{R}^e$, which is learned for each environment along with the hypernetwork. We initialize $\theta_{i,0} = h_{\varphi_0}(\vec{v}_i)$ for all $i \in [n]$. Since the hypernetwork outputs neural networks conditioned on the environment, it creates RL agents that are personalized to each microgrid. This is a federated learning algorithm, as only parameters are communicated with the central server, never data.
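A condensed PyTorch sketch of one PFH round follows (after Shamsian et al. [32], adapted to our notation). The hypernetwork and embedding sizes, and the `local_train` routine (which would run $k$ PPO steps on the client and return the trained flat weights), are assumptions for illustration.

```python
# Sketch of one PFH round: the server generates personalized policy weights,
# the client trains them locally, and the resulting weight delta is pushed
# back through the hypernetwork via the chain rule. Sizes are illustrative.
import torch
import torch.nn as nn

n_clients, e_dim, d_dim = 20, 8, 4500           # embeddings in R^e, policies in R^d
hnet = nn.Sequential(nn.Linear(e_dim, 128), nn.ReLU(),
                     nn.Linear(128, d_dim))      # global component h_phi
emb = nn.Embedding(n_clients, e_dim)             # learned environment embeddings v_i
opt = torch.optim.Adam(list(hnet.parameters()) + list(emb.parameters()), lr=1e-3)

def pfh_round(i, local_train):
    theta = hnet(emb(torch.tensor(i)))           # theta_i = h_phi(v_i)
    theta_trained = local_train(theta.detach())  # k local PPO steps on microgrid i
    delta = theta_trained - theta.detach()       # proxy for the local policy update
    opt.zero_grad()
    theta.backward(gradient=-delta)              # chain rule into phi and v_i
    opt.step()                                   # moves h_phi(v_i) toward theta_trained
```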

Diversity and optimal use of PFH
One factor that could affect the relative performance of PFH is the heterogeneity of the scenario. A homogeneous scenario (imagine a cookie-cutter residential neighborhood) could be well suited to federated learning methods due to the similarity in behavior. In contrast, an extremely heterogeneous scenario (imagine mixed-use city blocks with nightlife, shopping, and residential real estate) could have wildly different energy demands, which may be better learned by individual local networks without any mechanism to share learning. We hypothesize that PFH will perform best somewhere between these two extremes: if local environments are diverse yet share similar underlying mechanisms, PFH will be able to fit local conditions while sharing information on common trends.

EXPERIMENTAL SETUP

Simulating Diverse Microgrids
Because each microgrid is defined by a distribution of photovoltaic and battery sizes, we propose a simple way to tweak the amount of diversity in a system. We sample photovoltaic and battery sizes from normal distributions, changing the variance $\sigma^2$ as the diversity parameter, and round outcomes to the nearest integer. As we are sampling from a "hyper" distribution to instantiate buildings, the means of the distribution are not as important as the variances in instantiating diversity. We sample from $\mathcal{N}(\mu = 100, \sigma = 10)$ for low diversity cases, $\mathcal{N}(100, 30)$ for medium diversity, and $\mathcal{N}(100, 50)$ for high. We chose the low, medium, and high cases such that 95% of samples (i.e., 2 standard deviations around the mean) in the high case hit realistic bounds in the environment: 0 (an obvious lower bound) and 200. (200 is a realistic upper bound for both solar panels and batteries: 200 solar panels would require an area of 60 x 70 ft, which bounds the square footage of many commercial roofs, and 200 batteries would be a realistic upper bound for entities not engaging in commercial grid services.) We report a 95% confidence interval over 5 trials for each experimental result.
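A small sketch of how a diverse microgrid could be instantiated under these distributions; the clipping to the environment's bounds of [0, 200] is our assumption about how out-of-range samples would be handled.

```python
# Hypothetical instantiation of one microgrid's buildings at a given
# diversity level; sigma values follow the distributions described above.
import numpy as np

SIGMA = {"low": 10, "medium": 30, "high": 50}

def sample_microgrid(n_buildings, level, rng):
    pv = rng.normal(100, SIGMA[level], n_buildings)     # photovoltaic sizes
    batt = rng.normal(100, SIGMA[level], n_buildings)   # battery sizes
    # round to integers and clip to the environment's realistic bounds
    clipped = np.clip(np.rint([pv, batt]), 0, 200).astype(int)
    return clipped[0], clipped[1]

rng = np.random.default_rng(seed=0)
print(sample_microgrid(10, "high", rng))
```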

Baselines
We compare PFH against FedAvg and two other baselines. First, we observe what happens with no RL control at all: the microgrid aggregator outputs prices that are exactly the same as the utility's. We assume buildings choose to meet half their energy demand/surplus with the utility and half with the aggregator. Our second baseline is the approach used in Agwan et al. [3]: training all the local RL controllers with only their own data, with no central model or inter-microgrid communication. These two baselines, no RL and local control, are designed to highlight, respectively, the added value of RL to the task of price-setting for energy demand response in microgrids, and the added value of having a central model that aggregates learning across multiple microgrids. (The most common non-RL methods for microgrid price-setting are iterative pricing (IP) methods [21,39], in which buildings "bargain" with microgrids to reach equilibrium prices. We exclude these baselines because they require each building to develop its own demand forecasts, which raises the computational barrier to entry by an order of magnitude: with 10 microgrids of 10 buildings each, local RL requires training 10 models (10 microgrids), PFH and FedAvg require 11 (10 microgrids + 1 central model), and IP requires 100 (10 buildings x 10 microgrids). Agwan et al. [3] also showed RL results in less volatile pricing curves and better performance compared to IP.)

Hyperparameter Selection
We select hyperparameters for each algorithm by conducting sweeps, with candidate parameters proposed by Bayesian hyperparameter optimization [31] to maximize the mean reward across all agents.
We select hyperparameters for our local-agent baseline by conducting a hyperparameter sweep for each level of diversity, as each level has a different distribution of microgrids that each local agent must learn to optimize for. Since each RL agent is completely independent in this approach, we do not sweep over the number of microgrids at the same level of diversity, as those microgrids are created by sampling from the same distribution. Thus we run 3 (low, medium, high diversity) hyperparameter sweeps for our local RL baseline, each consisting of 50 runs under the Bayesian optimization procedure above.
For FedAvg and PFH, we run a hyperparameter sweep for each combination of diversity (low, medium, high) and microgrid count (5, 10, 20), because it is possible that the characteristics of a successful centralized learner change with these two parameters. For example, a larger central learner may perform well with more microgrids, since it is learning from more data. Thus for each algorithm we ran 3 (low, medium, high) x 3 (5, 10, 20 microgrids) = 9 hyperparameter sweeps. The sweeps for FedAvg had 50 runs, with the same Bayesian hyperparameter optimization approach as the local baseline; we used 100 runs for PFH, because it had roughly double the number of parameters to explore.

Table 1: Hyperparameter Sweep Bounds. All sweeps swept over the first 6 hyperparameters. FedAvg sweeps additionally swept over "AFL # of Local Steps". PFH sweeps additionally swept over the last 7 parameters. Due to the high dimensionality of the sweep, we used Bayesian hyperparameter optimization. We refer to four types of distributions: Uniform, Int Uniform, Log Uniform, and Int Log Uniform. The "Int" distributions simply quantize the underlying distribution (e.g., if $x$ is sampled from a uniform distribution, $\lfloor x \rfloor$ is returned by an int uniform distribution). The Log Uniform distribution samples uniformly over the log of the value.
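For clarity, the four sweep distribution types can be sketched as samplers (our reading of the Table 1 descriptions):

```python
# Sketch of the four distribution types used by the sweeps in Table 1.
import math
import random

def uniform(lo, hi):
    return random.uniform(lo, hi)

def int_uniform(lo, hi):
    return math.floor(random.uniform(lo, hi + 1))     # quantized uniform

def log_uniform(lo, hi):
    # uniform over log(value): equal probability mass per multiplicative band
    return math.exp(random.uniform(math.log(lo), math.log(hi)))

def int_log_uniform(lo, hi):
    return math.floor(log_uniform(lo, hi + 1))        # quantized log uniform
```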

To minimize the effect of outliers, we use the hyperparameters from the third-highest-performing run of each sweep. Detailed parameter bounds for the hyperparameter sweeps are in Table 1.

Multi-Task Transfer
An interesting feature of our hypernetwork-based setup is the potential for multi-task learning and few-shot transfer learning. The optimization problem of setting prices for each microgrid can be viewed as an individual task. Since the hypernetwork should learn some strategies common to all tasks, we tested whether it can generalize to unseen tasks with little training. To test this hypothesis, we simply take a hypernetwork that has trained for 10,000 days to manage a microgrid cluster with 20 microgrids of medium diversity, and train it to manage a new microgrid cluster of 20 microgrids with the same level of diversity. By pretraining our hypernetwork on 20 varied source tasks, we hope to encode enough knowledge applicable to the new target tasks to make few-shot transfer learning possible. We refer to such a pretrained hypernetwork as a Few-Shot PFH. A sketch of this setup follows.
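Under the PFH sketch in 4.4, the Few-Shot PFH setup amounts to carrying over the trained hypernetwork and learning fresh embeddings for the target cluster; reinitializing the embeddings (rather than reusing source-task ones) is our assumption, as the transfer only requires the shared component.

```python
# Hypothetical Few-Shot PFH setup: keep the pretrained hypernetwork (shared
# knowledge), give the unseen target microgrids fresh embeddings, and resume
# the usual PFH rounds on the target cluster.
import torch
import torch.nn as nn

def few_shot_pfh(pretrained_hnet, n_target_clients, e_dim=8, lr=1e-3):
    target_emb = nn.Embedding(n_target_clients, e_dim)   # new v_i per target task
    opt = torch.optim.Adam(list(pretrained_hnet.parameters()) +
                           list(target_emb.parameters()), lr=lr)
    return pretrained_hnet, target_emb, opt              # then run pfh_round(...) as before
```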
RESULTS AND DISCUSSION

PFH accelerates learning in medium diversity microgrid clusters

Fig. 2 shows the average daily profit gained by each microgrid in a microgrid cluster with 5, 10, and 20 microgrids, with varying amounts of diversity. The middle column of Fig. 2 shows that PFH is more efficient and profitable for a microgrid cluster than FedAvg or a local control scheme. As shown in Table 2, PFH results in up to $8,500,000 of additional cumulative profit after 10,000 days over the local control baseline in a microgrid cluster with 20 microgrids. However, this advantage does not carry over to cases of low or high diversity. For less diverse scenarios, PFH was comparable to or less profitable than FedAvg or local control. For more diverse scenarios, local control was generally more profitable. The number of microgrids in a cluster also did not seem to have much effect on learning speed here.

FedAvg recovers local performance at best
Curiously, our results indicate that FedAvg did not improve the management of a microgrid cluster over a collection of local agents. We had expected FedAvg to perform better in the homogeneous case and to scale with the number of agents, but neither effect appears in our results. Although FedAvg may perform well in supervised learning [23], it may not extend well to RL. We explain FedAvg's poor performance as follows: unlike supervised learning, RL requires exploration and already suffers from non-IID data. When aggregating learning across different heterogeneous environments, this issue of learning from non-IID data may be exacerbated, slowing down learning. Furthermore, the federated setting may make the RL algorithm more sensitive to hyperparameters, as the set of hyperparameters that works for all tasks is likely smaller than the set that works for any one task. We conducted an extensive hyperparameter sweep, documented in 5.3, to account for this issue. Meanwhile, the hypernetwork is able to learn how to build RL policies that are less sensitive to these hyperparameters, because it outputs agents personalized to each task.

Few-shot learning capability scales with microgrid cluster size
Fig. 3.A shows the hypernetwork adapted to a new set of microgrid management tasks extremely quickly. On average, within ≈1.5 months (42 days), each new microgrid achieved ≈$380 in daily profit, which is about the daily profit of the local agents baseline after 13 years (5,000 days) of training. The original, randomly initialized PFH required 3,000 days to achieve similar performance. Thus, Few-Shot PFH achieved a 119x speedup over local agents and a 71.4x speedup over a randomly initialized PFH over the first 1.5 months. Within 7 months (210 days), Few-Shot PFH achieved a daily profit of $565: 44% higher profit than the local agents ever achieve. A randomly initialized PFH required ≈22 years (8,000 days) to achieve similar performance: a 38x speedup in the first 7 months of training. Cumulatively, a PFH pretrained on 20 microgrids saves ≈$1,500,000 over the course of training on the new microgrid management tasks compared to a randomly initialized PFH.

When we tried the same experiment with hypernetworks trained for 10,000 days on 5 and on 10 microgrid management tasks (Fig. 3.B), tested on 5 and 10 new tasks respectively, we saw significantly smaller boosts in mean reward over the groups of new tasks. The smaller benefit was expected given a multi-task learning strategy with fewer source tasks and data. Indeed, when trained on 5 tasks, there was hardly any initial training speedup. Starting from 10 tasks, we observed a large initial boost (although not as large as with 20). Rather strikingly, Few-Shot PFH pretrained on 5 and 10 tasks converged to lower reward curves than even the baseline PFH (i.e., a randomly initialized hypernetwork). With 20 tasks, we saw both a large initial boost in training speed and no adverse impact on long-term training. We hypothesize that the fewer microgrid source tasks provided, the more information is stored in the environment embedding, which makes the hypernetwork brittle to new environments. Thus in the 5- and 10-task cases, the network has not learned enough shared dynamics in its other parameters to generalize to new settings; with 20 tasks and above, we expect that enough shared dynamics are learned that the network can generalize. The range of training speed benefits we observed suggests that a Few-Shot PFH's potential to quickly adapt to new tasks depends on how many tasks it was initially trained on.

Diversity is a major indicator of PFH success
Our original hypothesis in 4.5 was that PFH would perform best with medium diversity: there are still shared dynamics between tasks for a shared model to take advantage of, but enough diversity that there is value in learning personalized models. Likewise, we hypothesized that FedAvg would work well with low diversity, since the tasks of managing similar microgrids should be similar to each other. We saw in Fig. 2 that our hypothesis about PFH appears correct: PFH outperforms the other algorithms when microgrids have medium diversity, no matter the number of microgrids, while it is competitive with the local agents baseline in cases of low and high diversity. FedAvg performed at best competitively with PFH and the local agents baseline in all scenarios, and even seemed to drop in performance as the number of agents increased. It is possible that although FedAvg performs well in supervised learning [23], it does not extend well to RL because RL has an aspect of exploration: one could imagine each microgrid's local agent exploring in a different direction, causing gradient updates that are not well conditioned when simply averaged together. The performance discrepancy is consistent with a recent finding [7] that FedAvg has trouble learning causal relationships even in supervised learning, despite good overall performance. We suggest that a more intelligent design for consolidating gradient updates is needed, such as the domain-aware data augmentation approach used in [7], or the hypernetwork when we lack domain knowledge that can be used for data augmentation. Because the hypernetwork outputs agents that are personalized to each task, it is able to learn from the aggregated gradients how to build RL policies that have different exploration behaviors.

LIMITATIONS AND FUTURE WORK
Our work is limited in several technical ways. We presented a "goldilocks" zone in which PFH outperforms other methods, but as we tested only in simulation, it is unclear where this goldilocks zone would appear in the real world. Second, we protect privacy by only communicating parameters, but it is possible to reconstruct data from parameters for some models [9]. We would like to address these two issues in future work.
Future Work

Generalization to Other Environments. First, we would like to explore other environments to determine whether the "goldilocks" phenomenon is unique to the MicrogridLearn environment. Our hypothesis is that PFH performs best with clients of "medium" diversity, and we have demonstrated this for the MicrogridLearn environment. However, it remains an open problem to ascertain where this zone of "medium" diversity lies for different tasks.

Improve Privacy Further. We would like to combine our PFH training procedure with differential privacy measures like those in Abadi et al. [1] to further impede reconstruction of training data. For example, we could train our hypernetwork with differentially private stochastic gradient descent, which clips and adds noise to the gradients computed at each step of training, to ensure that an attacker who intercepted a gradient step would not be able to reconstruct the distribution of training examples that created it. We also wish to further investigate the "cost" of privacy in terms of the negative impact this constraint may have on training time, and thus on cumulative aggregator profit. It is difficult to estimate the performance impact of bringing federated learning to diverse RL-controlled microgrids without additional experiments, as works examining federated learning have concluded that the performance impact is highly dependent on the diversity of the client data. In supervised learning, [23] showed that FedAvg performs about as well as a centralized model (with no performance guarantees) on the CIFAR-10 image classification task when the data distributions on each client are similar (IID). However, [49] showed that as client data distributions drift further apart (increasing diversity), the performance of FedAvg decreases drastically, reducing accuracy by up to 55% on highly diverse clients. For a true apples-to-apples comparison, we would need additional experiments with a centralized MARL algorithm, such as MAAC [15], in our microgrid demand response setting.

Vertical Hierarchical Integration. In the future, PFH may enable further exploitation of the hierarchical nature of price-setting for energy demand response. The energy grid can be imagined as a hierarchical tree, with buildings responding to energy prices set by microgrids, which respond to prices set by city utilities, which respond to prices from state utilities, etc. We may even have IoT devices adjusting demand to energy prices set at the building level. At any level of the energy grid, the task is the same: set prices for the agents beneath you to elicit a demand response. In this work we looked at one level of this hierarchy, but our methods could be applied to other layers as well. One could imagine a hypernetwork that learns from price-setting agents at every level of the hierarchy and can be used to rapidly initialize agents to manage any new entrants to the energy grid.

DISCUSSION OF SOCIETAL IMPACT
What are the potential negative societal effects of our work? Overall, negative effects on prosumers are limited, as the focus of our work is protecting consumer information. Furthermore, prior work demonstrated that the presence of an aggregator consistently reduced energy costs for consumers.
The act of setting prices in such systems may raise fairness concerns. If the initial training microgrids are biased towards wealthier residents, the PFH may initialize new policies with pricing that benefits the consumption habits of wealthier clients but not poorer clients. A vivid illustration may be seen in the types of prosumers best poised to benefit from economic aggregation: prosumers with large solar panels and batteries can shield themselves from, or profit off of, high prices by consuming their own energy, and may fully charge their batteries when prices are low. Prosumers with smaller or no storage do not have this luxury, and are thus more vulnerable to the negative effects of price fluctuation.

CONCLUSION
We seek a federated mechanism that improves training speed for profit-driven energy aggregation in a microgrid cluster while taking steps toward privacy preservation. To this end, we are the first to demonstrate PFH for RL, training a hypernetwork from local models' gradient updates, and show improved training times. We hypothesize that PFH shines when the setting is diverse enough to differ meaningfully between systems, but not so diverse that system behavior diverges. We confirm our hypothesis and demonstrate the efficacy of PFH for few-shot learning.

Figure 1: Microgrids and PFH: A. We imagine a prosumer that can, at each hour of the day, choose to sell energy surplus to, or purchase unmet energy demand from, the larger utility or the microgrid aggregator. The microgrid aggregator's energy buy/sell prices are determined by an RL controller. B. A Hypernetwork for Personalized Federated Learning (PFH) receives gradient updates from RL controllers and sends back weights. C. The hypernetwork takes as input an environment embedding vector and outputs weights for an RL controller. The RL agent takes as input buy/sell prices from the utility and outputs buy/sell prices to the buildings in the microgrid it manages. The RL agent sends back a gradient update to the hypernetwork, which uses the update to compute the gradient update for the hypernetwork's own weights.

Figure 2: RL Agent Performance. The performance of the RL price-setting agent as a function of the number and diversity of the microgrids in the microgrid cluster. Performance is measured by the average daily profit gained by each microgrid. We report 95% confidence intervals over 5 trials, and smoothed moving averages to make trends clearer.


Figure 3: PFH Enables Few-Shot Learning. We report 95% confidence intervals over 5 trials, and smoothed moving averages to make trends clearer. A. Mean microgrid profit of PFH pretrained on 20 microgrids learning to manage 20 new microgrids ("Pretrained PFH"), compared to a randomly initialized PFH ("Baseline PFH") and the local agents baseline ("Local Baseline"), over training days on the new microgrids. B. Mean microgrid profit of PFH pretrained on 5, 10, and 20 microgrids on a new set of microgrids, over a longer time horizon than A. C. A plausible scenario in which PFH may need to quickly adapt to new microgrids.

Table 2: Cumulative profits above base utility pricing after 10,000 days, in hundreds of thousands of dollars.