Heterogeneous Multi-Agent Reinforcement Learning for Grid-Interactive Communities

Homogeneous Multi-Agent Reinforcement Learning (MARL) is well studied in games, robotics, and simulation. What has not been fully explored is the effectiveness of heterogeneous MARL in the building domain. Heterogeneous MARL has been shown to outperform homogeneous MARL in games. It also offers a more realistic simulation, because no two buildings can be expected to react in the same way. Here, we integrate the MARLlib library with the CityLearn environment to analyze the benefits of heterogeneous MARL and compare heterogeneous with homogeneous agents in a small-scale proof of concept.


INTRODUCTION
Multi-Agent Reinforcement Learning (MARL) has seen significant advancements in recent years, with applications spanning various domains, from gaming [10] to driving [8]. MARL involves multiple learning agents interacting with an environment and with each other, learning from their experiences to optimize a collective goal. The complexity of MARL arises from the interactions between agents. There are two types of policies that can be applied in MARL: homogeneous and heterogeneous.
We define homogeneous and heterogeneous agent policies in the same way as Kuba et al. [4]. Homogeneous agent policies share the same action space and policy parameters, whereas heterogeneous agent policies do not [4]. Because they share an action space and policy parameters, homogeneous policies are less applicable to real-life situations: they assume that all agents can view and understand the same action space [4]. Heterogeneous policies are better aligned with real-life scenarios, in which each agent acts with a different degree of ability.
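The distinction can be illustrated with a minimal sketch. The class `LinearPolicy`, the helper `build_policies`, and the agent names are hypothetical and exist only for this illustration; a toy linear map stands in for a neural network policy, and identical observation sizes stand in for the shared spaces that parameter sharing requires.

```python
import random


class LinearPolicy:
    """Toy deterministic policy: action = clip(w . obs) to [-1, 1]."""

    def __init__(self, obs_dim, seed):
        rng = random.Random(seed)
        self.weights = [rng.uniform(-1.0, 1.0) for _ in range(obs_dim)]

    def act(self, obs):
        raw = sum(w * o for w, o in zip(self.weights, obs))
        return max(-1.0, min(1.0, raw))


def build_policies(agent_obs_dims, shared):
    """Map each agent id to a policy object.

    shared=True  -> homogeneous: every agent references ONE parameter set.
    shared=False -> heterogeneous: every agent owns its own parameters.
    """
    if shared:
        dims = set(agent_obs_dims.values())
        if len(dims) != 1:
            # One parameter set can only serve agents with identical interfaces.
            raise ValueError("parameter sharing needs identical observation sizes")
        policy = LinearPolicy(dims.pop(), seed=0)
        return {agent: policy for agent in agent_obs_dims}
    return {
        agent: LinearPolicy(dim, seed=i)
        for i, (agent, dim) in enumerate(agent_obs_dims.items())
    }


# Homogeneous: both buildings point at the same object, so an update to one
# policy is an update to both.
homogeneous = build_policies({"building_1": 3, "building_2": 3}, shared=True)

# Heterogeneous: the buildings may even observe different numbers of features.
heterogeneous = build_policies({"building_1": 3, "building_2": 4}, shared=False)
```

In the homogeneous case the two dictionary entries are the same object, which is why such a policy cannot specialize to an individual building.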
In this paper, we investigate the performance of heterogeneous and homogeneous MARL agents for battery energy storage system (BESS) control in the CityLearn Gym environment, which is designed for benchmarking control in demand response [11]. This application presents a unique set of challenges and opportunities, as the agents must balance the need for energy conservation with the demands of the grid and the building occupants.

METHODOLOGY
We used MARLlib as the framework to run all of the experiments. MARLlib is a Multi-Agent Reinforcement Learning library based on Ray and one of its toolkits, RLlib [3]. It is meant to provide convenient environment wrappers, agent-level algorithm implementations, and flexible mapping strategies. For heterogeneous agents, we used the implementations of HAPPO [4] and HATRPO [4]; for homogeneous agents, we used MAPPO [13] and MATRPO [7]. All four were taken as provided by MARLlib, which we chose because it was the most recent and best-supported framework for heterogeneous policies.
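To make the difference between the two algorithm families concrete, the sketch below illustrates, for a single sampled transition, the sequential update scheme described for HAPPO in [4]: agents are updated one at a time in a random order, and each agent's clipped objective uses the joint advantage scaled by the probability ratios of the agents updated before it. This is a simplified reading for illustration, not MARLlib's implementation; the function names, agent names, and numeric values are hypothetical.

```python
import random


def clipped_surrogate(ratio, weighted_adv, eps=0.2):
    """PPO-style clipped objective for one sample."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * weighted_adv, clipped * weighted_adv)


def sequential_weights(advantage, new_over_old, rng):
    """Return the update order and the advantage weight each agent sees.

    new_over_old[agent] is that agent's ratio pi_new(a|s) / pi_old(a|s)
    AFTER its own update (assumed given here; in training it is produced
    by the optimisation step itself).
    """
    order = list(new_over_old)
    rng.shuffle(order)  # a fresh random permutation of agents per update
    weight = advantage
    seen = {}
    for agent in order:
        seen[agent] = weight            # what this agent optimises against
        weight *= new_over_old[agent]   # passed on to the next agent
    return order, seen


ratios = {"building_1": 1.1, "building_2": 0.9}
order, weights = sequential_weights(2.0, ratios, random.Random(0))
```

The first agent in the permutation sees the plain joint advantage, and every later agent sees it rescaled by its predecessors' ratios. In MAPPO-style training with shared parameters, by contrast, all agents are updated together against the same advantage estimate.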
We adapted the CityLearn environment to the MARLlib framework to simulate the buildings. CityLearn is an OpenAI Gym environment that was created for benchmarking rule-based control, model predictive control, and reinforcement learning control algorithms in demand response studies [9]. It includes building, electric heater, heat pump, thermal energy storage, battery energy storage system, and photovoltaic energy models, which contribute to its observation space. The environment is needed to run simulations on the dataset. This work is concerned only with the control of electrical storage.
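One way such an adaptation can look is sketched below. It assumes a decentralized environment that returns one list entry per building and converts it to the dictionary interface keyed by agent id that RLlib-style multi-agent environments use, including the `"__all__"` key in the done dictionary. `StubBatteryEnv` is a hypothetical stand-in, not the CityLearn API, and `DictMultiAgentWrapper` is not the wrapper used in this work.

```python
class StubBatteryEnv:
    """Hypothetical stand-in for a per-building (list-based) environment."""

    def __init__(self, n_buildings=2, horizon=3):
        self.n_buildings = n_buildings
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        # One observation per building: [time step, battery state of charge]
        return [[0.0, 0.5] for _ in range(self.n_buildings)]

    def step(self, actions):
        self.t += 1
        obs = [[float(self.t), 0.5 + 0.1 * a[0]] for a in actions]
        rewards = [-abs(a[0]) for a in actions]  # placeholder cost signal
        done = self.t >= self.horizon
        return obs, rewards, done, {}


class DictMultiAgentWrapper:
    """Expose a list-based environment through per-agent dictionaries."""

    def __init__(self, env, agent_ids):
        self.env = env
        self.agent_ids = list(agent_ids)

    def reset(self):
        return dict(zip(self.agent_ids, self.env.reset()))

    def step(self, action_dict):
        # Keep a fixed agent order so actions reach the right building.
        actions = [action_dict[agent] for agent in self.agent_ids]
        obs, rewards, done, info = self.env.step(actions)
        dones = {agent: done for agent in self.agent_ids}
        dones["__all__"] = done  # episode-level flag
        return (
            dict(zip(self.agent_ids, obs)),
            dict(zip(self.agent_ids, rewards)),
            dones,
            {agent: info for agent in self.agent_ids},
        )


env = DictMultiAgentWrapper(StubBatteryEnv(), ["building_1", "building_2"])
first_obs = env.reset()
obs, rewards, dones, infos = env.step(
    {"building_1": [0.5], "building_2": [-1.0]}
)
```

Keying everything by agent id lets the training framework route each building's observation to either a shared policy or that building's own policy without changing the environment.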
For the implementation, we use a real-life dataset. It is a one-year time series from 17 zero-net-energy buildings that covers the period from August 1, 2016 to July 31, 2017. This dataset was used in the NeurIPS 2022 CityLearn Challenge [1], the third edition of The CityLearn Challenge [5,12]. Details on the implementation and definitions can be found in [6]. We have chosen to isolate two of these buildings (Building 1 and Building 2) to benchmark performance efficiently.
The experiment was run with each of the listed algorithms to (a) test the MARL implementation in CityLearn and (b) evaluate the effectiveness of heterogeneous over homogeneous policies.

RESULTS
Our preliminary results are shown in Figs. 1-3. In Fig. 1, the average reward is lowest (best) for the heterogeneous HATRPO algorithm, although the homogeneous MATRPO appears to be second best. It therefore appears that the difference in performance cannot be attributed solely to the agent type, and more in-depth analysis is required to understand the differences between the algorithms. The loss functions shown in Figs. 2 and 3 confirm the same tendencies and are also qualitatively similar to our prior work [2]. Overall, we demonstrate successful integration of MARLlib with CityLearn.
[Figure axes: Timestep; Reward (negative is better)]