TUNEOPT: An Evolutionary Reinforcement Learning HVAC Controller For Energy-Comfort Optimization Tuning

Heating, ventilation, and air conditioning (HVAC) systems account for the majority of energy consumption in buildings. Efficient control of HVAC systems can reduce energy consumption and enhance occupants' comfort. In the existing literature, energy-comfort or cost-comfort co-optimization frameworks commonly rely on an expert to manually tune the coefficient that balances energy and comfort. Achieving the optimal balance between energy usage and occupant comfort nevertheless remains challenging, and this reliance on manual tuning restricts how well a given formulation generalizes across scenarios and environments. In this paper, we propose an implicit evolutionary Reinforcement Learning (RL) approach to learn and adapt the trade-off parameter of an energy-comfort optimization formulation. We have developed a predictive comfort-energy co-optimization formulation for controlling the setpoint of a building. The RL agent utilizes a novel guidance-induced random search method to learn the energy-comfort trade-off coefficient and guide the optimization formulation. The reward function of the RL model is energy productivity (comfort over energy consumption). To evaluate the feasibility of our proposed approach, we conducted experiments on a real-world testbed, an apartment unit. Our feasibility study shows that the proposed approach can learn an optimal control parameter and reduce energy consumption by 24.3% while decreasing comfort by only 1% compared to the baseline.


INTRODUCTION
Heating, Ventilation and Air Conditioning (HVAC) systems account for the majority of the energy use in buildings. Their control strategies seek to maintain occupants' comfort while using energy efficiently. Optimization techniques are commonly studied and used for efficient control of HVAC systems [6,10]. These approaches can encode domain-specific constraints and can handle problems with several decision variables [4]. Although these methods are well established with profound theoretical foundations, optimization formulations, once built, typically do not adapt to changing real-world conditions, such as occupants' differences and seasonal variations. This rigidity limits the flexibility of optimization approaches.
As a common approach, optimization formulations are employed to minimize energy while balancing the trade-off between an energy use index and a measure of occupants' comfort. Typically, a trade-off coefficient between the objective function terms is set manually through parameter tuning and then held fixed during operation. For instance, Kim [9] developed a model predictive controller to operate HVAC systems by considering individual thermal preferences. The model used a constant weighting coefficient to balance energy cost and thermal discomfort. It was shown that different values of this coefficient could shift the energy-comfort trade-off, which could be leveraged for demand response. Research has demonstrated that the trade-off coefficient directly impacts controller performance by favoring either energy or comfort. However, effective strategies for configuring such weight coefficients remain an open question. Typically, they are tuned by experts to ensure the resulting controller achieves high energy efficiency with limited impact on occupants' comfort.
On the other hand, Reinforcement Learning (RL)-based approaches have been promising due to their ability to handle uncertainty and to continuously adapt to changing conditions. Examples of RL applications in HVAC control can be found in [3,5]. The main themes of RL-based controllers for HVAC systems in the literature center around RL models for shaving peak energy demand, utilizing the passive thermal storage of buildings, and learning building thermodynamic behavior through interactions with the environment [8,13]. The most common RL approaches use Q-learning or actor-critic-based methods to learn the optimal policy [13].
Unlike previous RL frameworks, in this paper, we propose a novel evolutionary search (ES) algorithm with a guidance function based on state-action-trajectory data, which only accesses the environment through interactive samples (reward, states, etc.). The proposed approach, referred to as TUNEOPT (TUNE-OPTimization), combines the strengths of optimization-based and RL-based approaches by using an RL agent to adaptively learn the parameters of an optimization model. Hence, instead of using the RL agent directly for taking actions, we use it to learn the parameters of an optimization model, while the control actions are taken by the optimization model. At its core, TUNEOPT leverages a predictive optimization formulation with the objective of minimizing energy consumption and maximizing occupant comfort while considering the HVAC system constraints. The RL agent guides the optimization formulation to maximize an energy productivity measure (comfort over energy). The proposed approach has been tested on a real-world apartment unit, and the results are compared against a baseline controller.

METHODOLOGY
Figure 1 shows the TUNEOPT framework. In this framework, the RL agent adapts the value of the trade-off coefficient to optimize operations and maximize the reward function, which is set as comfort over energy. In the optimization formulation, this coefficient serves as the balancing factor that dictates the trade-off between energy consumption and user comfort. The optimization formulation outputs its actions, the optimal change of setpoint for the next time step, and communicates them to the environment through a thermostat. In the following subsections, we delve into the details of the optimization formulation and the RL agent.

HVAC Controller
A model predictive controller has been developed to co-optimize energy and comfort for a single-zone apartment, as shown in Equation 1 and Equation 2. The first term of the objective is energy consumption, which is a function of temperature and the change of setpoint over the upcoming time steps of the prediction horizon. Energy consumption is estimated using a multivariate regression model explained in Section 2.1.1. The second term pertains to thermal comfort, which is a function of indoor temperature. We employ probabilistic personalized comfort models to accommodate individual thermal preferences, as these models offer accuracy at the individual level. Figure 2 illustrates an example of a comfort profile used in this study, representing single occupancy. The corresponding occupant experiences 100% comfort around 73.9 °F (23.3 °C) and 50% comfort within the range of 70.6 °F (21.4 °C) to 77.1 °F (25.1 °C). More details of personalized comfort modeling can be found in [7]. Both terms are normalized by their respective normalization constants. The optimization is solved as a linear programming problem after linearizing the objective function. The trade-off coefficient governs the priority of comfort over energy, while the threshold in Equation 2 denotes the minimum probability of comfort that the controller must maintain. Although this formulation is for a single-zone environment, it can be expanded to multi-zone environments by changing the scalars to vectors. The predictive models estimate temperature (Equation 3) and energy consumption (Equation 5).
In this formulation, the temperature variables represent the zone temperature at the current time step and the temperature change from the current time step to the next. Similarly, the setpoint variables denote the setpoint and the change of setpoint over the same interval. The disturbance vector includes outdoor temperature and occupancy flags. The scalar coefficients are calculated through multivariate regression analysis.
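The regression step described above can be illustrated with a minimal sketch. It assumes a plain linear model fitted by ordinary least squares; the feature layout (zone temperature, change of setpoint, outdoor temperature, occupancy flag), the synthetic data, and the function names are hypothetical and are not taken from Equation 3 or Equation 5.

```python
import numpy as np


def fit_linear_model(features, targets):
    """Fit targets ~ features @ coefficients + intercept by least squares."""
    design = np.column_stack([features, np.ones(len(features))])
    solution, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return solution[:-1], solution[-1]


def predict(features, coefficients, intercept):
    """Evaluate the fitted linear model on new feature rows."""
    return np.asarray(features) @ coefficients + intercept


# Synthetic, noise-free data (hypothetical values, for illustration only).
rng = np.random.default_rng(0)
n_samples = 200
zone_temp = rng.uniform(70.0, 77.0, n_samples)       # degrees F
setpoint_change = rng.uniform(-2.0, 2.0, n_samples)  # degrees F
outdoor_temp = rng.uniform(65.0, 85.0, n_samples)    # degrees F
occupied = rng.integers(0, 2, n_samples).astype(float)

features = np.column_stack([zone_temp, setpoint_change, outdoor_temp, occupied])
true_coefficients = np.array([-0.05, 0.6, 0.04, 0.1])
temp_change = features @ true_coefficients + 0.5

coefficients, intercept = fit_linear_model(features, temp_change)
```

On noise-free data the fit recovers the generating coefficients; with measured data the residual error would be reported with a metric such as the mean absolute error.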

Reinforcement Learning agent
The RL agent is designed to tune the trade-off coefficient to optimize the energy-comfort trade-off. The reward function used to train the RL agent is energy productivity, shown in Equation 7, which quantifies the comfort achieved per unit of energy consumed.
To this end, the RL agent makes a sequential decision on the coefficient value every day such that the reward is maximized.
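As a rough illustration of an energy productivity reward, the sketch below divides the mean comfort probability over a day by the total energy consumed that day. The aggregation (mean comfort, summed energy) and the function name are assumptions made for this example; the exact form is the one given in Equation 7.

```python
def energy_productivity(comfort_values, energy_values):
    """Comfort achieved per unit of energy over one evaluation period.

    comfort_values: comfort probabilities in [0, 1], one per timestep.
    energy_values: energy consumed per timestep (e.g., kWh).
    """
    if not comfort_values:
        raise ValueError("at least one comfort value is required")
    total_energy = sum(energy_values)
    if total_energy <= 0:
        raise ValueError("total energy must be positive")
    mean_comfort = sum(comfort_values) / len(comfort_values)
    return mean_comfort / total_energy


# Hypothetical two-timestep day: mean comfort 0.95 over 2.0 units of energy.
daily_reward = energy_productivity([0.9, 1.0], [1.0, 1.0])
```

A higher value means more comfort was delivered for the same energy, which is the quantity the agent tries to increase from one day to the next.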

Guided evolutionary search for parameters
The proposed evolutionary RL algorithm is shown in Algorithm 1. First, we start with a set of candidate values for the trade-off coefficient, sampled from an initial probability distribution. This distribution can be determined from expert knowledge, e.g., a normal distribution for positive parameters, with mean and variance based on historical parameter values. Parameter candidates and iterations are indexed separately. We evaluate each parameter candidate on the environment and observe a noisy reward. We then apply a predefined guidance function to each candidate parameter. The guidance function generates a new distribution for each of the parameters such that the mean of each distribution moves towards the best parameters observed in the current iteration. In this work, the guidance function takes the mean of some of the best parameters observed in the current iteration. We guide the search with this guidance function, which depends on a guidance factor, and obtain new distributions for each candidate. A weighted sum of these distributions, weighted by how well the corresponding parameters performed on the environment, then gives the distribution for the next iteration. See Figure 3 for a visualization of the algorithm. The algorithm stops when the improvement of the reward function falls below a certain threshold.
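The loop described above can be sketched as follows. This is a simplified illustration, not Algorithm 1 itself: it assumes a one-dimensional Gaussian search distribution with a fixed standard deviation, and it collapses the weighted sum of candidate distributions into a single Gaussian whose mean is the reward-weighted average of the guided means. The function name, argument names, default values, and the toy reward are all hypothetical.

```python
import numpy as np


def guided_search(evaluate, mean, std, n_candidates=3, n_elite=2,
                  guidance_factor=1.0, tolerance=1e-3, max_iterations=50,
                  seed=0):
    """Search for the parameter value that maximizes a (possibly noisy) reward."""
    rng = np.random.default_rng(seed)
    best_param, best_reward = float(mean), -np.inf
    previous_best = -np.inf
    for _ in range(max_iterations):
        # Sample candidates from the current distribution and evaluate them.
        candidates = rng.normal(mean, std, size=n_candidates)
        rewards = np.array([evaluate(float(c)) for c in candidates])

        # Guidance target: mean of the best-performing candidates.
        elite = candidates[np.argsort(rewards)[-n_elite:]]
        target = elite.mean()

        # Move each candidate's distribution mean toward the target.
        guided_means = candidates + guidance_factor * (target - candidates)

        # Combine the guided means, weighting by (shifted, non-negative) reward.
        weights = rewards - rewards.min() + 1e-12
        weights = weights / weights.sum()
        mean = float(weights @ guided_means)

        # Remember the best candidate evaluated so far.
        top = int(np.argmax(rewards))
        if rewards[top] > best_reward:
            best_param, best_reward = float(candidates[top]), float(rewards[top])

        # Stop once the best reward of an iteration no longer improves enough.
        if rewards[top] - previous_best < tolerance:
            break
        previous_best = rewards[top]
    return best_param, best_reward


def toy_reward(value):
    """Hypothetical stand-in for a day of operation; peaks at value = 2."""
    return -(value - 2.0) ** 2


best_param, best_reward = guided_search(toy_reward, mean=0.0, std=1.0)
```

In a deployment, `evaluate` would correspond to running the controller with a candidate coefficient and measuring the resulting reward, so each call is expensive and the number of candidates per iteration stays small.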

REAL-WORLD TESTBED
Our real-world testbed is a one-bedroom apartment (655 sq ft) located in Blacksburg, VA. The air conditioning (AC) unit is controlled using an Ecobee smart thermostat. The control commands are sent to the thermostat via the Ecobee API [1]. Additionally, we monitored the energy consumption of the AC system using an Emporia Smart Home Energy Monitor [12], with a sampling rate of up to 1 Hz. We used a 20-minute timestep by averaging the energy consumption data. The weather data were gathered from [14]. To generate the datasets for the predictive models, we randomly changed the setpoint between 70 °F (21.1 °C) and 77 °F (25 °C) during a three-day period. The mean absolute errors (MAE) of the trained models were 0.33 °F and 0.037 kW for the temperature model (Equation 3) and the energy model (Equation 5), respectively. To acquire future outdoor temperature, we utilized the Meteomatics Python library [11]. The thermal comfort profile used was synthetically generated utilizing real-world data [7], where 100% comfort was at approximately 73.9 °F, as shown in Figure 2. The parameters employed for the optimization formulation included  = 1 and  = 0.5. The RL parameters were set to be  = 1,   = 3, and Σ = 4   as standard deviation. We compared the performance of TUNEOPT with the common engineering practice of manual tuning [2]. To choose the baseline, we established an initial distribution for the trade-off coefficient using expert knowledge. Subsequently, we randomly selected three values of the coefficient and ran the optimization in the testbed on separate days. We then selected as the baseline the value that achieved the highest energy productivity.
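The data preparation mentioned above (averaging high-rate power readings into fixed timesteps and scoring a model with the MAE) can be sketched as below. The helper names are hypothetical, and the sketch assumes evenly spaced samples with no gaps, which real monitor data may not satisfy.

```python
import numpy as np

# At a 1 Hz sampling rate, one 20-minute timestep spans 1200 samples.
SAMPLES_PER_STEP = 20 * 60


def average_into_timesteps(samples, samples_per_step):
    """Average consecutive samples into fixed-length timesteps.

    Trailing samples that do not fill a complete timestep are dropped.
    """
    samples = np.asarray(samples, dtype=float)
    n_steps = len(samples) // samples_per_step
    trimmed = samples[: n_steps * samples_per_step]
    return trimmed.reshape(n_steps, samples_per_step).mean(axis=1)


def mean_absolute_error(measured, predicted):
    """Mean absolute difference between measurements and predictions."""
    measured = np.asarray(measured, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(measured - predicted)))


# Small illustration with two samples per step instead of 1200.
step_means = average_into_timesteps([1, 3, 2, 4, 9], 2)
error = mean_absolute_error([1, 2, 3], [1.5, 2, 2])
```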
The training process continues until day 5, when the reward no longer improves. This study shows TUNEOPT's feasibility for adaptive energy optimization. Future research should extend the experiment period, assess the baseline controller's performance across multiple days and occupancy scenarios, and test the proposed controller with various baseline controller seeds. In light of these limitations, the summary outcome of the feasibility study is shown in Table 1. The TUNEOPT controller achieves a 32.5% improvement in energy productivity and a 24.3% reduction in energy consumption, with only a marginal 1% compromise in comfort.

CONCLUSION
This paper presents TUNEOPT, an evolutionary reinforcement learning (RL) HVAC system controller designed to adapt and fine-tune an energy-comfort co-optimization controller so that it can respond to changing real-world conditions. The RL agent seeks to maximize a reward function by tuning the predictive controller. Through real-world testbed experiments on an apartment unit, we demonstrated the feasibility of TUNEOPT in learning and improving energy efficiency (reducing energy use by 24%). Future research directions include extending the experiments, assessing the baseline controller's performance over multiple days, exploring TUNEOPT's adaptability to diverse climates and seasons, and evaluating its performance in complex multi-occupancy scenarios with multi-variable tuning.

Figure 4 illustrates energy consumption, comfort, and productivity throughout the experimental period. Note that the average outdoor temperature was between 72 °F and 74 °F during the experiment and baseline selection, so we assume that weather differences have a negligible effect on the results. The algorithm begins with an initial Gaussian distribution, as proposed by an expert (step 1 in Algorithm 1). Next, three parameter candidates for the balancing coefficient are randomly sampled from this initial Gaussian distribution, and the controller is run for three days, one day for each of these candidates (step 2). In step 3, the parameter candidates are sorted based on their corresponding reward values. The mean of the top two candidate parameters is used as input to the designed guidance function. Based on the guidance function, a new Gaussian distribution is computed, and a new parameter candidate is sampled from it for the next day (day-1). At the end of day 1, the algorithm uses the reward value from day 1 along with the reward values from the baseline to estimate a new distribution and samples a new parameter value.