Economizer Optimization with Reinforcement Learning: An Industry Perspective

Building operations contribute approximately 28% of global greenhouse gas emissions according to the International Energy Agency. With cooling demand increasing due to rising global temperatures, optimizing rooftop units (RTUs) in buildings becomes crucial for reducing emissions. We focus on optimizing the economizer logic within RTUs, which balances the mix of indoor and outdoor air. By effectively utilizing outside air, RTUs can significantly decrease mechanical energy usage, leading to reduced energy costs and emissions. However, the current practice of economizer optimization relies on static guidelines set by ASHRAE, which only approximate the dynamics of individual facilities. We introduce a reinforcement learning (RL) approach that adaptively controls the economizer based on the unique characteristics of individual facilities. We have deployed our solution in the real world across a distributed building stock, and we address the scaling challenges of our cloud-based RL deployment on 10K+ RTUs across 200+ sites.


INTRODUCTION
As global temperatures continue to rise, the demand for cooling in buildings is expected to increase, especially in regions with hot climates [7]. Without energy-efficiency measures, higher cooling demand will result in even higher emissions. We focus on economizer optimization in rooftop units (RTUs), typically deployed in industrial HVAC systems. We experiment on a site with 10K+ RTUs located in a hot region. The economizer balances the indoor and outdoor air during the conditioning process. By leveraging free cold air, the economizer can significantly reduce mechanical energy usage, leading to decreased energy costs and associated emissions.
The standard practice for economizer optimization follows the guidelines set by the American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE). These guidelines rely on hard-coded setpoints based on regional climate data [5]. This standardized approach does not account for the specific conditions and dynamics of individual facilities, and therefore overlooks valuable opportunities to capitalize on free cooling based on real-time conditions. To overcome these limitations and propel energy optimization in facilities, recent advancements in reinforcement learning (RL) have emerged as promising solutions [4, 9]. We present an RL approach to create self-adapting economizer setpoints that adjust to the unique characteristics of individual facilities in real time. Additionally, these RL models are regularly updated, incorporating new data and insights to continuously enhance their performance.
Contributions: Prior works primarily focus on zonal setpoint changes (e.g., [8]); to our knowledge, we are the first to focus on economizer optimization. Further, while prior works demonstrate RL solutions in simulation [9] or in a single building [8], we deploy our solution to 200+ sites with 10K+ RTUs. To our knowledge, we are the first to discuss cloud-based deployment of RL-based control at scale. We discuss solutions to challenges across four axes in our deployment: data standardization, storage scalability, compute scalability, and human-in-the-loop control.
We open-source our code with a permissive license.

REINFORCEMENT LEARNING
In RL, an agent interacts with an environment described by a Markov Decision Process (MDP). The agent selects actions based on the current state of the environment and receives feedback in the form of the updated state and associated rewards. RL has been extensively explored for building optimization tasks [4, 9]. We use the well-established Soft Actor-Critic (SAC) algorithm [6], which combines policy-based (Actor) and value-based (Critic) RL methods. SAC utilizes a state-value function to estimate the long-term accumulated reward expected from a given state, and an action-value function to estimate cumulative rewards for state-action pairs. Additionally, SAC employs an advantage function to measure the advantage of an action compared to the average expected value at a state. The Actor's policy neural network is updated using policy gradients and maximizes rewards, while the Critic neural network is refined to better approximate the true value function. Notably, SAC introduces an entropy regularization term to encourage exploration and strike a balance between expected rewards and entropy. This inclusion makes SAC an appropriate choice, as the algorithm explores and learns efficiently in environments with stochasticity. Our experiments with SAC showed consistent improvement over the existing rule-based control, and we deemed it sufficient for deployment without further experimentation due to development costs. However, other RL algorithms can be incorporated into our framework as a drop-in replacement.
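The entropy-regularized objective can be illustrated with a small numeric sketch. The toy Q-values, action probabilities, and temperature alpha below are invented for illustration; in practice SAC learns these quantities with neural networks.

```python
import numpy as np

# Minimal numeric sketch of SAC's entropy-regularized objective: the soft
# state value V(s) = E_{a~pi}[Q(s,a) - alpha * log pi(a|s)], shown for a
# discrete toy policy. All values here are invented for illustration.
def soft_state_value(q_values, probs, alpha):
    q = np.asarray(q_values, dtype=float)
    p = np.asarray(probs, dtype=float)
    # expected Q under the policy plus an entropy bonus weighted by alpha
    return float(np.sum(p * (q - alpha * np.log(p))))

# With alpha > 0, a higher-entropy policy earns credit beyond its raw Q,
# which is what keeps the agent exploring.
greedy_v = soft_state_value([1.0, 0.0], [0.99, 0.01], alpha=0.5)
uniform_v = soft_state_value([1.0, 0.0], [0.5, 0.5], alpha=0.5)
```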

PROBLEM FORMULATION
Figure 1 shows an overview of economizer operation. The economizer reduces the cooling load on the system, and we model how much energy can be saved with free cooling using available sensor measurements. Our economizers are equipped to monitor outdoor temperature and humidity, and adjust the damper position based on a setpoint. The damper position is maintained at a minimum of 10% to ensure sufficient air exchange. Economizers effectively reduce the cooling load by minimizing the enthalpy required to reach the desired supply air temperature setpoint.
Figure 2 shows the psychrometric chart for economizer operation, which captures the relationships between temperature, humidity, and enthalpy. The neural network in our RL agent learns these relationships and optimizes the economizer for the local weather patterns by maximizing the long-term reward.
We formulate the following MDP in OpenAI gym [3]. State: outdoor temperature, outdoor humidity, return temperature, and return humidity. These states include the features needed for the outside air ratio (%OA) calculation and enthalpy (H) interpolation using psychrometric charts. We assume supply air temperature and humidity to be constant; Table 1 lists all the constants in our environment. Reward: based on the cooling energy saved, where the load to be removed by mechanical cooling is

    Q = 60 × 0.075 × CFM × (H_MA − H_SA),

where H_MA is the mixed air enthalpy, H_SA is the supply air enthalpy, the airflow is measured in CFM, and the constants account for the standard air density (0.075 lb/ft³) and the conversion factor from minutes to hours (60). The episode length is set to 200 steps so that the RL agent maximizes the long-term discounted reward. The reward accumulation occurs as part of the SAC critic loss function.
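The load term behind the reward reduces to the familiar 4.5 × CFM × Δh rule of thumb. The sketch below is illustrative; the default CFM value is an assumption, and the exact reward shaping used in production may differ.

```python
# Sketch of the load term behind the reward: Q (Btu/hr) =
# 60 min/hr * 0.075 lb/ft^3 * CFM * (H_MA - H_SA) = 4.5 * CFM * dH.
def cooling_load_btu_per_hr(h_mixed, h_supply, cfm):
    return 60.0 * 0.075 * cfm * (h_mixed - h_supply)

# The agent is rewarded for reducing this load; the default airflow below
# is an illustrative placeholder, not a measured value.
def reward(h_mixed, h_supply, cfm=2000.0):
    return -cooling_load_btu_per_hr(h_mixed, h_supply, cfm)
```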
We calculate the enthalpy, estimate the percentage of outside air (%OA) in the mixed air, and make control decisions with actions selected by the SAC model. The enthalpy of the mixed air can be estimated from temperature and humidity measurements using the standard psychrometric approximation

    h = 0.240 T + H (1061 + 0.444 T),    (2)

where h is enthalpy in Btu/lb, H is humidity, and T is temperature in °F. We use the same equation for each air stream, with the corresponding values of temperature and humidity.
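Equation 2 can be expressed directly in code. The coefficients are those of the standard moist-air psychrometric approximation; treat the function below as an illustrative sketch rather than the exact production routine.

```python
# Equation 2 as code: moist-air enthalpy h = 0.240*T + H*(1061 + 0.444*T),
# with T the dry-bulb temperature in deg F and H the humidity.
def enthalpy_btu_per_lb(temp_f, humidity):
    return 0.240 * temp_f + humidity * (1061.0 + 0.444 * temp_f)

# The same function is applied per air stream with that stream's readings,
# e.g. return air at 72 F (values here are illustrative):
h_return = enthalpy_btu_per_lb(72.0, 0.010)
```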
The economizer system utilizes sensors to monitor the return temperature (T_RA) and return air humidity (H_RA). Additionally, outdoor temperature (T_OA) and humidity (H_OA) sensors provide information about the external conditions. When the outside air enters through the damper, it mixes with the return air, resulting in a mixed air stream that passes through the heating and cooling coil before being supplied to the facility. The temperature of this mixed air is denoted as T_MA. Since most RTUs lack sensors for measuring mixed air temperature and humidity directly, an approximation can be made using the formula below:

    T_MA = %OA × T_OA + (1 − %OA) × T_RA.

The %OA is used to compute T_MA and H_MA as shown in Algorithm 1.
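The mixing approximation above is a convex combination weighted by %OA, applied to both temperature and humidity. A sketch, with made-up sensor readings:

```python
# Mixed-air approximation: a %OA fraction of outdoor air blends with a
# (1 - %OA) fraction of return air; the same weighting gives H_MA from the
# humidities. The sensor readings below are illustrative.
def mix(outdoor, return_air, pct_oa):
    assert 0.0 <= pct_oa <= 1.0
    return pct_oa * outdoor + (1.0 - pct_oa) * return_air

t_ma = mix(95.0, 75.0, pct_oa=0.25)   # mixed air temperature (deg F)
h_ma = mix(0.018, 0.009, pct_oa=0.25) # mixed air humidity
```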
In the case of differential enthalpy control with a fixed dry-bulb temperature, the economizer enabling setpoints depend on the maximum enthalpy and maximum dry-bulb temperature allowed for the economizer. This divides the psychrometric chart into two distinct operating regions, as shown in Figure 2.
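For comparison with the learned policy, the rule-based enabling logic described above can be sketched as follows. The threshold values are illustrative placeholders, not the ASHRAE-mandated setpoints.

```python
# Differential enthalpy control with a fixed dry-bulb limit: enable free
# cooling only when outdoor-air enthalpy is below both the return-air
# enthalpy and a maximum enthalpy setpoint, and the outdoor dry-bulb is
# below a maximum temperature. Defaults are illustrative placeholders.
def economizer_enabled(h_oa, h_ra, t_oa, h_max=28.0, t_max_f=75.0):
    return h_oa < min(h_ra, h_max) and t_oa < t_max_f
```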
In our formulation, the hourly time step and the energy-based reward function do not necessitate an RL solution, because each time step can be optimized independently. However, the use of neural networks with deep RL makes even this greedy optimization much easier to solve compared to rule-based control, as the agent learns the non-linearity shown in the psychrometric chart and automatically optimizes control for a specific site. The MDP framework also makes it easy to switch to a dynamic pricing or carbon intensity based reward in a future deployment, which does introduce a dependency across time steps.

VALIDATION WITH HISTORICAL DATA
We collected historical outdoor temperature, outdoor humidity, return temperature, and return humidity data from an RTU, recorded every 15 minutes from May 2018 to October 2021. We filtered out records that were either missing or anomalous, converted both outdoor and return humidities to decimal values clipped to the range [1e-5, 1], and resampled the data to hourly granularity by averaging. We hold out 1 month of data for validation. After data cleaning and preprocessing, the processed dataset contains around 12K records. We trained our SAC agent for a total of 200K time steps. Algorithm 1 describes each step in the process, using the enthalpy computation of Equation 2. Figure 3 compares the power consumption of different strategies for updating economizer setpoints for one RTU. The y-axis shows the average power consumption over the validation period, so lower is better. We compare against two baselines: no economizer, and the ASHRAE standard. The ASHRAE standard uses rules based on climate zones to approximately pick the appropriate setpoint for an economizer based on the psychrometric charts. The RL agent outperforms the ASHRAE standard by resolving the non-linearities in psychrometry and selecting a more precise setpoint that maximizes energy savings. Overall, the RL agent yields a 5% reduction in average power use. We deployed the solution on 10K+ RTUs across 200+ sites for hourly inferencing. We train one agent per site, but use the same policy for all the RTUs in a site. We re-train the RL agent each quarter to account for drift in environment conditions. We redact online deployment results due to proprietary reasons.
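The cleaning and resampling steps can be sketched with pandas; the column names and sample values here are assumptions, not the production schema.

```python
import numpy as np
import pandas as pd

# Sketch of the preprocessing described above: drop missing records, convert
# humidity percentages to decimals clipped to [1e-5, 1], and resample the
# 15-minute data to hourly means. Column names are illustrative assumptions.
def preprocess(df):
    df = df.dropna()
    for col in ("outdoor_humidity", "return_humidity"):
        df[col] = (df[col] / 100.0).clip(1e-5, 1.0)
    return df.resample("1h").mean()

idx = pd.date_range("2018-05-01", periods=8, freq="15min")
raw = pd.DataFrame({
    "outdoor_temp": np.linspace(90.0, 92.0, 8),
    "outdoor_humidity": np.full(8, 40.0),  # percent
    "return_temp": np.full(8, 75.0),
    "return_humidity": np.full(8, 50.0),   # percent
}, index=idx)
hourly = preprocess(raw)  # two hourly rows from eight 15-minute records
```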

DEPLOYMENT ARCHITECTURE
We leverage cloud services to scale our deployment, Figure 4 shows our architecture.We discuss the challenges in scaling and the solutions employed below.

Data standardization
Building data is often disorganized, with inconsistent naming conventions for sites, assets, and sensors due to various factors such as the different teams, vendors, or engineers involved during setup. As a result, deploying downstream analytics, including RL policies, becomes a time-consuming process that requires customization for each site, which limits scalability. These challenges are well-known and have been documented in prior works [1, 2].
To address the data standardization challenge, we leveraged the Brick schema, which provides a well-defined ontology for building-related data, ensuring consistency and uniformity in data representation and structure [1]. We load standardized sites, RTUs, sensors, and their relevant relationships into a graph database. We use SPARQL queries to retrieve relevant sites, assets, and points associated with our MDP. A sample query is shown below.

SELECT ?point ?timeseries_id WHERE {
    ?site BRICK:hasName "site1" .
    ?rtu BRICK:hasName "RTU1" .
    ?site BRICK:hasLocation ?rtu .
    ?rtu a BRICK:RTU .
    ?rtu BRICK:hasPoint ?point .
    ?point a BRICK:Zone_Air_Temperature_Setpoint .
    ?point BRICK:timeseries ?timeseries_id .
}

Data scalability
Traditional relational databases are not designed to simultaneously handle the long-term requirements of model training alongside the real-time requirements of model inference and analytics, particularly in our case of multiple years of minute to sub-minute level data from millions of sensors. While certain data architecture optimizations such as redundancy, joins, partitioning, and other techniques may be effective in specific use cases, the ever-changing and ad-hoc nature of analytics continually introduces new requirements. We utilize a combination of long-term (cold) storage and short-term (hot) storage to ensure a robust and scalable data management approach for model training, model inference, and other ad-hoc analytical use cases. For long-term storage of training data that is only accessed for quarterly retraining, we leverage block storage to store years of data at the lowest granularity possible. This method is cost-efficient for infrequent reads yet allows for the transformation of the raw data into other storage systems for analytical use cases outside of energy optimization. For model inference and real-time analytics, we store incoming streaming sensor data in purpose-built timeseries SQL databases. These tools are specifically designed to handle the simultaneous reading and writing of high-volume and high-frequency sensor data.
After retrieving the relevant data points from a SPARQL query (the results include a timeseries id), we parallelize the SQL queries to retrieve the time series data across sites and RTUs efficiently. Each query focuses on a specific site and RTU combination. By combining the power of a graph database and parallel SQL queries to timeseries databases, we effectively leverage standardized data to scale our deployment.
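The parallel fan-out of per-(site, RTU) queries can be sketched with a thread pool; `fetch_timeseries` is a hypothetical stand-in for the real timeseries database client, which is not shown here.

```python
from concurrent.futures import ThreadPoolExecutor

# One query per (site, RTU, timeseries_id) triple, issued in parallel.
# `fetch_timeseries` is a hypothetical stub for the real timeseries SQL
# client; it returns a placeholder result here.
def fetch_timeseries(site, rtu, ts_id):
    return {"site": site, "rtu": rtu, "ts_id": ts_id, "rows": []}

def fetch_all(pairs, max_workers=32):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fetch_timeseries, *p) for p in pairs]
        return [f.result() for f in futures]  # preserves submit order

results = fetch_all([("site1", "RTU1", "ts-001"), ("site1", "RTU2", "ts-002")])
```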

Compute scalability
We aim to concurrently perform inference across hundreds of sites every hour, and run model training every quarter. We rely on the serverless capabilities provided by cloud compute to meet these scaling requirements.
In our use case, the SPARQL queries return a list of sites with a list of RTUs per site. During model training, we use serverless orchestration tools to map training jobs to every RTU associated with a site, resulting in tens of thousands of concurrent jobs. We process the concurrent training jobs using a batch processing service, which is suited for the GPU-intensive process of environment simulation and subsequent training of the RL agent. We store the resulting policies in long-term storage for later model inference. These concurrent training jobs take up to an hour each.
During model inference, we also use workflow orchestration to map inference jobs to every RTU associated with a site. However, we process the concurrent jobs using function-as-a-service (FaaS) instead of batch processing. We use CPUs in FaaS to reduce costs and availability issues, and complete concurrent inferences within 5 minutes on average.
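The per-site inference fan-out (one shared policy per site, applied to each of its RTUs) can be sketched as below; `load_policy` and the fixed setpoint it returns are hypothetical placeholders for the trained SAC policy artifact.

```python
# Sketch of per-site inference: a single trained policy is loaded once per
# site and applied to every RTU at that site. `load_policy` and the fixed
# setpoint are placeholders, not the real SAC policy.
def load_policy(site):
    return lambda state: {"econ_setpoint": 65.0}

def recommend_all(site, rtus, states):
    policy = load_policy(site)  # one policy shared by all RTUs in the site
    return {rtu: policy(states[rtu]) for rtu in rtus}

recs = recommend_all("site1", ["RTU1", "RTU2"], {"RTU1": {}, "RTU2": {}})
```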
Using this architecture, we have already scaled to 10K+ RTUs, and we do not foresee difficulties in continuing to scale. The cost of our deployment is up to $50K/year, and we estimate saving >$250K in annual energy bills.

Human-in-the-loop Control
Although we have been able to automatically recommend economizer setpoints across hundreds of sites, we have not yet automated the command and control of the RTUs themselves. Instead, we output the recommended setpoints to a web-based front-end that displays the setpoints as well as the potential energy savings. Building operators review and input these recommended setpoints manually. The reasons for this are twofold. The first is the ongoing security discussion associated with automating command and control. While integration with the command-and-control suite would allow setpoint automation, it would also grant access to more critical setpoints that could disable the RTU completely. The second reason is the novelty of, and subsequent lack of trust in, the algorithm, especially given the complexity of an RL solution. We are currently evaluating the performance with end users and stakeholders to establish trust in the recommended setpoints. Once the security issues are resolved, we expect building operators will slowly taper off the manual validations and allow more setpoints to be automated, at which time we can expect to fully realize the energy savings demonstrated in simulation.

Figure 1: Illustration of an RTU equipped with an economizer

Figure 2: Psychrometric Chart for Economizer Operation

Figure 3: Performance comparison of our RL agent against ASHRAE standard policy for one RTU.

Figure 4: Cloud deployment architecture

Table 1: Constants in the RL environment.