Towards Invariant Time Series Forecasting in Smart Cities

In the transformative landscape of smart cities, the integration of cutting-edge web technologies into time series forecasting presents a pivotal opportunity to enhance urban planning, sustainability, and economic growth. The advancement of deep neural networks has significantly improved forecasting performance. However, a notable challenge lies in the ability of these models to generalize well to out-of-distribution (OOD) time series data. The inherent spatial heterogeneity and domain shifts across urban environments create hurdles that prevent models from adapting and performing effectively in new urban environments. To tackle this problem, we propose to derive invariant representations for more robust prediction across urban environments, rather than relying on spurious correlations, thereby improving generalizability. Through extensive experiments on both synthetic and real-world data, we demonstrate that our proposed method outperforms traditional time series forecasting models when tackling domain shifts in changing urban environments. The effectiveness and robustness of our method extend to diverse fields including climate modeling, urban planning, and smart city resource management.


INTRODUCTION
Time series data plays a pivotal role in analyzing, monitoring, and simulating the development and design of smart cities. Extensive research across various domains has leveraged this data for applications including weather forecasting [2], temperature monitoring [13], and enhancing information systems [14]. Despite these advancements, analyzing time series data in urban environments introduces significant challenges due to geographic domain shifts. Such shifts represent a critical barrier in forecasting efforts, as models must not only capture temporal dependencies but also discern and adapt to invariant relationships within diverse and changing urban environments. This research seeks to address these challenges by developing a robust model capable of navigating the complexities introduced by urban variability, thereby contributing to the foundational technologies necessary for the smart cities of the future [9].
As the field has progressed, time series forecasting methods for smart cities have evolved significantly. Vector Autoregression (VAR) models recognize the interdependencies among multiple variables in a time series and leverage them to predict future values, but they are limited by their assumption of linearity and their inability to handle non-linear relationships. Autoregressive Integrated Moving Average (ARIMA) models [4], built upon the foundation of Autoregressive Moving Average (ARMA) models, address the non-stationarity that ARMA models cannot handle. However, ARIMA models rely on strict assumptions about data properties and can be computationally inefficient on large datasets. Recurrent Neural Networks (RNNs) [11], including Long Short-Term Memory (LSTM) [7] and Gated Recurrent Unit (GRU) [6], revolutionized traditional methods by effectively capturing temporal dependencies. However, they struggle with long-range dependencies, are susceptible to vanishing or exploding gradients, and are data hungry [3]. These limitations also manifest in location-aware time series forecasting. The Transformer [12] has emerged as a significant advancement, addressing the limitations of earlier methods through its self-attention mechanism. By capturing both short- and long-range dependencies, the Transformer excels at enhancing forecasting accuracy and surpasses the constraints of previous approaches. Nevertheless, while the Transformer and RNNs have demonstrated their effectiveness in time series forecasting, they often struggle when confronted with geographic domain shifts, where the target urban environments differ significantly from the source urban environments (see Figure 1). In such cases, these models tend to perform unsatisfactorily [8, 10].
Consider an illustrative example concerning urban dynamics that highlights the complexities of predicting urban air pollution, a critical task in time series forecasting. Traditional approaches often fall short by failing to recognize invariant causal relationships between variables, relying instead on spurious correlations that do not hold across different contexts. In this example, X represents the level of traffic congestion, Y denotes the air pollution level, and Z indicates the prevalence of respiratory illnesses within the same urban area. We establish a direct causal link between traffic congestion (X) and air pollution (Y): as traffic congestion increases, so does air pollution, primarily due to higher vehicle emissions and engine idling. Moreover, there is a consequential relationship between air pollution (Y) and respiratory health issues (Z): increased exposure to polluted air significantly raises the risk of respiratory ailments, leading to a higher prevalence of such diseases. This interplay illustrates the critical need for air pollution forecasting models that are precise, causally aware, and capable of identifying and leveraging these invariant causal relationships. Our work specifically addresses the intricate dynamics of air pollution within urban landscapes. By focusing on these dynamics, we aim to contribute to the development of more sustainable urban environments, providing accurate predictions that can inform policy decisions and urban planning strategies and ultimately help reduce air pollution and mitigate its adverse health effects. Our approach integrates causal inference techniques with time series forecasting, offering a novel perspective on urban air pollution.
However, the correlation between Y and Z may be spurious when considering different urban environments. Some urban areas may have effective pollution control measures, ample green spaces, or favorable air circulation patterns that mitigate the impact of air pollution on respiratory health. Consequently, the correlation between Y (air pollution level) and Z (prevalence of respiratory illness) may be weakened or even absent in such areas. To improve accuracy, it is essential that the model considers the underlying causal factors and avoids being misled by spurious correlations. To overcome this limitation, we propose InvarNet, an innovative framework specifically designed to tackle the challenges of OOD generalization in location-aware time series forecasting for smart cities. InvarNet consists of Invariant LSTM (Invar-LSTM) and Invariant Transformer (Invar-Transformer) models, built upon the LSTM and Transformer architectures, respectively. Integrating the invariant risk minimization (IRM) [1] framework enables the models to handle geographic domain shifts effectively, improving their time series forecasting capabilities and ensuring robust performance. In InvarNet, we begin by partitioning the time series based on their respective geographic locations. This partitioning allows us to isolate and analyze the data within specific urban environments. We then train our invariant time series forecasting model on data from the source environments. Our proposed model is designed to encourage the learning of invariant relationships rather than relying on spurious correlations that may be present across diverse environments; it therefore generalizes effectively to the target geographic domain. The development of InvarNet represents a significant contribution to the field and provides a solid foundation for further exploration and advancement in OOD location-aware time series forecasting. Through comprehensive evaluations on both synthetic and real-world location-aware time series data, we demonstrate the effectiveness of our approach, which also opens a new avenue for research and development in smart cities.

PROBLEM FORMULATION
Consider a multivariate time series $X = \langle x_1, x_2, \dots, x_T \rangle$, where the measurement $x_t \in \mathbb{R}^m$ recorded at time step $t$ has $m$ attributes. A set of location-aware multivariate time series is denoted as $D = \{(l_i, X_i)\}_{i=1}^{N}$, where $N$ is the number of observed locations, $l_i = (\mathrm{lat}_i, \mathrm{lon}_i)$ represents the geographic coordinates (latitude and longitude) of a specific location, and $X_i \in \mathbb{R}^{T \times m}$ represents the observed multivariate time series at the $i$-th location. Problem Statement: In the context of location-aware multivariate time series, our objective is to develop a mapping function $f(\cdot)$ on the training set $D_{tr} \subset D$, which consists of observations $\langle x_1, x_2, \dots, x_t \rangle$ from $N_{tr}$ locations. The aim is to predict $x_{t+h}$, where $h$ is the desired number of time steps into the future from the current time step, using the historical data $\langle x_1, x_2, \dots, x_t \rangle$. Furthermore, we aim to ensure that this mapping function exhibits robust performance when applied to other geographic locations within $D$ that were not included in the training set.
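To make the formulation concrete, the sketch below shows one way to organize a location-aware dataset and slice one location's series into supervised (past, future) pairs. The names (`make_windows`, `history`, `horizon`) and the toy data are ours, not from the paper.

```python
from typing import Dict, List, Tuple

# Each entry maps a (lat, lon) coordinate l_i to a T x m matrix X_i.
Location = Tuple[float, float]
Series = List[List[float]]  # T rows, m attributes per row

def make_windows(series: Series, history: int, horizon: int):
    """Slice one location's series into (past, future) training pairs:
    the window x_{t-history+1}..x_t predicts the value h = horizon
    steps after the window ends."""
    pairs = []
    for t in range(history, len(series) - horizon + 1):
        past = series[t - history:t]       # historical observations
        future = series[t + horizon - 1]   # x_{t+h}
        pairs.append((past, future))
    return pairs

# Toy dataset: two observed locations, T = 10 steps, m = 2 attributes.
dataset: Dict[Location, Series] = {
    (39.9, 116.4): [[float(t), float(t) * 2] for t in range(10)],
    (22.5, 114.1): [[float(t) + 1, 0.0] for t in range(10)],
}

pairs = make_windows(dataset[(39.9, 116.4)], history=3, horizon=2)
```

Each pair couples three past steps with the measurement two steps after the window, matching the $x_{t+h}$ target in the problem statement.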

METHODOLOGY
In this section, we present our novel forecasting approaches by developing IRM training algorithms on two deep neural network architectures: Invariant Long Short-Term Memory (Invar-LSTM, see Figure 2) and Invariant Transformer (Invar-Transformer, see details in Subsection 3.2).

Invariant Long Short-Term Memory
First, we briefly introduce the application of LSTM in location-aware time-series forecasting. Given a training set $D_{tr}$, which contains multivariate time series from a variety of locations, our goal is to train an LSTM network $f(\cdot)$ that maps $x_{1:t}$ to $\hat{y}_{t+h}$. We optimize the network by minimizing the loss function $\mathcal{L}(\hat{y}_{t+h}, y_{t+h})$, where $y_{t+h}$ denotes the ground truth and $\hat{y}_{t+h}$ is the predicted value. A common modeling choice is for $\mathcal{L}$ to be convex and differentiable, such as mean squared error or cross-entropy. We observe that directly training the LSTM network in this traditional empirical risk minimization manner leads to poor performance when the model is applied to locations not included in $D_{tr}$. We therefore explore the new IRM training paradigm for location-aware time-series forecasting. To capture the temporal dependencies within the time series, we employ the Long Short-Term Memory (LSTM) architecture. At each time step $t$, the hidden state $h_t \in \mathbb{R}^{d_h}$ is computed as
$$h_t = \sigma(W_x x_t + W_h h_{t-1} + b),$$
where $W_x \in \mathbb{R}^{d_h \times m}$ is the weight matrix for inputs, $W_h \in \mathbb{R}^{d_h \times d_h}$ is the weight matrix for hidden states from the previous time step, and $b \in \mathbb{R}^{d_h}$ is the bias term.
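As a rough illustration of the recurrence, the sketch below uses a plain tanh update rather than the full gated LSTM cell; the input/forget/output gates are deliberately omitted, and all weights are hand-picked toy values.

```python
import math

def recurrent_step(x_t, h_prev, W_x, W_h, b):
    """One simplified recurrent update h_t = tanh(W_x x_t + W_h h_prev + b).
    A real LSTM adds gating; this keeps only the recurrence that carries
    temporal dependencies forward."""
    h_t = []
    for i in range(len(b)):
        s = b[i]
        s += sum(W_x[i][j] * x_t[j] for j in range(len(x_t)))
        s += sum(W_h[i][j] * h_prev[j] for j in range(len(h_prev)))
        h_t.append(math.tanh(s))
    return h_t

# Run a length-3 univariate series through a 2-unit hidden state.
W_x = [[0.5], [-0.3]]
W_h = [[0.1, 0.0], [0.0, 0.1]]
b = [0.0, 0.0]
h = [0.0, 0.0]
for x in ([1.0], [0.5], [2.0]):
    h = recurrent_step(x, h, W_x, W_h, b)
```

The final `h` summarizes the whole input window and would be fed to the dense prediction layer described next.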
In a traditional LSTM approach, the hidden state $h_t$ is typically passed through a dense layer to predict future time steps, giving $\hat{y} = \sigma(W_d \cdot h_t)$, where $\hat{y} \in \mathbb{R}^{m \times k}$ covers the $m$ features over the $k$ future time steps to be predicted.
To derive more generalizable prediction models, we integrate the IRM training scheme to enhance both prediction accuracy and robustness across different locations by encouraging the time-series forecasting models to learn invariant representations in changing urban environments.
In the case of Invar-LSTM, we introduce an additional step. After obtaining the result of the dense layer, denoted as $p = \sigma(W_d \cdot h_t)$ with $p \in \mathbb{R}^{m \times k}$ and $W_d \in \mathbb{R}^{m \times k \times d_h}$, we incorporate an invariant weight matrix $w$, an all-ones matrix of size $\mathbb{R}^{m \times k}$. The output is calculated as $\hat{y} = p \odot w$, where $\odot$ denotes the Hadamard product. To optimize the model, the objective is defined as
$$\min_{\Phi} \sum_{e \in E_{tr}} \left[ R^e(\Phi) + \lambda \left\lVert \nabla_{w \mid w = 1.0}\, R^e(w \odot \Phi) \right\rVert^2 \right], \qquad (2)$$
where $E_{tr}$ represents the urban environments in the training set, $R^e$ denotes the metric-specific loss function on environment $e$, and $\lambda \in [0, \infty)$ is a regularization parameter. The second term penalizes the deviation of the gradients with respect to $w$ from their values at $w = 1.0$.
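A minimal sketch of an IRMv1-style penalty under squared loss, assuming a scalar dummy weight $w$ scaling the predictions; the closed-form gradient below holds only for this squared-loss setup, and the function name and toy environments are illustrative, not the paper's implementation.

```python
def irm_objective(envs, lam):
    """IRMv1-style objective for a fixed predictor under squared loss.

    envs: list of environments, each a list of (p_i, y_i) pairs where p_i
    is the model's prediction and y_i the ground truth. A scalar dummy
    weight w scales every prediction; the penalty is the squared gradient
    of each environment's risk with respect to w, evaluated at w = 1.0.
    """
    total = 0.0
    for env in envs:
        n = len(env)
        risk = sum((y - p) ** 2 for p, y in env) / n        # R^e at w = 1
        grad = sum(-2.0 * (y - p) * p for p, y in env) / n  # dR^e/dw at w = 1
        total += risk + lam * grad ** 2
    return total

# A predictor that fits environment A exactly pays a penalty only on B.
env_a = [(1.0, 1.0), (2.0, 2.0)]   # predictions match targets, grad = 0
env_b = [(1.0, 2.0), (2.0, 4.0)]   # predictions off by a factor of 2
loss = irm_objective([env_a, env_b], lam=10.0)
```

Here env_b contributes risk 2.5 and gradient -5, so the penalty term dominates: a predictor whose errors vary across environments is punished, pushing training toward environment-invariant solutions.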

Invariant Transformer
We also leverage the capabilities of the transformer architecture. Recurrent models face inherent limitations in parallelization, especially when confronted with longer sequence lengths; the transformer removes this sequential bottleneck, paving the way for enhanced parallelization and improved performance.
In the training of Invar-Transformer, we also incorporate the essential step of dividing the training set by location. Once the data is separated, we input the respective subsets into the model. Subsequently, we apply positional encoding to the data $x_t$ at time $t$:
$$\mathrm{PE}(t, 2i) = \sin(\omega_i t), \qquad \mathrm{PE}(t, 2i+1) = \cos(\omega_i t),$$
where $\mathrm{PE}$ represents the positional encoding and $\omega_i$ is the manually designed frequency for each dimension $i$. To integrate positional information, we sum the input embedding and the positional embedding, denoted as $u_t$; this operation incorporates the positional encoding into the input data.
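A small sketch of sinusoidal positional encoding. We assume the standard Transformer frequencies $\omega_i = 1/10000^{2i/d}$; the paper's manually designed $\omega_i$ may differ.

```python
import math

def positional_encoding(t, d_model):
    """Sinusoidal positional encoding for time step t in a d_model-dim
    space. Frequencies omega_i = 1 / 10000^(2i/d_model) follow the
    original Transformer; interleaves sin and cos per frequency."""
    pe = []
    for i in range(d_model // 2):
        omega = 1.0 / (10000 ** (2 * i / d_model))
        pe.append(math.sin(omega * t))
        pe.append(math.cos(omega * t))
    return pe

# The encoding is summed element-wise with the embedding of x_t.
pe0 = positional_encoding(0, 4)
```

At $t = 0$ every sine term is 0 and every cosine term is 1, so `pe0` is `[0.0, 1.0, 0.0, 1.0]`; later time steps rotate each (sin, cos) pair at its own frequency, giving each position a distinct code.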
To acquire the query matrix $Q$, key matrix $K$, and value matrix $V$, we utilize three separate weight matrices $W_Q$, $W_K$, and $W_V$, each multiplied by $u_t$ to generate the respective matrix.
In the self-attention module, after obtaining $Q$, $K$, and $V$ from the input time series $u_t$, the computation uses a scaled dot-product self-attention mechanism:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,$$
where $d_k$ is the dimension of the keys.
The Transformer utilizes multi-head attention (M-H Attention) with $H$ distinct sets of learned projections instead of a single attention function:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{Attn}_1, \dots, \mathrm{Attn}_H)\, W_O,$$
where $\mathrm{Attn}_i = \mathrm{Attention}(Q W_Q^i, K W_K^i, V W_V^i)$ is the $i$-th self-attention module and $W_O$ denotes the output weight matrix. Notably, the output $M$ of the multi-head attention module keeps the same dimension as the input $u_t$. The value of $M$ is subsequently fed through the position-wise feed-forward layer, which comprises two linear transformations with a ReLU activation applied in between:
$$\mathrm{FFN}(M) = \max(0, M W_1 + b_1)\, W_2 + b_2.$$
The dimension of $\mathrm{FFN}(M)$ again remains consistent with the input $u_t$. We then map $\mathrm{FFN}(M)$ to the output $p \in \mathbb{R}^{m \times k}$ for each location. In the Invar-Transformer, after obtaining $p \in \mathbb{R}^{m \times k}$, we incorporate an invariant weight matrix $w$ that is an all-ones matrix of size $\mathbb{R}^{m \times k}$, similar to the approach used in Invar-LSTM. The output is calculated as $\hat{y} = p \odot w$. To train the transformer model, we optimize the same objective function (Equation 2) as used in Invar-LSTM.
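The scaled dot-product attention above can be sketched in a few lines of plain Python. This is a single head with no learned projections, purely illustrative of the softmax-weighted mixing of values.

```python
import math

def softmax(xs):
    m = max(xs)                      # shift for numerical stability
    exps = [math.exp(v - m) for v in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    Q, K, V are lists of row vectors; d_k is the key dimension."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# One query attending over two keys; the matching key dominates the mix.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out = attention(Q, K, V)
```

Because softmax weights sum to one, each output row is a convex combination of the value rows, with more mass on the value whose key aligns with the query.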

EXPERIMENTS

Datasets
Synthetic Data: We first evaluate our methods using synthetic data generated by a structural equation model (Equation (6)) in which temporal dependencies among variables are represented as recursive equations. Specifically, $X_t$ is influenced by its previous value; $Y_t$ depends on its previous value as well as $X_{t-1}$; and $Z_t$ is influenced by its previous value, $Y_{t-1}$, and a normally distributed noise term with zero mean and unit variance. Moreover, in the urban environment represented by $e \in E_{tr}$, the noise variance $\sigma^2$ varies across urban environments. Real-world Data: We conducted our experiments using a real-world dataset [15] consisting of air quality measurements collected in Beijing (BJ), Shenzhen (SZ), and Guangzhou (GZ), China, visualized in Figure 3. The dataset covers the period from May 1, 2014, to April 30, 2015, and measures six attributes: PM2.5, PM10, NO2, CO, O3, and SO2. These attributes are monitored at multiple stations: Beijing has 36 stations, Guangzhou has 30, and Shenzhen has 10. Each station is associated with precise latitude and longitude coordinates, enabling accurate spatial analysis of the air quality data. The dataset provides a high temporal resolution, with hourly measurements of pollutant concentrations; this granularity allows detailed analysis of temporal variations, facilitating the identification of daily fluctuations and seasonal trends in air pollution.
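A hedged generator for such synthetic environments: the dependency structure ($X_t \leftarrow X_{t-1}$; $Y_t \leftarrow Y_{t-1}, X_{t-1}$; $Z_t \leftarrow Z_{t-1}, Y_{t-1}, \mathcal{N}(0,1)$) and the environment-dependent $\sigma^2$ follow the description above, while the lag coefficient `decay` and the unit causal weights are our own illustrative choices, not the paper's exact Equation (6).

```python
import random

def generate_env(sigma2, T, seed=0, decay=0.5):
    """Sample one environment's series under the described causal structure.
    Only sigma2 (the environment-dependent noise variance) and the lag
    dependencies come from the paper; coefficients are assumed."""
    rng = random.Random(seed)
    sigma = sigma2 ** 0.5
    X = [rng.gauss(0, sigma)]
    Y = [rng.gauss(0, sigma)]
    Z = [rng.gauss(0, 1)]
    for t in range(1, T):
        X.append(decay * X[-1] + rng.gauss(0, sigma))        # X_t <- X_{t-1}
        Y.append(decay * Y[-1] + X[t - 1] + rng.gauss(0, sigma))  # + X_{t-1}
        Z.append(decay * Z[-1] + Y[t - 1] + rng.gauss(0, 1))      # + Y_{t-1}
    return X, Y, Z

# Two training environments that differ only in the noise variance sigma^2.
env1 = generate_env(sigma2=1.0, T=500, seed=1)
env2 = generate_env(sigma2=2.0, T=500, seed=2)
```

Varying `sigma2` across environments is what later makes the Y-Z association unstable while leaving the X-to-Y pathway intact.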

Evaluation Metrics
Denote the predicted results and the ground truth in the testing set across $N_{te}$ geographic locations as $\hat{y}_{t+h}$ and $y_{t+h}$, respectively. The error metrics are defined as
$$\mathrm{MAE} = \frac{1}{N_{te}} \sum_{i=1}^{N_{te}} \left| \hat{y}^i_{t+h} - y^i_{t+h} \right|, \qquad \mathrm{MSE} = \frac{1}{N_{te}} \sum_{i=1}^{N_{te}} \left( \hat{y}^i_{t+h} - y^i_{t+h} \right)^2.$$
For both MAE and MSE, lower values indicate better accuracy and prediction performance.
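The two metrics, sketched over flattened prediction/target vectors:

```python
def mae(pred, true):
    """Mean absolute error over all locations and horizons."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

def mse(pred, true):
    """Mean squared error over all locations and horizons."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)

pred = [2.0, 4.0, 6.0]
true = [1.0, 4.0, 8.0]
err_mae = mae(pred, true)   # (1 + 0 + 2) / 3 = 1.0
err_mse = mse(pred, true)   # (1 + 0 + 4) / 3
```

MSE squares each residual, so it penalizes the single large error on the third point more heavily than MAE does.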

Main Results
Results of Synthetic Data Analysis: We conducted our experiments initially on synthetic data. The purpose was to showcase the superiority of the invariance-based time series forecasting model (Invar-TSModel) in learning invariant representations, compared with the baseline TSModel trained via empirical risk minimization. In our synthetic time series data, as shown in Equation (6), there are two distinct invariance relationships. The first, denoted $X_{t-1} \to X_t$, originates from a temporal perspective: it is driven by temporal dependence, where the historical data $X_{t-1}$ serves as the causal factor influencing the data observed at the current time step, $X_t$. The second invariance relationship is from a spatial perspective, where attribute X serves as the cause and Y as the effect. Additionally, the relationship between Y and Z is invariant only in a specific urban environment; we refer to it as a spurious correlation in other urban environments. In our synthetic time series data, the causal relationship from X to Y remains invariant and unaffected by changes in the environment noise variance $\sigma^2$. However, the relationship between Y and Z holds only when $\sigma^2 = 1$; for other urban environments where $\sigma^2 \neq 1$, this relationship breaks down and constitutes a spurious correlation. In such instances, the causal structure can be depicted as shown in Figure 4. As a result, when attempting to predict Y using both X and Z, the presence of the spurious correlation between Z and Y can weaken the performance of the model.
To be more specific, given the data in Equation (6), for the model $Y = \beta_1 X$, the optimal solution is $\beta_1^* = 1$. Similarly, for the model $Y = \beta_2 Z$, the optimal solution is $\beta_2^* = \frac{\sigma^2}{\sigma^2 + 0.5}$. When utilizing both X and Z to predict Y with the model $Y = \beta_1 X + \beta_2 Z$, the optimal solutions are $\beta_1^* = \frac{1}{\sigma^2 + 1}$ and $\beta_2^* = \frac{\sigma^2}{\sigma^2 + 1}$, respectively. We provide the derivation in Appendix A. These results highlight the impact of the varying urban environments $\sigma^2$ on the coefficients when predicting Y. In other words, when using X alone to predict Y, the coefficient remains unaffected by changes in the urban environment. However, when incorporating Z (either alone or in combination with other invariant variables) to predict Y, the coefficient is influenced by the varying urban environment $\sigma^2$.
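These optima can be checked numerically by minimizing the population risk $\sigma^2(\beta_1 + \beta_2 - 1)^2 + \sigma^2(\beta_2 - 1)^2 + \beta_2^2$ (derived in Appendix A) with plain gradient descent; the toy solver below is ours, not the paper's code.

```python
def population_risk_grad(b1, b2, sigma2):
    """Gradient of E[(Y - b1*X - b2*Z)^2] under the structural model
    X = e1, Y = X + e2, Z = Y + e3 with e1, e2 ~ N(0, sigma2),
    e3 ~ N(0, 1); the risk expands to
    sigma2*(b1+b2-1)^2 + sigma2*(b2-1)^2 + b2^2."""
    g1 = 2 * sigma2 * (b1 + b2 - 1)
    g2 = 2 * sigma2 * (b1 + b2 - 1) + 2 * sigma2 * (b2 - 1) + 2 * b2
    return g1, g2

def fit(sigma2, lr=0.05, steps=5000):
    """Minimize the convex quadratic risk by gradient descent."""
    b1 = b2 = 0.0
    for _ in range(steps):
        g1, g2 = population_risk_grad(b1, b2, sigma2)
        b1 -= lr * g1
        b2 -= lr * g2
    return b1, b2

# The joint-regression coefficients track the environment:
# b1* = 1/(sigma2+1), b2* = sigma2/(sigma2+1).
b1_a, b2_a = fit(sigma2=1.0)   # expect (0.5, 0.5)
b1_b, b2_b = fit(sigma2=2.0)   # expect (1/3, 2/3)
```

The fitted pair shifts as $\sigma^2$ changes, which is exactly the environment-dependence that makes a model leaning on Z fail to transfer.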
This theoretical analysis highlights a crucial insight: certain time-series forecasting models may fail to generalize because they do not capture the invariant relationship between X and Y; instead, they place greater emphasis on the spurious correlation between Z and Y that occurs only in specific urban environments, limiting their generalizability. To validate this idea, we conducted a comparative analysis of the performance of a traditional TSModel and its invariance-based counterpart on the synthetic data. The results of this comparison are presented in Table 1. The training environments included three distinct types. For the first type (Env-Type=2), with a total of two training environments, we set the test environment to $E_{te} = \{\sigma^2_e = 2.0\}$. Notably, our proposed Invar-TSModel showed better forecasting performance than the traditional TSModel under this setting (see Figure 5). For the second type (Env-Type=3-1B), we introduced additional training data that exhibited a larger domain shift relative to the target environment, keeping $E_{te} = \{\sigma^2_e = 2.0\}$. In this scenario, the traditional TSModel experienced a decline in performance, while the Invar-TSModel demonstrated relatively better robustness in the face of domain shifts, although its performance was not as strong as in the previous setting. Finally, in the third setting (Env-Type=3-2G), we added data specifically from the target environment $\{\sigma^2_e = 2.0\}$, and the performance of the traditional model improved. Under these conditions, our proposed Invar-TSModel continued to outperform the traditional TSModel.
These findings on synthetic data illustrate the advantage of the Invar-TSModel in handling domain shifts caused by varying training urban environments, as it consistently outperforms the traditional TSModel across diverse settings. Results of Real-World Data Analysis: Our experiments on real-world data were designed with three different settings. First, we selected a station from Beijing, denoted BJ-0, to serve as the test environment for all three settings. In the first setting, we chose two other stations from Beijing, denoted BJ-1 and BJ-2, as the training environments. In the second setting, we selected two stations from Shenzhen, namely SZ-1 and SZ-2, as the training environments. In the third setting, we kept SZ-1 in the training environment but replaced SZ-2 with a station from Guangzhou, GZ-1. For all three settings, we used historical data from the previous seven days to predict values for the following three days. We trained LSTM, Invar-LSTM, Transformer, and Invar-Transformer models on the respective training environments for each setting. Except for the invariance mechanism, the backbone structures of the LSTM and Transformer models remained the same. The performance of these models was then evaluated on the same BJ-0 test environment. The results of these evaluations can be found in Table 2.
In our analysis of the results, we observed distinct patterns across the different settings. In the first setting, where there was a small geographic domain shift within Beijing, the invariance-based TSModel showed a slight improvement over the traditional TSModel. This suggests that the invariance mechanism helps mitigate the effects of geographic shifts, albeit to a limited extent. The second setting involved a larger geographic domain shift from Shenzhen to Beijing, reflected in a significant increase in the mean square error of the baseline model. However, Invar-LSTM exhibited a notable improvement in performance relative to the baseline compared with the first setting. This improvement can be attributed to the inherent ability of Invar-LSTM to learn invariant representations, which proves beneficial in the face of larger geographic domain shifts. In the third setting, we trained with data from multiple cities by replacing SZ-2 with GZ-1, and LSTM performed better than in the second setting. This indicates that the distribution of time series data from GZ-1 is more similar to BJ-0 than that of SZ-2. In this scenario, Invar-LSTM showcased its capability to handle the invariant relationships that arise in more diverse urban environments, resulting in better performance than LSTM alone. Additionally, we extended our analysis to the popular Transformer models by integrating the invariance mechanism, and the combined approach outperformed the LSTM variants. By leveraging the advantages of the Transformer architecture and incorporating the invariance mechanism, we achieved the best performance in addressing geographic domain shifts in location-aware time series forecasting.

DISCUSSION
Our research was motivated by the causal invariance concept, addressing the out-of-distribution problem in location-aware time series forecasting for smart cities. We started by exploring spatio-temporal and causal analysis, recognizing that certain time series models excel at capturing temporal invariance. However, many time series models overlook this crucial aspect in the spatial domain, leading to poor performance in such situations. Some models modify the inner architecture to enhance spatial considerations, but often at the cost of increased computational complexity. In contrast, we aimed to develop an efficient and compact model that handles both spatial and temporal invariant relationships, thereby improving robustness and accuracy.
Our experiments revealed the prevalence of geographic domain shifts within the data, even within a single city.Another challenge arose from uneven training data volume, where, for example, City 1 had 10,000 data points while City 2 had only 2,000.Such data imbalance could cause models to favor fitting to City 1, resulting in poor generalization to other locations.Leveraging the geo-location aspect in combination with time series data, we devised techniques to encourage the model to identify and adapt to invariant relationships present in each environment, despite the changing conditions.

A APPENDIX
To understand the impact of changing geographic environments $\sigma^2$ on the performance of the model, we first assume a specific data generation process that follows the structural equation model
$$X = \varepsilon_1, \qquad Y = X + \varepsilon_2, \qquad Z = Y + \varepsilon_3, \qquad (9)$$
where $\varepsilon_1, \varepsilon_2 \sim \mathcal{N}(0, \sigma^2)$ and $\varepsilon_3 \sim \mathcal{N}(0, 1)$. In different geographic environments, $\sigma^2$ changes with the environment. Next, we demonstrate the impact of changing geographic environments on the predictive performance of the model.
Given a model for predicting Y using variables X and Z, where X, Y, and Z follow the structural equation model in Equation (9): First, we assume that our objective is to find an optimal value for $\beta_1$ such that the estimator $f(X) = \beta_1 X$ provides a reliable approximation of Y. In this case, the objective function is
$$\min_{\beta_1} \mathbb{E}\left[(Y - \beta_1 X)^2\right] = \min_{\beta_1} \mathbb{E}\left[\left((1 - \beta_1)\varepsilon_1 + \varepsilon_2\right)^2\right].$$
Since $\varepsilon_1$ and $\varepsilon_2$ are independent random variables, the objective function transforms into
$$\min_{\beta_1} (1 - \beta_1)^2 \sigma^2 + \sigma^2,$$
whose optimal solution is $\beta_1^* = 1$. Then, we assume that our objective is to find $\beta_2$ such that $f(Z) = \beta_2 Z$ is a good estimator of Y; the objective function is
$$\min_{\beta_2} \mathbb{E}\left[(Y - \beta_2 Z)^2\right].$$
Here, denote $\varepsilon_4 = \varepsilon_1 + \varepsilon_2 \sim \mathcal{N}(0, 2\sigma^2)$, so that $Y = \varepsilon_4$ and $Z = \varepsilon_4 + \varepsilon_3$; thus
$$\min_{\beta_2} (1 - \beta_2)^2 \cdot 2\sigma^2 + \beta_2^2,$$
and the optimal solution is $\beta_2^* = \frac{\sigma^2}{\sigma^2 + 0.5}$. At last, we assume that our objective is to find both $\beta_1$ and $\beta_2$ such that $f(X, Z) = \beta_1 X + \beta_2 Z$ is a good estimator of Y; the objective function is
$$\min_{\beta_1, \beta_2} \mathbb{E}_e\left[(Y - f(X, Z))^2\right].$$
Denote $\mathbb{E}_e\left[Y - f(X, Z)\right]^2$ as $g(\beta_1, \beta_2)$. Since $\varepsilon_1$, $\varepsilon_2$, and $\varepsilon_3$ are independent random variables,
$$g(\beta_1, \beta_2) = \left((\beta_1 + \beta_2 - 1)^2 + (\beta_2 - 1)^2\right)\sigma^2 + \beta_2^2,$$
whose optimal solutions are $\beta_1^* = \frac{1}{\sigma^2 + 1}$ and $\beta_2^* = \frac{\sigma^2}{\sigma^2 + 1}$.

Figure 1: Train a time series forecasting model (TSModel) using observational data from city A, then apply it to forecast for cities B and C.

Figure 3: Visualization of the Geographic Distribution of Cities.

Figure 4: Temporal Invariance (Left Solid Line) and Spurious (Left Dotted Line) Relationship; Spatial Invariance (Right Solid Line) and Spurious (Right Dotted Line) Relationship.

Figure 6: Geographic Distribution of Monitoring Stations used in Our Experiments.

Table 1: Evaluation Results on Synthetic Data.

Table 2: Evaluation Results on Real-world Data (Metric: Mean Square Error).