CityCAN: Causal Attention Network for Citywide Spatio-Temporal Forecasting

Citywide spatio-temporal (ST) forecasting is a fundamental task for many urban applications, including traffic accident prediction, taxi demand planning, and crowd flow forecasting. The goal of this task is to generate accurate predictions concurrently for all regions within a city. Prior works devote great effort to modeling ST correlations. However, they often overlook the intrinsic correlations and inherent data distribution across the city, both of which are shaped by urban zoning and functionality, resulting in inferior performance on citywide ST forecasting. In this paper, we introduce CityCAN, a novel causal attention network, to collectively generate predictions for every region of a city. We first present a causal framework to identify useful correlations among regions, filtering out useless ones, via an intervention strategy. Within the framework, a Global Local-Attention Encoder, which leverages attention mechanisms, is designed to jointly learn local and global ST correlations among correlated regions. Then, we design a citywide loss that constrains the prediction distribution by incorporating the citywide distribution. Extensive experiments on three real-world applications demonstrate the effectiveness of CityCAN.


INTRODUCTION
To build Intelligent Transportation Systems (ITS), numerous sensors are widely placed in cities to capture traffic conditions [21], producing massive spatio-temporal (ST) data, as depicted in Fig. 1 (a). Foreseeing citywide ST data, such as traffic accidents, crowd flows, and crowd densities, is crucial for ITS development. It can facilitate various urban applications, such as assisting transportation managers in mitigating accidents [3,11,53], guiding car-sharing companies in vehicle allocation [12,18,61], and aiding drivers in selecting optimal routes [8,17,19].
One key characteristic of citywide ST data is spatio-temporal (ST) correlations, illustrated in Fig. 1 (a). Specifically, a target region's conditions are influenced by three dependencies: spatial (represented by the orange line), temporal (blue line), and spatio-temporal (ST) (green line). In the era of big data, researchers have proposed many data-driven methods, especially deep learning approaches, to capture these ST correlations. Most works [47,60] address spatial and temporal dependencies separately, neglecting the direct ST dependency. They typically capture spatial dependencies via convolutional neural networks (CNNs) [5,75] or graph neural networks (GNNs) [14,58], and exploit temporal dependencies with recurrent mechanisms (RNNs) [42,65] or attention mechanisms [68,69,73]. To fully exploit ST correlations, more recent approaches model the spatial and temporal dependencies simultaneously via local ST graphs [43], pyramid CNNs [29], or ST enhanced mechanisms [48]. Despite recent advances in ST modeling, two major challenges persist in forecasting citywide ST data. Challenge I: How to identify the useful correlations among regions across time? Studies have shown that urban zoning and functionality influence citywide ST correlations [7,30]. To incorporate urban functionality into citywide forecasting, previous studies [28,30] integrate geographical features (e.g., Points of Interest) as auxiliary inputs to ST networks. However, these methods rely on ST networks to learn spatial correlations, often leading to an over-generalized consideration of ST correlations across all regions. In reality, a region's future conditions are largely influenced by regions with useful correlations, rather than by every region in the city. For example, traffic in work areas is intrinsically correlated with residential areas due to daily commutes but shows limited correlation with agricultural zones. Thus, instead of broadly capturing all ST correlations, we emphasize identifying and utilizing useful correlations among regions to enhance citywide ST forecasting.

Challenge II: How to constrain the predictions to align with actual citywide distributions? The characteristics of urban zoning and functionality give citywide ST data a distinctive distribution [40,64], as illustrated in Fig. 1 (b-c). For instance, most traffic events (e.g., accidents) take place in urban areas and rarely occur in rural areas [3,47]; taxi flows are concentrated in downtown districts and sparse in other boroughs [16,77]. Previous works [5,67] prioritize regions with large event numbers, often employing region-wise losses, such as MSE and RMSE, to optimize neural networks. Some of them [41,47] even specifically amplify the impact of high-event regions through re-weighted loss strategies. However, these works force networks to minimize errors that are skewed towards regions with large event numbers, overlooking those with fewer events. Thus, they may cause the network to generate predictions that considerably deviate from actual citywide distributions.
In this paper, we propose a Causal Attention Network (CityCAN) for citywide ST forecasting. Since useful correlations are shaped by urban zoning, we argue that they manifest as invariant spatial correlations among regions over time. Thus, to address Challenge I, CityCAN employs a causal framework (depicted in Fig. 2) to learn the useful correlations while ignoring their complementary correlations (i.e., useless correlations). In CityCAN, regions with invariant/useless correlations are assigned to useful/useless superregions for the invariant/useless learning branches via two complementary superregion matrices. The useful correlations are then identified by pushing the predictions from the invariant learning branch and the intervention branch to be invariant, regardless of changes in the useless representations learned from the useless learning branch. To enhance ST modeling within these branches, we propose a novel Global Local-Attention Encoder (GLAE) as the ST Encoder, which jointly encodes spatial and temporal dependencies via local and global ST attentions. To tackle Challenge II, we design a citywide loss that penalizes the network from a global perspective, i.e., at the city level. Specifically, it constrains the spatial distribution of the predictions to align closely with the true spatial distribution by considering all regions in the city collectively. In other words, it measures the distribution similarity between the predictions and the future conditions across all regions. Overall, we summarize our contributions as follows:
• We propose CityCAN, a causal attention network for citywide ST forecasting, which leverages causal theory to uncover useful spatial correlations over time.
• We introduce a Global Local-Attention Encoder (GLAE) for better spatio-temporal correlation modeling.
• We design a citywide loss, which constrains the prediction distribution, leading to improved citywide ST forecasting.
• Experiments show CityCAN outperforms state-of-the-art methods on four datasets in three practical applications.

PRELIMINARIES
Definition 1 (Region): The area of interest, i.e., a city, is divided into $N$ regions based on their longitude and latitude [47]. These regions can be either regular or irregular in shape.

CITYCAN
In this section, we present our CityCAN, as shown in Fig. 3, which employs two strategies to tackle citywide ST forecasting: a causal framework to identify useful ST correlations (Section 3.1) and a citywide loss to constrain the prediction distribution (Section 3.2).

Causal Learning for Citywide Forecasting
Due to urban zoning and functionality, although ST correlations in citywide data can be dynamic over short periods (e.g., days), invariant spatial correlations among city regions (e.g., correlations between residential and school areas) do exist over time. We treat these invariant correlations as useful correlations. To identify them, inspired by classification tasks [39,44] that adopt causal theory to disentangle relevant and irrelevant features, we take a causal look at citywide ST forecasting and propose a causal learning strategy for this regression task.

Input Embedding first maps the input traffic features into high-level representations through a 2D convolution with kernel size (1, 1). To inject the space-time location of each region, we extend the positional embedding [71] to ST positional embeddings. Specifically, for a region $n$ at time $t$ (denoted as $(n, t)$), we define its absolute space-time position as $(n + t \cdot N)$, where the indices of $n$ and $t$ start from 0. By adding the representations and the ST positional embeddings, we obtain the final features $\mathbf{H} \in \mathbb{R}^{T \times N \times d_h}$ for the ILB and ULB to assign useful and useless superregions, where $d_h$ is the feature dimension.
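For concreteness, the following is a minimal PyTorch sketch of such an input embedding with absolute space-time positions; the module name `InputEmbedding` and its arguments are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class InputEmbedding(nn.Module):
    """Illustrative sketch: 1x1 conv feature lifting + learnable ST positional embedding."""
    def __init__(self, in_dim, d_h, num_regions, num_steps):
        super().__init__()
        self.conv = nn.Conv2d(in_dim, d_h, kernel_size=(1, 1))   # per-(region, time) projection
        self.pos = nn.Embedding(num_regions * num_steps, d_h)    # one embedding per (n, t) pair
        self.num_regions = num_regions

    def forward(self, x):
        # x: (batch, in_dim, T, N) raw traffic features
        h = self.conv(x)                                          # (batch, d_h, T, N)
        _, _, T, N = h.shape
        t_idx = torch.arange(T, device=x.device).view(T, 1)
        n_idx = torch.arange(N, device=x.device).view(1, N)
        st_idx = n_idx + t_idx * self.num_regions                 # absolute position n + t * N
        return h.permute(0, 2, 3, 1) + self.pos(st_idx)           # (batch, T, N, d_h)
```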
Invariant & Useless Learning Branch (ILB & ULB) work on learning the invariant and useless correlations among regions, respectively.They share the same architecture, which includes a Superregion Generator, a Global Local-Attention Encoder, a Superregion Partition, and a Decoder.Superregion Generator groups regions with correlations between each other into one superregion.To identity useful correlations and filtering out useless ones, we introduce two learnable superregion matrices, i.e., useful superregion matrix G  and useless superregion matrix G  .They are derived from the correlations observed among regions in the training data.To group correlated/uncorrelated regions to the useful/useless superregions, we apply the matrices to the original regions and their corresponding adjacent relationships: is the learned adjacent metrics of superregions,   =  ( / ) is total number of superregions in space dimension,  denotes the region reduction parameter, and T is the transposition operation.We ensure the useful and useless correlations are complementary to each other, and thus let superregion metrics satisfy G  + G  = 1, where 1 is the all-one matrix.
Thus, we obtain $N_g = N_s \times T$ useful/useless superregions with their corresponding useful/useless features $\tilde{\mathbf{H}}_{usf} / \tilde{\mathbf{H}}_{usl}$ and adjacency relationships $\tilde{\mathbf{A}}_{usf} / \tilde{\mathbf{A}}_{usl}$ in the ILB/ULB. In the ILB, the GLAE (our ST Encoder) takes $\tilde{\mathbf{H}}_{usf}$ and $\tilde{\mathbf{A}}_{usf}$ as inputs, while in the ULB it uses $\tilde{\mathbf{H}}_{usl}$ and $\tilde{\mathbf{A}}_{usl}$. GLAE captures ST correlations, either within the useful superregions in the ILB or within the useless ones in the ULB. It produces ST representations $\mathbf{h}_{usf}$ for the useful superregions and $\mathbf{h}_{usl}$ for the useless ones (more details in Section 3.1.2). These representations can easily be mapped back to their original regions via a Superregion Partition using the superregion matrices $\mathbf{G}_{usf}$ and $\mathbf{G}_{usl}$, i.e., $\hat{\mathbf{H}}_{usf} = \mathbf{G}_{usf}\mathbf{h}_{usf}$ and $\hat{\mathbf{H}}_{usl} = \mathbf{G}_{usl}\mathbf{h}_{usl}$, where $\hat{\mathbf{H}}_{usf}, \hat{\mathbf{H}}_{usl} \in \mathbb{R}^{N \times d_h}$. Then, given the useful ST representation $\hat{\mathbf{H}}_{usf}$ and the useless ST representation $\hat{\mathbf{H}}_{usl}$ of the original regions, we use fully-connected layers as the Decoder in the ILB and ULB to generate the predictions $\hat{\mathbf{Y}}$ and the useless predictions $\hat{\mathbf{Y}}_{usl} \in \mathbb{R}^{T' \times N \times d_c}$, where $d_c$ is a task-specific dimension of traffic features. Note that since GLAE is an attention-based encoder, the region reduction parameter $r$ also reduces its quadratic attention complexity in the number of regions.
Intervention Module (IM) aims to eliminate the influence of useless representations by providing implicit interventions at the latent level. Inspired by [44], we first generate interventions using a Random Shuffle operation $\sigma(\cdot)$, which randomly collects useless representations from different useless superregions. These random interventions are then concatenated with the useful representation $\mathbf{h}_{usf}$ to generate the intervened predictions $\hat{\mathbf{Y}}_{int} = \Phi\big(\mathbf{h}_{usf} \,\|\, \sigma(\mathbf{h}_{usl})\big)$ via the Superregion Partition with $\mathbf{G}_{usf}$ and the Decoder. We then encourage invariance between the intervened predictions $\hat{\mathbf{Y}}_{int}$ and the predictions $\hat{\mathbf{Y}}$ obtained from the ILB to mitigate the impact of useless features through an intervention loss:
$$\mathcal{L}_{int} = \big\| \hat{\mathbf{Y}}_{int} - \hat{\mathbf{Y}} \big\|_2^2, \qquad (3)$$
where $\sigma$ denotes the random shuffle operation, $\Phi$ represents the operations in the Superregion Partition and Decoder, and $\|$ refers to the concatenation function. In this way, CityCAN can fully exploit the useful correlations by ignoring the influence of useless correlations.
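The intervention step can be illustrated with the following sketch; the helper name `intervention_loss`, the per-superregion shuffle, and the MSE form of the invariance penalty are assumptions consistent with the description above.

```python
import torch
import torch.nn as nn

def intervention_loss(h_usf, h_usl, partition_decoder, y_pred):
    """Sketch of the implicit intervention (assumed shapes and decoder interface).

    h_usf, h_usl: (batch, S, d_h) useful / useless superregion representations
    partition_decoder: maps concatenated representations back to region-level predictions
    y_pred: (batch, T', N, d_c) predictions from the invariant learning branch
    """
    # Random Shuffle: permute useless representations across superregions
    perm = torch.randperm(h_usl.size(1), device=h_usl.device)
    h_usl_shuffled = h_usl[:, perm, :]

    # Concatenate useful features with shuffled useless features and decode
    y_int = partition_decoder(torch.cat([h_usf, h_usl_shuffled], dim=-1))

    # Encourage invariance of predictions under the intervention
    return nn.functional.mse_loss(y_int, y_pred)
```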
Losses for Causal Learning. In addition to the intervention loss $\mathcal{L}_{int}$, we introduce a supervised loss and a useless loss to disentangle the useful and useless features for unbounded traffic condition values. The supervised loss $\mathcal{L}_{sup}$ measures the error between the predictions generated from the useful representations in the ILB and the ground truth. Unlike the classification work [44], which uses a uniform classification loss to eliminate the influence of irrelevant patterns, we design a useless loss $\mathcal{L}_{usl}$ for regression tasks: it pushes the useless representation to be unnecessary by minimizing its value toward zero. The total loss for the causal learning strategy is then
$$\mathcal{L}_{causal} = \lambda_1 \mathcal{L}_{sup} + \lambda_2 \mathcal{L}_{usl} + \lambda_3 \mathcal{L}_{int}, \qquad (6)$$
where $\lambda_1, \lambda_2, \lambda_3$ are hyperparameters. In this way, CityCAN leverages the intervention strategy from causal theory, guiding the causal framework to identify useful correlations among regions.

Two components calibrate the attention in the GLAE based on citywide ST relationships:
• The ST influential mask $\mathbf{M}_{st}$ prevents future information leakage by zeroing the attention entries that represent influence from future time intervals [46]. However, unlike the masks in prior works [25,46], $\mathbf{M}_{st}$ is not a triangular matrix, as the superregions are arranged by space-time location.
• The spatial bias $\mathbf{PE}_{sp}$ enhances spatial relationships by setting temporal influence to zero and repeating the spatial influence given by the superregions' spatial relationships $\tilde{\mathbf{A}}$.
Using these components, we revise the conventional attention operation into the calibration attention (CA) operation, which injects $\mathbf{M}_{st}$ and $\mathbf{PE}_{sp}$ into the attention scores (a code sketch is provided below).

Local CAM (LCAM) captures local ST representations within a sliding window of $w$ time intervals, so each window contains $N_w = w \cdot N_s$ ST superregions. We apply the CAM to these superregions with their components $\mathbf{M}_{st}, \mathbf{PE}_{sp} \in \mathbb{R}^{N_w \times N_w}$ (see the calculations above). Given $T^{(l)}$ temporal intervals at the $l$-th block (the index of $l$ starts from 1), the window slides along the temporal dimension; since each superregion aggregates ST features from all other superregions in its window, we let the LCAM output only the features of the last time interval of each window, so each block reduces the temporal length by $w - 1$.

Global CAM (GCAM) learns global ST features from all superregions across time and space. Similar to the LCAM, we apply the CAM to all ST superregions of the $l$-th block with their corresponding calibration components $\mathbf{M}_{st}$ and $\mathbf{PE}_{sp}$ to obtain the final output of the GCAM.

Cropping Layer removes redundant features from the farthest superregions, as traffic conditions are primarily influenced by the most adjacent time intervals. The redundant information resides in the GCAM output because each superregion has already aggregated ST information from all other superregions. Thus, at the last block, we only keep the superregions at the last temporal interval, i.e., $\mathbf{h} = \tilde{\mathbf{h}}(:, -1) \in \mathbb{R}^{N_s \times d_h}$, and the total number of superregions is updated accordingly.
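To make the calibration concrete, the sketch below shows one plausible CA operation that combines an ST influential mask and a spatial bias; the exact combination rule used by CityCAN is not reproduced here, so this is an assumed form.

```python
import math
import torch

def calibration_attention(Q, K, V, M_st, PE_sp):
    """One plausible calibration attention (CA) operation (assumed combination rule).

    Q, K, V: (batch, heads, S, d) queries/keys/values over S spatio-temporal superregions
    M_st:    (S, S) ST influential mask, 0 where a future interval would influence the past
    PE_sp:   (S, S) spatial bias derived from superregion adjacency (0 for temporal pairs)
    """
    d = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d)        # raw attention scores
    scores = scores + PE_sp                                 # bias spatially connected superregions
    scores = scores.masked_fill(M_st == 0, float('-inf'))   # forbid future-to-past influence
    attn = torch.softmax(scores, dim=-1)
    return attn @ V
```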

Citywide Loss
Regions within a city, influenced by urban zoning and functionality, can be categorized into: (1) significant regions, characterized by frequent events that may require extra human interventions (e.g., traffic control in case of predicted accidents, or pre-allocation of taxis for areas with high predicted demand); and (2) trivial regions, which often have small or zero event numbers and do not require specific human interventions. Significant and trivial regions are unevenly distributed in a city. To ensure effective interventions without wasting resources, the network should accurately predict the targeted features for all regions in the city simultaneously. However, the causal loss (Eq. 6) emphasizes region-wise error, which can misalign the predictions with the city's actual spatial distribution. To address this issue, we introduce an auxiliary loss, named the citywide loss, to regularize the distribution gap between predictions and labels. Also, recognizing the heightened importance of significant regions, particularly in applications requiring costly human intervention, we first introduce a calibration prior to up-weight significant regions.
Calibration Prior leverages the citywide domain knowledge that the spatial distribution of traffic conditions remains similar over time. This knowledge holds because traffic is influenced by the city's geography and semantics. Thus, we can identify the significant regions via a region prior $\mathbf{P}_r$, obtained by summarizing the targeted condition features of each region over the observed (i.e., training) samples, and derive the calibration prior $\mathbf{P}_c$ from the region prior:
$$\mathbf{P}_r(n) = \sum_{i=1}^{I} \mathbf{X}_i(n),$$
where $\mathbf{P}_r \in \mathbb{R}^{N \times d_c}$, $I$ is the total number of training samples, $n$ indexes the spatial regions, $\mathbf{X}$ refers to the targeted traffic condition features (e.g., traffic accident risk, taxi flow, crowd density), and $\gamma$ is the calibration parameter that controls the selection of the most significant regions.
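A hedged sketch of the calibration prior follows; the quantile-based selection rule tied to $\gamma$ and the weighting form are one plausible instantiation, not the paper's exact rule.

```python
import torch

def calibration_prior(train_X, gamma):
    """Sketch of a calibration prior (the thresholding rule for gamma is an assumption).

    train_X: (I, N, d_c) targeted traffic features of all training samples
    gamma:   calibration parameter controlling how many significant regions are up-weighted
    """
    P_r = train_X.sum(dim=0)                        # region prior: per-region sum over training samples
    thresh = torch.quantile(P_r.flatten(), gamma)   # keep regions above the gamma-quantile as "significant"
    P_c = torch.ones_like(P_r)
    P_c = torch.where(P_r >= thresh, 1.0 + P_r / (P_r.max() + 1e-8), P_c)  # up-weight significant regions
    return P_c
```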
Citywide Loss with Calibration Prior enables the network to generate a prediction distribution that reflects the true citywide distribution, while penalizing errors in significant regions more heavily. We calculate the citywide loss based on a re-weighted cosine similarity between the calibrated predictions and labels:
$$\mathcal{L}_{cwl} = 1 - \frac{1}{T'} \sum_{t=1}^{T'} \frac{(\mathbf{P}_c \odot \hat{\mathbf{Y}}_t) \cdot (\mathbf{P}_c \odot \mathbf{Y}_t)}{\big\|\mathbf{P}_c \odot \hat{\mathbf{Y}}_t\big\| \, \big\|\mathbf{P}_c \odot \mathbf{Y}_t\big\| + \epsilon}, \qquad (11)$$
where $\epsilon$ is a hyperparameter to avoid division by zero. The re-weighted cosine similarity is applied to all regions collectively for each time interval, and thus constrains the spatial distribution of traffic conditions. It also provides proper focus on each and every region during training, as it re-weights the importance of regions across the city. Note that the calibration prior is applied to both the predictions and the labels, thereby preserving the distribution.
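The citywide loss can be sketched as follows; the averaging over time intervals and the exact normalization are assumptions consistent with the description above.

```python
import torch

def citywide_loss(y_pred, y_true, P_c, eps=1e-8):
    """Sketch of a citywide loss via re-weighted cosine similarity (assumed exact form).

    y_pred, y_true: (batch, T', N, d_c) predictions and labels
    P_c:            (N, d_c) calibration prior applied to both predictions and labels
    """
    p = (y_pred * P_c).flatten(start_dim=2)         # (batch, T', N*d_c) calibrated predictions
    q = (y_true * P_c).flatten(start_dim=2)         # calibrated labels
    cos = (p * q).sum(-1) / (p.norm(dim=-1) * q.norm(dim=-1) + eps)
    return (1.0 - cos).mean()                       # penalize distribution mismatch per time interval
```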

Losses for CityCAN
The final loss for CityCAN contains two parts, i.e., the causal loss in Eq. 6 and the citywide loss (CWL) in Eq. 11:
$$\mathcal{L} = \lambda_1 \mathcal{L}_{sup} + \lambda_2 \mathcal{L}_{usl} + \lambda_3 \mathcal{L}_{int} + \lambda_4 \mathcal{L}_{cwl},$$
where $\lambda_1, \lambda_2, \lambda_3, \lambda_4$ are hyperparameters. In this way, CityCAN considers both region-wise and citywide errors, and therefore ensures high predictive performance across all regions in the city.
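Putting the pieces together, a sketch of the overall objective, reusing the loss sketches above and the weights reported in the implementation details, might look like this; the MSE forms of the supervised and useless terms are assumptions.

```python
import torch

def citycan_loss(y_pred, y_usl, y_int, y_true, P_c,
                 lambdas=(0.4, 0.05, 0.05, 0.5)):
    """Sketch of the overall training objective (assumed loss forms)."""
    l_sup = torch.nn.functional.mse_loss(y_pred, y_true)   # supervised loss on useful predictions
    l_usl = y_usl.pow(2).mean()                             # push useless predictions toward zero (assumed form)
    l_int = torch.nn.functional.mse_loss(y_int, y_pred)     # intervention (invariance) loss
    l_cwl = citywide_loss(y_pred, y_true, P_c)              # citywide distribution loss (sketched above)
    l1, l2, l3, l4 = lambdas
    return l1 * l_sup + l2 * l_usl + l3 * l_int + l4 * l_cwl
```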

EXPERIMENTS

4.1 Experimental Settings

4.1.2 Tasks. We conduct experiments on three tasks: traffic accident risk forecasting, crowd flow forecasting, and crowd density forecasting:
Traffic accident risk forecasting task: following existing works [6,47], we not only predict the occurrence of traffic accidents but also estimate the risk value. The risk value should reflect both the frequency and severity of traffic accidents in a region, and is thus defined as the sum of the severities of all traffic accidents within the region. In the experiments, we forecast traffic accident risk for the next time interval ($T' = 1$) given historical observations of 6 time intervals ($T = 6$).
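As a small illustration of this definition, the snippet below sums accident severities per region; the record format is an assumption.

```python
from collections import defaultdict

def region_risk(accidents):
    """Illustrative risk value: sum of accident severities per region within one time interval.

    accidents: iterable of (region_id, severity) records
    """
    risk = defaultdict(float)
    for region_id, severity in accidents:
        risk[region_id] += severity      # risk reflects both frequency and severity
    return dict(risk)

# e.g., region_risk([(3, 2.0), (3, 1.0), (7, 3.0)]) -> {3: 3.0, 7: 3.0}
```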
Crowd flow forecasting task and crowd density forecasting task: following existing works [27,63], we predict the crowd inflow/outflow and the crowd density value for all regions in the city, respectively. In the experiments, we predict crowd flow and crowd density for the next 6 time intervals ($T' = 6$) given historical observations of 6 time intervals ($T = 6$).

Evaluation Metrics.
We follow previous studies [47,50] and evaluate our model with two metrics: Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE). Additionally, for a more comprehensive evaluation of traffic accident risk forecasting, we use the F1 score, F1@20, and F1@30 to measure the model's ability to indicate regions with risk, where F1@K denotes the F1 score computed over the top K regions with the highest accident risk values.
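A hedged sketch of F1@K is shown below; how the top-K restriction and the risk threshold are applied is an assumption rather than the paper's exact evaluation protocol.

```python
import numpy as np

def f1_at_k(y_true_risk, y_pred_risk, k):
    """Sketch of F1@K over the K regions with the highest ground-truth risk (assumed protocol).

    y_true_risk, y_pred_risk: (N,) risk values per region for one time interval
    """
    top_k = np.argsort(-y_true_risk)[:k]    # K regions with the highest true risk
    t = y_true_risk[top_k] > 0              # region truly has risk
    p = y_pred_risk[top_k] > 0              # model indicates risk
    tp = np.sum(t & p)
    precision = tp / max(np.sum(p), 1)
    recall = tp / max(np.sum(t), 1)
    return 2 * precision * recall / max(precision + recall, 1e-8)
```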

Implementation Details.
Our model is trained on a single RTX 2080 Ti GPU using the Adam optimizer with a learning rate of 0.001. We set the region reduction parameter $r$ to 4, the number of CA blocks to 4, the feature dimension $d_h$ to 128, and the number of attention heads $m$ to 4. We balance the region-wise and citywide losses by setting $\lambda_1, \lambda_2, \lambda_3, \lambda_4$ to 0.4, 0.05, 0.05, and 0.5, respectively. The calibration parameter $\gamma$ is dataset-specific. We adopt an early-stopping strategy with a maximum of 150 epochs for all experiments. For data partitioning, we use the last 8 weeks as the test set, the preceding 4 weeks as the validation set, and the remaining data as the training set.
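For reference, the reported hyperparameters can be collected into a simple configuration; the dictionary keys and the `make_optimizer` helper are illustrative only.

```python
import torch
import torch.nn as nn

# Training configuration collected from the implementation details above (illustrative keys).
config = {
    "reduction_r": 4,                     # region reduction parameter r
    "num_ca_blocks": 4,                   # number of CA blocks
    "d_h": 128,                           # feature dimension
    "num_heads": 4,                       # attention heads
    "lambdas": (0.4, 0.05, 0.05, 0.5),    # loss weights lambda_1..lambda_4
    "lr": 1e-3,
    "max_epochs": 150,                    # with early stopping
}

def make_optimizer(model: nn.Module):
    """Adam optimizer with the reported learning rate."""
    return torch.optim.Adam(model.parameters(), lr=config["lr"])
```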

Experimental Results & Analysis
4.2.1 Traffic Accident Risk Forecasting. Table 2 shows the prediction results of the baselines and our model for traffic accident risk on two datasets. Our model consistently surpasses the baselines on all datasets, both in accuracy of the risk value and in F1 for accident indication. Note that the most recent works, i.e., SDCAE and GSNet, cannot be adapted to the Chicago21 dataset. This is because these models, designed for regular regions, employ CNNs for spatial modeling, whereas the Chicago21 dataset reflects the natural community divisions of a city and contains irregular regions. These regions have a non-Euclidean structure, which cannot be modeled by CNNs. From the results, we observe that: (1) Our model outperforms the baselines by a large margin on accident indication. Specifically, it achieves a 2.07 times higher citywide F1 on average, demonstrating its superior ability to identify regions with and without accidents. (2) Baselines designed for traffic accident risk forecasting (i.e., SDAE, SDCAE, and GSNet) perform better than general ST forecasting models (i.e., STGCN and GWNET). This is because accidents are rare events, and general ST models, focusing on ST modeling, fail to consider the sparse data inherent in this task. (3) Surprisingly, HA outperforms all baselines on citywide F1 scores. We conjecture that deep models focus on significant regions and neglect trivial ones, thereby failing to identify trivial regions and obtaining lower citywide F1 scores. In contrast, HA only considers historical observations for each region, avoiding this issue. Our model, adopting the citywide loss, considers the prediction distribution and places proper focus on each region, resulting in the best performance. (4) Our model achieves the lowest MAE and RMSE, suggesting that it generates more accurate risk values, since its predictions are based on the useful correlations that truly impact future conditions.

Table 3 shows the results of the baselines and our model for crowd flow forecasting on the BikeNYC and Chicago21 datasets, and for crowd density forecasting on the Chicago22 dataset. Compared to the accident risk forecasting task, the data in these two tasks are not sparse.
The results show that our model consistently outperforms existing methods on all metrics. Specifically, it reduces MAE by 17.78% and RMSE by 14.47% on average over the three datasets, demonstrating that our proposed model is a general model that achieves better performance on various citywide tasks.

Fig. 4 visualizes the learned superregion matrices on the Chicago21 dataset. Regions with few correlations to other regions, e.g., Region 51, have higher weights in the useless superregion matrix. Our model also allows each region to be assigned to multiple superregions based on its correlations with different regions. For instance, Region 7 is also assigned to Superregion 9, as it also has correlations with Region 5 and Region 6. A region's weight to different superregions varies according to its importance within different correlations.
4.4.2 Citywide Distribution. Fig. 5 shows the distribution of the citywide data on the NYC13 dataset, obtained by averaging the traffic accident risk values over the temporal dimension and normalizing them over the spatial dimension. The visualization of training samples is derived from the training set, while the ground truth represents the forecasting targets in the test set. These two visualizations share a similar distribution, validating our assumption that the citywide distribution does not vary dramatically between the training and test sets. The forecast visualizations produced by SDAE, SDCAE, and GSNet illustrate these models' limitations in generating predictions that align with the actual citywide distribution. This is because they adopt region-wise losses, which neglect the citywide distribution. In particular, they fail to identify trivial regions, which consistently present zero values over time, which may lead to inefficient resource allocation. CityCAN considers both region-wise and citywide errors in forecasting, and thus successfully recognizes significant and trivial regions and achieves good forecasting results for all regions in the city.

Effects of Hyperparameters
In Fig. 6, we study the effects of hyperparameters in CityCAN on Chicago22 for crowd density forecasting. From the results, we observe: (1) CityCAN achieves the lowest RMSE when the region reduction parameter $r = 4$. A higher reduction rate results in fewer superregions, which allows the network to obtain summarized features from the original correlated regions and eliminates some redundant features. However, if the reduction rate is too high, a superregion may absorb many useless correlations, which negatively impacts performance. (2) CityCAN achieves the highest performance with 3 CA blocks, as reducing/increasing the number of blocks may lead to underfitting/overfitting. (3) Increasing the local window length negatively impacts performance, as it is important to consider temporal information from each time interval for short-term forecasting. (4) CityCAN performs best when the calibration parameter $\gamma = 0.7$, particularly in terms of RMSE, because it gives suitable weights to the significant regions with higher values.

RELATED WORK
Citywide Spatio-Temporal Forecasting is a crucial task for ITS and has attracted much attention over the years. Recent works have explored spatio-temporal (ST) networks for various citywide tasks, such as traffic accident prediction [37,49,65,76], traffic flow prediction [13,20,26,33,51], traffic speed prediction [9,70], and taxi demand prediction [1,62,78]. They have achieved superior performance over traditional statistical models, such as k-nearest neighbors [35] and ARIMA [45,54], thanks to their ability to model complex nonlinear ST correlations. More recent works [29,43,48] suggest that jointly learning the spatial and temporal dependencies enhances prediction performance. However, they still face challenges in considering the global ST correlations between irregular regions simultaneously. Meanwhile, attention-based models [56,71] have shown success in learning global dynamic dependencies in temporal forecasting. However, they focus on long-term multivariate time series and efficient attention mechanisms [23,25,32,36,52,72], ignoring spatial correlations and domain knowledge, and therefore cannot be applied to citywide ST forecasting directly. Moreover, the citywide distribution is relatively under-explored in citywide forecasting. Although some works [15,24,47,55,75] have studied zero-inflated data that are sparsely distributed across the city, they focus on region-wise optimization, which produces skewed predictions that do not align with the citywide distribution. In this work, we propose a novel attention-based ST encoder that incorporates citywide domain knowledge in a causal framework, together with a citywide loss that constrains the prediction distribution for better ST modeling.
Causal Learning [38,39] equips deep learning models with the ability to eliminate spurious correlations, leading to improved performance in various tasks. For example, CONTA [66] removes non-causal associations between image pixels and labels via backdoor adjustment in image segmentation tasks. Liu et al. [34] learn the causal invariance of motion representations by disentangling the physical laws, style confounders, and non-causal features for better motion prediction. CAL [44] boosts graph classification performance by applying causal interventions at the representation level. STNSCM [10] analyzes the causal relationship between input data and contextual conditions. Different from them, we propose a causal attention network that removes the useless correlations existing in ST data for citywide regression. There are also some concurrent studies on causal learning for ST data [59,74].

CONCLUSION AND FUTURE WORK
In this paper, we proposed CityCAN, a novel network for citywide ST forecasting. Leveraging causal theory, we designed a causal framework for citywide ST forecasting that applies implicit interventions at the latent level, enabling CityCAN to learn useful regional correlations. To jointly capture the ST correlations for both regular and irregular regions, we also introduced a Global Local-Attention Encoder in CityCAN, which captures both local and global ST correlations with a calibrated attention mechanism for better ST modeling. We then proposed a citywide loss, which considers the citywide distribution gap between the predicted and real conditions, to enable CityCAN to accurately predict the targeted features for all regions in the city at once. Extensive experimental results and analyses verified the effectiveness of CityCAN. CityCAN is not limited to citywide ST forecasting; in the future, we will evaluate it on other ST tasks, such as crime prediction. We will also explore different architectures for the invariant and useless learning branches to reduce computational costs.

Figure 1 :
Figure 1: (a) An illustration of spatio-temporal data and the three types of dependency. (b-c) Visualizations of citywide traffic accidents and crowd inflow in Chicago on July 1, 2021 at 14:00. Traffic conditions are influenced by urban zoning.

Figure 2 :
Figure 2: Insight of the network.

Figure 3 :
Figure 3: (a) Overview of CityCAN, where invariant and useless correlations among regions are disentangled via the causal learning strategy. (b) The Global Local-Attention Encoder (GLAE) learns useful and useless ST features based on invariant and useless correlations. It has the same architecture but different parameters in the invariant and useless learning branches.

Figure 4 :
Figure 4: (a) A visualization of the superregion matrices for crowd flow forecasting on the Chicago21 dataset. (b) A visualization of crowd inflow on the Chicago21 dataset, where the idx (index) in each region refers to its spatial index.

Figure 5 :
Figure 5: A visualization of the citywide spatial distribution of traffic accidents in the NYC13 dataset, along with the forecasting results of various methods.

Figure 6 :
Figure 6: Effects of hyperparameters on Chicago22 dataset in terms of MAE and RMSE.
Definition 2 (Traffic Condition & Traffic Features): Traffic conditions are traffic-related conditions, such as the risk level for traffic accident data, the inflow/outflow for crowd flow data, and the count for crowd density data. The features of these traffic conditions are the traffic features $\mathbf{X}$. Given a time interval $t$, $\mathbf{X}_t \in \mathbb{R}^{N \times d_c}$ denotes the traffic features of all $N$ regions.

[Fig. 3 components: Input Embedding, Superregion Generator, Global Local-Attention Encoder, Superregion Partition, and Decoder in the Invariant and Useless Learning Branches; a Random Shuffle operation in the Intervention Module; supervised loss L_sup and citywide loss L_cwl.]
Global Local-Attention Encoder (GLAE). As mentioned in Section 3.1.1, we propose the GLAE, shown in Fig. 3 (b), to capture the ST correlations in the ILB and ULB. Since vehicles can travel at varying speeds, either quickly or slowly, across a city, it is essential to model both local and global ST correlations. Inspired by recent studies that employ temporal attentions [56,71] to address short- and long-term temporal dependencies, GLAE extends temporal attention [56,71] to spatio-temporal attention for citywide ST forecasting. A GLAE consists of stacked Calibration Attention (CA) Blocks, each of which includes an auxiliary feature embedding, a local CA module (LCAM), a global CA module (GCAM), and a cropping layer. Since the architecture of GLAE is the same in the ILB and ULB, we omit the branch names in the following.

The auxiliary feature (Aux) embedding provides ST positional information and external factor information. It includes the ST positional (Pos) embedding and the external factor (Ext) embedding. To enhance the positional context, we introduce the ST Pos embedding to encode the space-time location, extending the canonical positional embedding [71] into an ST format. For a superregion $s = (n, t)$, the ST positional index of its location is $p = n + t \cdot N_s$, where $n$ and $t$ are the indices of the space location and time location, respectively. We obtain the ST Pos embedding $\mathbf{E}_{pos}$ by applying the learnable positional embedding [46] to the ST positional index. To inject external factors into the latent features, we use learnable embedding layers to encode the external factors and generate the Ext embedding $\mathbf{E}_{ext}$. Then, we obtain the Aux embedding by concatenating the ST Pos embedding and the Ext embedding, i.e., $\mathbf{E} = \mathbf{E}_{pos} \,\|\, \mathbf{E}_{ext} \in \mathbb{R}^{N_g \times d_h}$. After that, we obtain the inputs for subsequent ST correlation modeling by adding the Aux embedding $\mathbf{E}$ to the ST representations $\mathbf{h}$.

CA Module for Local & Global ST Learning. Conventional attentions [22,46,71] cannot be applied to ST data directly, as they primarily focus on the temporal dimension and ignore ST relationships. Citywide ST data exhibits two crucial ST relationships: (1) future traffic conditions cannot affect past conditions; (2) spatially connected superregions have a higher influence on each other. To incorporate these relationships into the attention operation, we propose a Calibration Attention Module (CAM), whose core is the calibration attention (CA) operation. The CA operation calibrates the attention based on citywide ST relationships via two components, the ST influential mask $\mathbf{M}_{st}$ and the spatial bias $\mathbf{PE}_{sp}$, described earlier. The queries, keys, and values $\mathbf{Q}, \mathbf{K}, \mathbf{V} \in \mathbb{R}^{N_g \times d_h}$ are obtained from the superregion features; note that reshaping and broadcasting are needed to retain the ST positional index of the superregions. In the CAM, $m$ CA operations are performed to attend to different ST patterns. The output of the CAM is the aggregated ST representation, obtained by applying layer normalization, a residual connection, and a fully-connected feed-forward network to the concatenation of the $m$ CA operations. Owing to its attention-based design, the CAM can capture both local and global ST features. To better learn them, we use the CAM in two different ways: the local CAM (LCAM) and the global CAM (GCAM).

Table 1 :
Statistics of the datasets.

Table 2 :
Model comparisons on the NYC13 and Chicago21 datasets for traffic accident forecasting, where '-' denotes that the model cannot be applied to the dataset.

Table 3 :
Model comparisons for crowd flow prediction on the BikeNYC and Chicago21 datasets, and crowd density prediction on the Chicago22 dataset.

Table 4 :
Ablation studies of CityCAN for crowd density prediction on Chicago22 dataset.

Table 4 details the effectiveness of each component in CityCAN. w/o C lacks the causal framework and adopts a single invariant learning module. w/o UIL is the model without both the useless loss (Eq. 5) and the intervention loss (Eq. 3). w/o UL excludes the useless loss. w/o IL omits the intervention loss. w/o CWL is the model without the citywide loss (Eq. 11). w/o CP does not have the calibration prior within the citywide loss. w/o CL removes the cropping layer. w/o LCAM and w/o GCAM are models without the LCAM and GCAM modules, respectively. Table 4 reveals: (1) The causal learning strategy improves performance, validating the effectiveness of identifying useful correlations. (2) Excluding the useless loss or the intervention loss hurts performance, indicating that useless correlations do exist and mislead the network. These two losses must work together to achieve causal learning, as the useless loss ensures zero influence of useless features, while the intervention loss guarantees invariant results after adding useless features. (3) Omitting the citywide loss degrades performance, demonstrating the importance of considering all city regions. (4) The higher RMSE of w/o CP indicates that the calibration prior provides useful domain knowledge to enhance performance in regions with high condition values. Although applying the calibration prior focuses more on RMSE, resulting in a slight increase in MAE, it is particularly useful in scenarios where high condition values are of high interest, such as traffic accident risk. Its influence can easily be removed by setting the calibration parameter to 1. (5) The inferior performance of w/o CL highlights that removing redundant information makes the model focus on the most important features. (6) w/o LCAM and w/o GCAM show inferior performance, demonstrating that capturing both local and global ST correlations is necessary for citywide ST forecasting.