Graph Neural Processes for Spatio-Temporal Extrapolation

We study the task of spatio-temporal extrapolation, which generates data at target locations from surrounding contexts in a graph. This task is crucial because sensors that collect data are sparsely deployed, resulting in a lack of fine-grained information due to high deployment and maintenance costs. Existing methods either use learning-based models such as Neural Networks or statistical approaches such as Gaussian Processes for this task. However, the former lack uncertainty estimates and the latter fail to capture complex spatial and temporal correlations effectively. To address these issues, we propose Spatio-Temporal Graph Neural Processes (STGNP), a neural latent variable model that provides both capabilities simultaneously. Specifically, we first learn deterministic spatio-temporal representations by stacking layers of causal convolutions and cross-set graph neural networks. Then, we learn latent variables for target locations through vertical latent state transitions along the layers and obtain extrapolations. Importantly, during the transitions, we propose Graph Bayesian Aggregation (GBA), a Bayesian graph aggregator that aggregates contexts considering uncertainties in context data and graph structure. Extensive experiments show that STGNP has desirable properties such as uncertainty estimates and strong learning capabilities, and achieves state-of-the-art results by a clear margin.


INTRODUCTION
Spatio-temporal graph data, such as air quality readings [24,51], traffic flow data [23,31,46], and climate data [25] reported by deployed sensors, are ubiquitous in the physical world. Analyzing such data fosters a variety of applications for smart cities, enhancing people's lives and helping decision-making [15]. Ideally, the data should be fine-grained to realize its benefits, but it is often impractical to deploy and maintain sufficient sensors in an area of interest because of the high expenditure [36,43]. For example, a professional station to measure air quality can cost around $200,000 to build and $30,000 per year to maintain [50]. As a result, many applications have to rely on sparse data, leading to suboptimal solutions. Thus, finding ways to approximate the data in areas with no sensors has become a pressing issue.
In this paper, we study the problem of spatio-temporal extrapolation, which involves estimating a function that predicts the spatio-temporal data at target locations in areas of interest based on surrounding context nodes and related exogenous covariates organized in a graph structure, as shown in Figure 1(a). As an example, we use air quality extrapolation [50], illustrated in Figure 1(b). We utilize the air quality index from context sensors to extrapolate the indexes at the target locations, taking into account covariates such as weather conditions that can also affect air quality.
To achieve our goal, one pivotal property that needs to be considered is spatio-temporal correlations, i.e., the spatial dependencies within the graph and the temporal dependencies along the time axis. To capture these correlations, Neural Networks (NNs), especially Spatio-Temporal Graph Neural Networks (STGNNs), have become a favorite due to their impressive learning capability [12,46]. However, NNs have two main limitations: (i) They lack the sought-after ability to estimate uncertainties. Sensors often produce signals with different levels of ambiguity, such as noisy or missing observations caused by unpredictable factors such as network disruptions. Explicitly capturing these uncertainties has proved to be a boon for making reliable decisions [42,44]. However, most NNs are deterministic and unable to account for these uncertainties. (ii) Their ability to generalize to new data is limited. NNs require a large amount of training data to parameterize a function. Nevertheless, their reliance on parameters hinders their adaptability in unpredictable real-world environments, as the model needs to be retrained whenever the environment changes. Additionally, their sensitivity to hyperparameters demands a hyperparameter search to attain optimal performance.
The limitations of NNs have led researchers to draw inspiration from probability models, with Gaussian Processes (GPs) being one potential approach [35]. GPs define a stochastic process in which the spatio-temporal relations are modeled by various kernels [32]. Their Bayesian principle and non-parametric nature enable them to handle uncertainty and generalize well to a wide range of functions [29]. However, GPs suffer from the limited expressivity of kernels, which can be disadvantageous. To address these issues, Neural Processes (NPs) [9] have emerged as a promising approach. NPs learn to construct stochastic processes parameterized by neural networks, in which an aggregator is introduced to integrate contexts. By combining the strengths of both NNs and GPs, NPs offer a compelling framework for spatio-temporal extrapolation.
Unfortunately, NPs cannot be applied directly to spatio-temporal graph data due to the following factors: (i) They struggle to learn temporal dependencies effectively. Existing NPs [33,38] utilize latent state transitions to capture temporal relations recurrently. However, transitions tend to focus only on learned latent variables, disregarding the contextual information at later recurrent steps. This phenomenon, known as transition collapse, can impede learning over long sequences [38]. (ii) They are incapable of modeling spatial relationships in a graph. Existing NPs' aggregation operations [11,16,41] lack the ability to model complex spatial relations defined in a graph. In addition, their deterministic nature makes them suboptimal for aggregating data with varying levels of ambiguity, such as missing values and noise in Figure 1(a).
To tackle these challenges, we propose Spatio-Temporal Graph Neural Processes (STGNP) for spatio-temporal extrapolation over graphs. STGNP has two stages: the first stage leverages a deterministic network to learn spatio-temporal representations of nodes. Instead of relying on a recurrent structure, the temporal dynamics are modeled by stacking convolution layers in a bottom-up way [1], while the spatial relations are captured by cross-set graph neural networks. In the second stage, we employ state transitions to aggregate latent variables of target nodes following a top-down manner. Here, the transition assumes horizontal time independence and incorporates long-range temporal evolution from the upper layers. As the number of transitions only relates to the number of stacked layers, which is much smaller than the sequence length, our model naturally exhibits resistance to transition collapse.
For the aggregator in each transition, we claim that different context nodes have different levels of importance. Motivated by [41], we propose Graph Bayesian Aggregation (GBA), which directly aggregates distributions over latent variables regulated by the graph. Intuitively, a context node contributes less to the latent distribution if it is far from the target location or exhibits a high degree of ambiguity recognized by the module. This design explicitly incorporates the graph structure into the NP aggregator and enhances the model's capability to capture node uncertainties.
In summary, our main contributions are as follows:
• We propose Spatio-Temporal Graph Neural Processes, a neural latent variable model that captures spatio-temporal correlations over graphs while providing uncertainty estimates for extrapolation.
• We propose Graph Bayesian Aggregation, a Bayesian graph aggregator that integrates context nodes while accounting for uncertainties in the context data and the graph structure.
• We conduct extensive experiments on real-world datasets, showing that STGNP achieves state-of-the-art accuracy together with high-quality uncertainty estimates.

PRELIMINARIES
We first define concepts and notations of spatio-temporal data on graphs. Then, we introduce the basic concepts of neural processes.

Definitions and Notations
Definition 1 (Graph). We represent nodes as a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the vertex set and $\mathcal{E}$ is the edge set defining the weights between nodes. The $k$-hop neighborhood of a node $v \in \mathcal{V}$, denoted by $\mathcal{N}_k(v)$, is the set of nodes that are reachable from $v$ within $k$ steps. Based on $\mathcal{E}$ and $\mathcal{N}_k$, a $k$-hop adjacency matrix $A^k$ is derived to measure the non-Euclidean distances between connected neighbors.

Definition 2 (Spatio-Temporal Data). Observed signals are retrieved from each node in the graph. We use $X_i = (x_{i,1}, \ldots, x_{i,t}, \ldots, x_{i,T}) \in \mathbb{R}^{T \times d_x}$ to denote the data of node $i$ measured over a time window $T$, where $d_x$ is the number of features. $X = (X_1, \ldots, X_i, \ldots, X_N) \in \mathbb{R}^{N \times T \times d_x}$ denotes the signal tensor of all nodes over the window $T$, where $N$ is the total number of observable nodes in the graph.

Definition 3 (Exogenous Covariates). Exogenous covariates benefit the learning process as they usually have notable correlations with node data. These covariates are readily available from different sources. For instance, weather conditions can affect air pollutant data, and they are collected from weather stations. We denote these factors as a tensor $C \in \mathbb{R}^{N \times T \times d_c}$ and consider them explicitly.

Neural Processes
Neural Processes [9] construct stochastic processes that map an input $x \in \mathbb{R}^{d_x}$ to an output $y \in \mathbb{R}^{d_y}$, conditioned on a context set $\mathcal{C} = \{(x_i, y_i)\}_{i=1}^{M}$ of observed input-output pairs. NPs follow the same principle as Gaussian Processes except that the stochastic process is learned implicitly by neural networks. Specifically, NPs define a conditional latent variable framework, where the distribution of a latent variable $z$ is described by a learned conditional prior $p(z \mid \mathcal{C})$ from the context set. Then, given the inputs $X_{\mathcal{D}} = \{x_j\}_{j=1}^{N}$ of a target set $\mathcal{D}$, a likelihood module $p(Y_{\mathcal{D}} \mid X_{\mathcal{D}}, z)$ is trained to predict the corresponding output variables $Y_{\mathcal{D}}$. The following posterior predictive likelihood formulates the generative process of NPs:
$$p(Y_{\mathcal{D}} \mid X_{\mathcal{D}}, \mathcal{C}) = \int p(Y_{\mathcal{D}} \mid X_{\mathcal{D}}, z)\, p(z \mid \mathcal{C})\, dz. \quad (1)$$
In practice, NPs assume the target variables are independent, decomposing the likelihood such that $p(Y_{\mathcal{D}} \mid X_{\mathcal{D}}, z)$ is factorized as $\prod_{j=1}^{N} p(y_j \mid x_j, z)$, where $N = |\mathcal{D}|$. NPs organize a meta-learning framework where each pair $\{\mathcal{C}, \mathcal{D}\}$ constructs its own stochastic process, making the model less dependent on parameters and endowing it with strong generalizability. Note that the conditions $\mathcal{C}$ in the prior should be aggregated by a permutation-invariant function (e.g., mean, attention) to define a stochastic process, according to the Kolmogorov Extension Theorem [30]. As the marginalization of the latent variable $z$ is normally intractable, the model is usually trained either by Monte-Carlo (MC) sampling to estimate Equation 1 directly [8] or by variational approximation, maximizing the evidence lower bound (ELBO) [9]:
$$\log p(Y_{\mathcal{D}} \mid X_{\mathcal{D}}, \mathcal{C}) \geq \mathbb{E}_{q(z \mid \mathcal{C} \cup \mathcal{D})}\Big[\sum_{j=1}^{N} \log p(y_j \mid x_j, z)\Big] - \mathrm{KL}\big(q(z \mid \mathcal{C} \cup \mathcal{D}) \,\|\, p(z \mid \mathcal{C})\big), \quad (2)$$
where $q(z \mid \mathcal{C} \cup \mathcal{D})$ and $p(y_j \mid x_j, z)$ are the approximated posterior and the likelihood learned by neural networks. As the true conditional prior $p(z \mid \mathcal{C})$ is also intractable, the same module $q(\cdot)$ is employed to approximate it, i.e., $p(z \mid \mathcal{C}) \approx q(z \mid \mathcal{C})$.
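The permutation-invariance requirement above can be made concrete with a minimal NumPy sketch: the context pairs are encoded independently and pooled with a mean, so any ordering of the context set yields the same representation. The `encode` function here is a toy stand-in, not the paper's encoder.

```python
import numpy as np

def aggregate_context(xs, ys, encode):
    """Encode each (x, y) context pair and pool with a mean.

    The mean is permutation-invariant, so any ordering of the context
    set yields the same representation r -- a requirement for the
    aggregate to define a valid stochastic process."""
    r_i = np.stack([encode(x, y) for x, y in zip(xs, ys)])
    return r_i.mean(axis=0)

# Toy encoder (hypothetical): just concatenate input and output.
encode = lambda x, y: np.concatenate([x, y])

xs = [np.array([0.1]), np.array([0.5]), np.array([0.9])]
ys = [np.array([1.0]), np.array([2.0]), np.array([3.0])]

r = aggregate_context(xs, ys, encode)
r_perm = aggregate_context(xs[::-1], ys[::-1], encode)
assert np.allclose(r, r_perm)  # order of the context set does not matter
```

A sum or an attention-weighted average would satisfy the same invariance; a concatenation or an RNN over the context set would not.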

METHODOLOGY
In this section, we propose STGNP, a neural latent variable model for spatio-temporal extrapolation. As its graphical model in Figure 2 illustrates, the key pipeline is to learn deterministic representations (STRL) and stochastic latent variables (GBA) in two stages. We first introduce the problem of spatio-temporal extrapolation. Then, we describe the deterministic stage for learning spatio-temporal representations and derive Graph Bayesian Aggregation to aggregate contexts in the stochastic stage. Finally, we introduce the generative process and the optimization procedure.

Problem (Spatio-Temporal Extrapolation). The goal is to extrapolate the data $Y_{\mathcal{D}}$ of the nodes in a target set $\mathcal{D}$, where $|\mathcal{D}|$ is the total number of target nodes, given the covariates $C_{\mathcal{D}}$, the context set $\mathcal{C}$, and the adjacency matrix $A$. In this paper, we use subscripts $t$ and $c$ to index target and context nodes respectively, and adopt the terms location, node, and sensor interchangeably. Note that as target nodes are independent, we only discuss a single target node $t$ in the following sections for brevity.

Spatio-Temporal Representation Learning
The deterministic stage has three building blocks to capture spatial and temporal correlations: learnable target token, dilated causal convolution, and cross-set graph convolution. We introduce them individually and then illustrate the overall learning framework.
Learnable Target Token. Our network takes sensor data and covariates as inputs; however, the data $X_t$ of a target node is unknown. Existing methods typically preprocess it by either filling it with zeros [2,45] or employing linear interpolation to estimate its values [14]. However, using zero to represent target variables can be confusing, as it may be interpreted as the semantic zero within the dataset. Moreover, interpolation incurs large errors, which also hampers model performance. Inspired by Masked AutoEncoder [13], we leverage a shared learnable token $e \in \mathbb{R}^{c_0}$ as the embedding for target nodes, while using an embedding layer with parameters $W_e \in \mathbb{R}^{d_x \times c_0}$ to obtain embeddings for context nodes. The token is optimized by the network to identify a position in the feature space that represents the target nodes, which avoids inferior preprocessing.
Cross-Set Graph Convolution Layer. Graph convolution is a seminal operation for learning spatial relations in a graph structure. Existing GCN methods typically treat dependencies over all nodes equally [14,45]. However, in our task, relations across the target and context sets take priority due to their direct influence on the target nodes. Based on this insight, we argue that disregarding internal relations within the two sets does not adversely affect performance, and propose cross-set graph convolution (CSGCN), in which only dependencies across the sets $\mathcal{C}$ and $\mathcal{D}$ are captured. Specifically, given the representation $H_t^{\ell-1} \in \mathbb{R}^{T \times c_{\ell-1}}$ of the target node at layer $\ell-1$, we update it from its neighbors $H_c^{\ell-1}$ in the context set up to $K$ hops, with each neighbor weighted by the adjacency weight $A^k_{t,c}$:
$$H_t^{\ell} = \sigma\Big(\sum_{k=1}^{K} \sum_{c \in \mathcal{N}_k(t)} A^k_{t,c}\, H_c^{\ell-1} W_k^{\ell}\Big), \quad (3)$$
where $W_k^{\ell} \in \mathbb{R}^{c_{\ell-1} \times c_{\ell}}$ are learnable parameters and $\mathcal{N}_k(t)$ is the set of $k$-hop neighbors of the target node $t$ indexed from $A^k$. Note that when $\ell = 0$, $H_t^0$ is the broadcast target token and $H_c^0$ is the context embedding. Compared to traditional GCNs, CSGCN offers improved efficiency, reducing the computational complexity from $O((N_c + N_t)^2)$ to $O(N_c \times N_t)$. Despite this efficiency gain, it maintains strong learning capabilities, as demonstrated in the experiments.
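The cross-set update can be sketched in NumPy as follows. The tensor shapes and the choice of $\tanh$ as the nonlinearity are illustrative assumptions; only target-from-context messages are computed, which is what yields the $O(N_t \times N_c)$ cost.

```python
import numpy as np

def cross_set_graph_conv(H_c, A, Ws):
    """One cross-set graph convolution step (minimal sketch).

    H_c : (Nc, T, C_in)    context-node representations
    A   : (K, Nt, Nc)      k-hop adjacency weights, target x context only
    Ws  : (K, C_in, C_out) per-hop learnable weights

    Only target<-context dependencies are aggregated, giving
    O(Nt * Nc) cost instead of O((Nt + Nc)^2) for a full GCN."""
    K, Nt, Nc = A.shape
    out = 0.0
    for k in range(K):
        # weighted sum over context neighbours, then per-hop linear map
        msg = np.einsum('tc,cxf->txf', A[k], H_c)  # (Nt, T, C_in)
        out = out + msg @ Ws[k]                    # (Nt, T, C_out)
    return np.tanh(out)

rng = np.random.default_rng(0)
H_c = rng.normal(size=(5, 12, 8))  # 5 context nodes, T=12, 8 channels
A = rng.random(size=(2, 3, 5))     # 2 hops, 3 target nodes
Ws = rng.normal(size=(2, 8, 16))
H_t = cross_set_graph_conv(H_c, A, Ws)
assert H_t.shape == (3, 12, 16)
```

In a full model the context nodes would receive a symmetric update from the target set; the sketch shows the target side only.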
Dilated Causal Convolution Layer. We employ dilated causal convolutions (DCConv) [48] to capture temporal dependencies. Unlike a recurrent structure, it learns temporal relations over long sequences by stacking causal layers. This approach proves advantageous as the number of layers is considerably smaller than the length of the sequence, mitigating the issue of transition collapse in the later stage. Specifically, at time $t$, a 1D causal convolution learns a temporal representation $h^{\ell}_{i,t} \in \mathbb{R}^{c_{\ell}}$ for node $i$ through a gated activation:
$$h^{\ell}_{i,t} = \tanh\big(H^{\ell-1}_i \star \mathcal{K}^{\ell}_f\big)_t \odot \sigma\big(H^{\ell-1}_i \star \mathcal{K}^{\ell}_g\big)_t, \quad (4)$$
where $H^{\ell-1}_i \in \mathbb{R}^{T \times c_{\ell-1}}$ is the node representation at the previous layer, $\star \mathcal{K}^{\ell}$ denotes a DCConv with kernel size $s \times c_{\ell-1} \times c_{\ell}$, and $\odot$ is the Hadamard product. The dilation factor $d$ is initialized to 1 with an exponentially increasing rate of 2 [40], and zero-padding is used to ensure the inputs and outputs have the same time length $T$.
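A minimal NumPy sketch of the gated dilated causal convolution follows; the explicit loops are for clarity, not speed, and the kernel layout is an assumption. Left zero-padding preserves the sequence length $T$ while keeping the operation causal.

```python
import numpy as np

def dilated_causal_conv(H, Kf, Kg, dilation):
    """Gated dilated causal convolution over one node's sequence.

    H      : (T, C_in)        input representation
    Kf, Kg : (s, C_in, C_out) filter and gate kernels
    Left zero-padding keeps the output length T and ensures that
    h_t depends only on inputs at times <= t (causality)."""
    s, C_in, C_out = Kf.shape
    T = H.shape[0]
    pad = (s - 1) * dilation
    Hp = np.concatenate([np.zeros((pad, C_in)), H], axis=0)
    f = np.zeros((T, C_out))
    g = np.zeros((T, C_out))
    for t in range(T):
        for j in range(s):
            x = Hp[t + pad - j * dilation]  # taps at t, t-d, t-2d, ...
            f[t] += x @ Kf[s - 1 - j]
            g[t] += x @ Kg[s - 1 - j]
    return np.tanh(f) * (1.0 / (1.0 + np.exp(-g)))  # tanh(f) * sigmoid(g)

rng = np.random.default_rng(1)
H = rng.normal(size=(16, 4))
Kf = rng.normal(size=(3, 4, 8))
Kg = rng.normal(size=(3, 4, 8))
out = dilated_causal_conv(H, Kf, Kg, dilation=2)
assert out.shape == (16, 8)
```

Causality can be checked directly: perturbing the input at time 10 leaves all outputs before time 10 unchanged.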
Learning Framework. As shown in Figure 3, each layer of the network first applies a CSGCN to model spatial relations, followed by a DCConv to capture temporal dependencies for node representations. Additionally, the features of covariates, learned through a 1 × 1 convolution, are incorporated into the node representations. Note that we do not explicitly involve covariates in CSGCNs and DCConvs, as they may exhibit different spatio-temporal dependencies or even lack relations in certain scenarios [39]. By stacking layers with skip connections, the representations of the target node are obtained in which each layer maintains temporal dependencies at various scales, with upper layers capturing long-range relations and lower layers comprising fine-grained information. Thus, the stochastic stage is able to access different scales through its hierarchical dependency.

Graph Bayesian Aggregation
The core component of the stochastic stage is our proposed Graph Bayesian Aggregation, which aggregates information from context nodes and derives latent variables $z_t^{\ell} \in \mathbb{R}^{T \times c_z}$ describing stochastic processes over target nodes. Figure 4 illustrates the aggregation process. Based on Bayes' theorem [3], we assume a prior $p(z_t^{\ell})$ over the target node. Then, for each context node $c$, a latent observation model $p(r_c^{\ell} \mid z_t^{\ell}, A_{t,c})$ is derived in which the mean conditions on a linear transformation of $z_t^{\ell}$ by $A_{t,c}$. Thus, once $r_c^{\ell}$ is observed, the latent variable $z_t^{\ell}$ is updated through its posterior:
$$q(z_t^{\ell} \mid \mathcal{C}) \propto p(z_t^{\ell}) \prod_{c \in \mathcal{N}(t)} p(r_c^{\ell} \mid z_t^{\ell}, A_{t,c}), \quad (5)$$
where we suppose the latent observations are independent and only consider the 1-hop neighbors to simplify the computation. The prior $p(z_t^{\ell})$ follows a factorized Gaussian:
$$p(z_t^{\ell}) = \mathcal{N}\big(\mu_{z_t}, \operatorname{diag}(\sigma^2_{z_t})\big), \quad (6)$$
where $\mu_{z_t}$ and $\sigma^2_{z_t}$ are the mean and variance learned by $\mathrm{Enc}_z(\cdot)$, which will be discussed in the following section. For the latent observation model, we also impose a factorized Gaussian. Note that instead of learning its mean, we learn the observation $r_c$ directly together with its variance $\sigma^2_{r_c}$, which guarantees valid Gaussian conditioning during inference [41]:
$$p(r_c^{\ell} \mid z_t^{\ell}, A_{t,c}) = \mathcal{N}\big(A_{t,c} z_t^{\ell}, \operatorname{diag}(\sigma^2_{r_c})\big), \quad (7)$$
where $r_c$ and $\sigma^2_{r_c}$ are parameterized by $\mathrm{Enc}_r(\cdot)$. The Gaussian assumption avoids an intractable computation of the marginal likelihood in the posterior's denominator. In fact, we can calculate the posterior easily by Gaussian conditioning in a closed-form solution (see proof in Appendix B):
$$\bar{\sigma}^{-2}_{z_t} = \sigma^{-2}_{z_t} + \sum_{c \in \mathcal{N}(t)} A^2_{t,c}\, /\, \sigma^2_{r_c}, \quad (8)$$
$$\bar{\mu}_{z_t} = \mu_{z_t} + \bar{\sigma}^{2}_{z_t} \odot \sum_{c \in \mathcal{N}(t)} A_{t,c}\,(r_c - A_{t,c}\,\mu_{z_t})\, /\, \sigma^2_{r_c}, \quad (9)$$
where $\bar{\sigma}^2_{z_t}$ and $\bar{\mu}_{z_t}$ are the updated parameters and the operations are conducted in an element-wise manner. With factorization, the conditioning is efficient to compute, avoiding costly matrix inversion.
In addition, all the calculations are differentiable so that GBA can be optimized in an end-to-end way by stochastic gradient descent.
There are significant implications behind this aggregation. First, it incorporates the graph structure into the model by applying a linear transformation through the adjacency matrix, which is equivalent to GCNs when the uncertainty terms are neglected. This suggests that GBA has learning abilities similar to GCNs. Second, the aggregation takes the uncertainties of nodes into consideration, a compelling property compared with previous methods. From the equations, the contribution of a context node is determined by its learned observation $r_c$, the variance $\sigma^2_{r_c}$, and the distance weight $A_{t,c}$. Specifically, Equation 8 encodes the reasonable assumption that a context node located at a greater distance provides less confident information. Additionally, Equation 9 implies that a node's contribution diminishes when its associated variance $\sigma^2_{r_c}$ is large, signifying higher ambiguity. This theoretically underpins the model's robustness when dealing with noisy data.
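The closed-form Gaussian conditioning described above can be sketched in NumPy. The encoders producing the prior parameters and the latent observations are assumed given; the sketch only performs the aggregation step, and shows how a far-away (small $A_{t,c}$) or ambiguous (large $\sigma^2_{r_c}$) context node is down-weighted.

```python
import numpy as np

def gba_update(mu_z, var_z, r, var_r, a):
    """Graph Bayesian Aggregation: closed-form Gaussian conditioning.

    mu_z, var_z : (D,)    factorized Gaussian prior over the target latent
    r, var_r    : (Nc, D) learned latent observations and their variances
    a           : (Nc,)   adjacency weights from context nodes to the target

    A context node contributes precision a_c^2 / var_r_c, so far-away
    (small a_c) or ambiguous (large var_r_c) nodes are down-weighted."""
    prec = 1.0 / var_z + np.sum(a[:, None] ** 2 / var_r, axis=0)
    var_post = 1.0 / prec
    mu_post = var_post * (mu_z / var_z + np.sum(a[:, None] * r / var_r, axis=0))
    return mu_post, var_post

mu_z, var_z = np.zeros(2), np.ones(2)
r = np.array([[1.0, 1.0], [1.0, 1.0]])
var_r = np.array([[0.5, 0.5], [5.0, 5.0]])  # second node is far more ambiguous
a = np.array([1.0, 1.0])
mu_post, var_post = gba_update(mu_z, var_z, r, var_r, a)
assert np.all(var_post < var_z)  # observations always shrink uncertainty
```

The precision form used here is algebraically identical to the residual form of Equations 8 and 9, and since all operations are element-wise, no matrix inversion is needed.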

Generative Process
The target latent variable $z_t^{\ell}$ depends on its representation $H_t^{\ell}$ and those of the context nodes $H_c^{\ell}$. Longer-range temporal dependencies are transmitted by conditioning $z_t^{\ell}$ on $z_t^{\ell+1}$, forming a vertical time hierarchy. In practice, given $H_t^{\ell}$ and a sample from $q(z_t^{\ell+1})$, the network $\mathrm{Enc}_z(z_t^{\ell+1}, H_t^{\ell})$ first learns a prior $p(z_t^{\ell})$ over the target node in Equation 6. Then, the deterministic representations of the context nodes are adopted to learn their latent observations by $\mathrm{Enc}_r(H_c^{\ell})$ in Equation 7. Next, the parameters of $q(z_t^{\ell})$ are updated according to Equations 8 and 9. After the bottom layer $\ell = 1$, a likelihood model concatenates the samples $z_t = (z_t^1, \ldots, z_t^L)$ from all layers and the target node's exogenous covariates $C_t$ to predict its extrapolations $Y_t$. Formally, the generative process of STGNP is summarized as:
$$p(Y_t, z_t \mid \mathcal{C}, C_t, A) = p(Y_t \mid z_t^{1:L}, C_t) \prod_{\ell=1}^{L} p(z_t^{\ell} \mid z_t^{\ell+1}, \mathcal{C}, A), \quad (10)$$
where the first term is the likelihood and the second term denotes the conditional prior aggregated through GBA. Note that at the top layer $\ell = L$, $z_t^{L+1} = \mathbf{0}$, and the likelihood is assumed to be a factorized Gaussian distribution.

Inference and Optimization
Typically, closed-form solutions for non-linear transitions and likelihoods do not exist; thus we train the model through variational approximation. The approximated posterior $q(z_t \mid \mathcal{C} \cup \mathcal{D}, A)$ has the same structure as the conditional prior but takes the target node data $X_t$ as input. Then the deterministic and stochastic modules can be optimized together by the evidence lower bound (ELBO):
$$\log p(Y_t \mid \mathcal{C}, C_t, A) \geq \mathbb{E}_{q(z_t \mid \mathcal{C} \cup \mathcal{D}, A)}\big[\log p(Y_t \mid z_t^{1:L}, C_t)\big] - \mathrm{KL}\big(q(z_t \mid \mathcal{C} \cup \mathcal{D}, A) \,\|\, p(z_t \mid \mathcal{C}, A)\big). \quad (11)$$
Given the hierarchical structure of Equation 10, the Kullback-Leibler divergence term can be further decomposed as:
$$\mathrm{KL} = \sum_{\ell=1}^{L} \mathbb{E}_{q(z_t^{\ell+1})}\Big[\mathrm{KL}\big(q(z_t^{\ell} \mid z_t^{\ell+1}, \mathcal{C} \cup \mathcal{D}, A) \,\|\, p(z_t^{\ell} \mid z_t^{\ell+1}, \mathcal{C}, A)\big)\Big], \quad (12)$$
where, unlike using the learned token, $H'^{0}_t$ in the posterior is the feature embedding of $X_t$. Following [9], we use the same variational module to approximate the conditional prior so that $p(\cdot) = q(\cdot)$ in Equation 12. During optimization, the negative ELBO can be minimized using stochastic gradient descent with the reparameterization trick [19].
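Two ingredients of this optimization admit short, self-contained sketches: the reparameterization trick used to sample each layer's latent, and the closed-form KL divergence between the factorized Gaussians that appears in every term of Equation 12. This is generic machinery, not the authors' code.

```python
import numpy as np

def reparameterize(mu, sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), so gradients can
    flow through mu and sigma (the reparameterization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for factorized Gaussians,
    as appears in each layer's term of the decomposed KL."""
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

rng = np.random.default_rng(42)
z = reparameterize(np.zeros(3), np.ones(3), rng)
# KL between identical Gaussians is exactly zero.
assert kl_diag_gaussians(np.zeros(3), np.ones(3), np.zeros(3), np.ones(3)) == 0.0
```

In training, the KL term regularizes the posterior toward the GBA-aggregated prior while the sampled `z` feeds the Gaussian likelihood head.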

EXPERIMENTS
To evaluate the performance and properties of STGNP, we conducted experiments to answer the following questions:
• Q1: How does STGNP perform on real-world datasets compared to other baselines?
• Q2: What is the quality of the uncertainty estimates?
• Q3: What is the effect of each component in our model, e.g., the learnable token, CSGCN, DCConv, and GBA?
• Q4: Is STGNP resistant to different missing ratios in datasets?
• Q5: Is it sensitive to hyperparameters and prone to overfitting?
• Q6: Does STGNP show robust generalizability when the domain of sensors changes?

Experimental Setup
4.1.1 Dataset Descriptions. We carry out experiments on three real-world spatio-temporal datasets.
• Beijing [51]: Beijing contains air quality indexes (AQI) from 36 stations and district-level meteorological attributes. Following [5,12], we aim to extrapolate the AQI of PM2.5, PM10, and NO2, using meteorological attributes such as temperature, humidity, pressure, wind speed, direction, and weather as covariates.
• London: We adopt London to evaluate performance on a different domain to answer Q6. It collects signals from 24 stations.
Note that some baselines use non-publicly available covariates such as POIs on the above two datasets. To ensure a fair comparison, we do not utilize them for any model in this work.
4.1.2 Baselines. We consider eight baselines, which can be categorized into three classes.
• Statistical models: We utilize KNN, IDW [28], RF [7], and ANCL [32]. ANCL is a GP-based framework that designs specialized kernels for different data attributes.
• Neural Network methods: We take ADAIN [5] and MCAM [12] as NN baselines. ADAIN uses MLP and RNN layers for static and dynamic data, followed by an attention mechanism to aggregate features, whereas MCAM introduces multi-channel attention blocks for static and dynamic correlations.
• Neural Processes approaches: We modify SNP [38] and name the variant SGNP. As SNP cannot handle graph data, we add a cross-set graph network at each recurrent step before the aggregation. SGANP is an advanced version of SGNP modified from [33], which utilizes attention mechanisms as the aggregator.
Our model, STGNP, is based on a probabilistic framework and is a general method that is not tailored to specific tasks. It can also be applied in situations where exogenous covariates are not available, by simply removing the corresponding causal convolution blocks. In contrast, ANCL relies on periodic and categorical kernels and MCAM utilizes horizontal and vertical wind speeds that are specific to the air quality task. SGNP and SGANP have a similar NP framework to ours, but we abandon the recurrent structure and propose GBA to aggregate nodes under a Bayesian framework.

Evaluation Strategy and Hyperparameters.
In Figure 5, we illustrate the evaluation strategy for the experiments. As we lack fine-grained data for the areas, we manually set aside 30% of the existing stations in each dataset as target nodes for reporting performance. The remaining 70% are used for training, giving a target-to-context ratio of 3:7. This ensures that target nodes are not involved in the training phase.

Overall Performance
To answer Q1, we report the overall results of the baselines over the Beijing and Water datasets, as shown in Table 1. Note that ANCL and MCAM require meteorological features, so they cannot be applied to the Water dataset. From the table, we see that STGNP consistently outperforms the other baseline models by a notable margin, with the lowest errors on all datasets. Moreover, we have the following observations. First, the GP model ANCL outperforms the other statistical models, indicating that GPs have the essential ability to capture complex dependencies with dedicated kernels. Second, NN models surpass the above methods on all datasets because of their strong learning capability. Third, we find that SGNP and SGANP cannot consistently outperform NNs, possibly due to the transition collapse of the recurrent structure, which may cause them to ignore the input contexts. Comparing these two models, although the attention aggregator performs better than the mean aggregator on tasks like computer vision, this is not always the case for graph data. This is because the data is constrained by a graph, which should be considered explicitly. Our STGNP achieves the best results due to the use of causal convolutions to alleviate transition collapse and the GBA to take node uncertainties into account. Another interesting discovery is that our model, SGNP, SGANP, and ANCL exhibit lower metric variances compared to the other NN models. Evidently, for PM10 concentrations with large extrapolation errors, the metric variances for these models are small, whereas for ADAIN and MCAM, the variances of MAE and RMSE reach high values of 1.3 and 3.4, and 3.2 and 5.5, respectively. The primary reason for the low variances of these models is their ability to model data in a probabilistic manner and their awareness of uncertainty. This characteristic makes the models robust to parameter initialization and reduces the chance of getting trapped in local minima.
We also visualize extrapolation results in Figure 6, which illustrates the PM2.5 extrapolations of the five best models on the Beijing dataset, together with the ground truth. We observe that our model produces the most accurate results with respect to the ground truth data. This is particularly evident during the sudden changes in the data.

Uncertainty Estimates Analysis
One of the key benefits of our STGNP model is its capability to provide high-quality uncertainty estimates. To answer Q2, we first evaluate the qualitative results to show that they provide valuable information. In Figure 6, our model consistently produces more accurate extrapolations than the baselines. However, at 3-6 AM, March 28, shown in the blue box, all methods fail to generate precise data. Significantly, our model renders higher uncertainties (orange shadow), indicating possibly inaccurate extrapolations. This ability to accurately estimate uncertainty can be extremely useful in practical decision-making. For instance, if the uncertainty for a location is consistently high, researchers could use this information to prioritize the deployment of sensors in order to achieve more accurate data analysis with minimal expenditure. Next, to evaluate the quality of our estimates quantitatively, we follow the approach used in [4] and examine the number of extrapolations that fall within the uncertainty intervals. Assuming a Gaussian likelihood with variance $\sigma^2$ to express uncertainty, the intervals $1\sigma$, $2\sigma$, $3\sigma$ centered at the predicted value cover approximately 68.3%, 95.5%, and 99.7% of the probability density. We posit that a better model should have a higher proportion of points falling within the intervals. We then report the statistical results of the proportions for ANCL, SGNP, SGANP, and STGNP. They show that all NP methods have superior performance compared to ANCL and that our STGNP outperforms SGNP and SGANP on most metrics. This is likely because our model also considers uncertainties in context nodes through GBA.
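The calibration check described above reduces to counting how many ground-truth points fall inside $\mu \pm k\sigma$. A minimal sketch, using a toy perfectly-calibrated model rather than any real extrapolation output:

```python
import numpy as np

def interval_coverage(y_true, mu, sigma, k):
    """Fraction of ground-truth points falling inside mu +/- k*sigma.

    For a well-calibrated Gaussian likelihood this should be close to
    68.3% (k=1), 95.5% (k=2), and 99.7% (k=3)."""
    inside = np.abs(y_true - mu) <= k * sigma
    return inside.mean()

rng = np.random.default_rng(7)
n = 100_000
mu = np.zeros(n)
sigma = np.ones(n)
y = rng.standard_normal(n)  # toy model: predictions are exactly calibrated
cov1 = interval_coverage(y, mu, sigma, 1)
assert abs(cov1 - 0.683) < 0.01
```

Over-confident models (sigma too small) would show coverage below the nominal levels; under-confident ones would exceed them.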

Ablation Study
To assess the contribution of each component to the overall performance of our model and answer Q3, we conduct ablation studies and present the results in Figure 8. In each study, we only modify the corresponding part while leaving the other settings unchanged.
Effect of learnable token: We remove the learned token and use linear interpolation to preprocess target nodes (w/o TK). The results unveil that the token has a positive impact on the model's performance. We believe this is because using simple interpolation to represent all signals of the target nodes leads to significant errors. Mingled with context nodes' signals, this hobbles the model's learning ability, and even the uncertainty estimates struggle to rectify them. In contrast, the learned token identifies a suitable position in the feature space that accurately indicates the target nodes, thereby avoiding this problem.
Effect of CSGCN: To evaluate the effectiveness of the cross-set graph convolutional network, we compare it to three variants: (a) w/o CSGCN: removing CSGCN from our model; (b) r/p GCN: replacing CSGCN with a standard GCN; (c) r/p RGCN: replacing it with an advanced relational GCN [34]. Our results show that removing the spatial learning module CSGCN leads to a degradation in performance, underscoring the importance of capturing spatial dependencies. Additionally, we find our CSGCN achieves performance on par with the standard GCN, which learns dependencies among all nodes in the two sets, and even slightly better than the RGCN, which explicitly characterizes categorical relations among nodes. This observation suggests that modeling dependencies across the two sets is sufficient to achieve satisfactory performance, thereby validating our insight.
Effect of GBA: Results from experiments where GBA is discarded (w/o GBA) or replaced with a max aggregator (r/p MAX), a mean aggregator (r/p MEAN), or an attention aggregator (r/p ATTN) highlight the critical role of GBA. Removing or replacing it leads to a significant decline in results on most datasets. This is likely because context nodes have varying levels of ambiguity. Without uncertainty estimates, the model has difficulty extracting valuable context information, which hampers the performance of the likelihood module.
Effect of causal convolution: We replace our temporal learning framework with an RNN structure (r/p RNN), and the results show a strong deterioration in the model's performance. The outcome indicates that transition collapse, which occurs in an RNN because of its large number of transitions, can have a substantial impact on the model's capability. In contrast, our STGNP model effectively mitigates this issue by significantly reducing the number of transitions.

Missing Ratio Study
Evaluating the model's performance on data with missing values is important, as real-world sensors may lose signals due to unpredictable factors. To answer Q4, we train models using manually corrupted data, randomly replacing a ratio of the data with zero. Following the same procedure, we preprocess the corrupted entries with linear interpolation for NNs, while leaving the missing values as zero for NP models. The results, which illustrate the MAE performance of various baselines for ratios ranging from 0.2 to 0.7, are shown in Figure 7(a). We observe significant performance degradation in ADAIN and MCAM, which can similarly be attributed to the large errors caused by interpolation. In contrast, STGNP and SGANP demonstrate better performance, with STGNP performing best. The key factor behind their success lies in the ability of their likelihood modules to capture uncertainties associated with the target nodes. This capability effectively mitigates the impact of missing data. Compellingly, the GBA module in our STGNP further enhances its robustness by accurately modeling uncertainties in the signals of individual sensors.
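The corruption protocol used in this study, randomly zeroing a given ratio of entries, can be sketched as follows; the tensor shape is illustrative only.

```python
import numpy as np

def corrupt(X, ratio, rng):
    """Randomly zero out roughly `ratio` of the entries to simulate
    missing sensor signals; also return the corruption mask."""
    mask = rng.random(X.shape) < ratio
    return np.where(mask, 0.0, X), mask

rng = np.random.default_rng(3)
# Toy readings shifted away from zero so corrupted entries are distinguishable.
X = rng.normal(size=(36, 24, 3)) + 10.0
Xc, mask = corrupt(X, ratio=0.3, rng=rng)
assert abs(mask.mean() - 0.3) < 0.05  # roughly 30% of entries corrupted
assert np.all(Xc[mask] == 0.0)
```

For the NN baselines the zeroed entries are then linearly interpolated along time, while the NP models consume the zeros directly and rely on their uncertainty estimates.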

Hyperparameter Study
To answer Q5, we study the performance of STGNP under various hyperparameter settings. We first explore the impact of the number of channels in the two stages. Following the default setting of the channel numbers $[8\alpha, 16\alpha, 32\alpha]$ where $\alpha = 2$, we vary $\alpha$ from 1 to 4. From Figure 7(b), we observe that even with the drastically increasing number of learnable parameters (violet curves), the MAE metrics (green, blue, red, and yellow curves) remain stable. Then, we examine the effect of changing the number of channels of the likelihood module, ranging from 32 to 256, and also observe stable performance, as shown in Figure 7(c). Finally, we investigate the effect of the number of layers, ranging from 1 to 5. From Figure 7(d), the performance initially improves as it increases, but starts to oscillate from 3 layers onward. This can be explained by the receptive field of the model, where 1 and 2 layers correspond to fields of 3 and 7, respectively. These limited receptive fields constrain the model's performance. However, once the receptive field becomes sufficiently large (i.e., at 3 layers and above), the model captures long enough temporal dependencies, leading to improved and more stable performance. These findings suggest STGNP is insensitive to hyperparameters, as long as the receptive field is adequately large.
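The receptive-field arithmetic quoted above (fields of 3 and 7 for 1 and 2 layers) is consistent with a kernel size of 3 and a dilation that doubles per layer, an inference rather than a stated configuration. A worked sketch:

```python
def receptive_field(num_layers, kernel_size=3):
    """Receptive field of stacked dilated causal convolutions with the
    dilation doubling per layer (1, 2, 4, ...).

    Assuming kernel size 3 (inferred from the reported fields of 3 and 7
    for 1 and 2 layers), layer l adds (kernel_size - 1) * 2**(l-1) steps.
    """
    rf = 1
    for l in range(num_layers):
        rf += (kernel_size - 1) * 2 ** l
    return rf

# 1, 2, 3 layers -> fields of 3, 7, 15 time steps.
assert [receptive_field(n) for n in (1, 2, 3)] == [3, 7, 15]
```

Under this assumption three layers cover 15 time steps, which matches the observation that performance stabilizes once the field spans enough of the input window.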

Cross-Domain Evaluation
To assess the generalization ability of the models and answer Q6, we train them on the Beijing dataset and evaluate them on the London dataset. Table 3 reports the PM2.5 performance, where the first MAE column corresponds to training directly on London data and the second to training on Beijing while evaluating on London. We confirm that our STGNP achieves the best results under both training regimes. In addition, when measuring the performance discrepancy, we discover that STGNP also suffers the smallest degradation, especially compared to the NN models. This is likely due to STGNP's strong spatio-temporal learning capability as well as the non-parametric design it inherits from GPs, which allows the model to learn high-level spatio-temporal principles in the feature space and to instantiate these dependencies at test time by constructing a stochastic process from the sets.

RELATED WORK
Spatio-Temporal Extrapolation
Spatio-temporal extrapolation is the task of predicting the state of a surrounding environment based on known information. Early works in this area use statistical machine learning methods, such as k-Nearest Neighbors (KNN) and Random Forest (RF) [7]. KNN approaches rely on linear dependencies between data points, while RF can capture non-linear dependencies. However, these methods only consider spatial relations and struggle to model more complex and dynamic correlations. Gaussian Processes [35] learn to construct stochastic processes, in which the spatio-temporal dependencies are captured by flexible kernels designed to handle different types of features [21]. For instance, Patel et al. [32] used periodic and Hamming-distance kernels for temporal and categorical features, respectively. However, these kernels are tailored to specific scenarios, and the cubic time complexity limits their applicability. Some approaches view extrapolation as a tensor completion task [49]. Based on a low-rank matrix assumption, these methods capture spatio-temporal patterns while being efficient to optimize. However, they are transductive, meaning that they can only infer the state of nodes involved in the training process and cannot generalize to new nodes absent from the training data. Recently, Neural Networks (NNs) have become the dominant paradigm. Cheng et al. [5] proposed an attention model for air quality inference, where dynamic and static data are encoded by RNNs and MLPs, and an attention mechanism integrates the features of the nodes. Han et al. [12] combined GCNs with a multi-channel attention module to improve performance. However, NNs tend to struggle with learning uncertainties and can overfit on datasets with little data. Some NNs also view extrapolation as a kriging problem. For instance, Appleby et al. [2] interpolate a node given its neighbors with time information as extra features, and Wu et al. [45] learn the temporal dynamics explicitly. However, the former only captures spatial relations and the latter cannot incorporate exogenous covariates. In contrast, our approach achieves both goals.

Neural Processes Family
Neural Processes (NPs) combine the merits of both NNs and GPs [9], possessing strong learning ability and uncertainty estimates. Basically, an NP induces latent variables over the context set, forming a conditional latent variable model; a likelihood module is then used to generate the target outputs. Le et al. [20] found that NPs suffer from underfitting because of their weak aggregation functions (e.g., mean or sum). Kim et al. [16] then proposed Attentive Neural Processes (ANP) to calculate the importance within and across the context and target sets. Kim et al. [17] further proposed a stochastic attention mechanism for aggregation, where the attention weights are inferred using Bayesian inference, and Volpp et al. [41] introduced a stochastic aggregator that aggregates context variables directly. However, these works only consider the spatial domain and cannot handle graph data. Singh et al. [38] first focused on sequential data and proposed Sequential Neural Processes (SNP). With a latent state transition function from a variational recurrent neural network (VRNN) [6], SNP constructs a sequential stochastic process along a timeline. Yoon et al. [47] then introduced Recurrent Memory Reconstruction to compensate for the distribution shift in a sequence. However, these recurrent structures suffer from the transition collapse problem, which makes it difficult to learn temporal relations over long sequences. In contrast, our method utilizes causal convolutions to alleviate this challenge.

CONCLUSIONS AND FUTURE WORK
We introduce Spatio-Temporal Graph Neural Processes, the first framework for spatio-temporal extrapolation in the Neural Processes family. Our model captures temporal relations and addresses the transition collapse problem using causal convolutions, while effectively learning spatial dependencies using the cross-set graph network. Graph Bayesian Aggregation aggregates context nodes in a way that takes their uncertainties into account and enhances the learning ability of NPs on graph data. Experimental results demonstrate the superiority of STGNP in terms of extrapolation accuracy, uncertainty estimates, robustness, and generalizability. In the future, an intriguing direction would be to explore STGNP for spatio-temporal forecasting, another fundamental task in the area. By aggregating historical representations, the model could provide predictions for future time steps.

A MATHEMATICAL NOTATION
We define the major mathematical notations in the paper in Table 4 for better understanding.

B DERIVATION OF GRAPH BAYESIAN AGGREGATION
In this section, we give a formal derivation of the proposed Graph Bayesian Aggregation. We first derive the general GBA without factorization or a specific graph structure. We assume a latent prior $p(z) = \mathcal{N}(z \mid \mu_0, \Lambda^{-1})$ over a target node (omitting the node subscript for brevity). The latent observations of all context nodes are independent linear transformations of $z$ following Gaussian distributions:
$$p(r_n \mid z) = \mathcal{N}(r_n \mid A_n z, L_n^{-1}), \quad n = 1, \dots, N,$$
where we use the precision matrices $\Lambda$ and $L_n$ for convenience; $A_n$ is a transformation matrix representing the graph structure. The logarithmic joint probability over $z$ and $[r_1, \dots, r_N]$ is:
$$\ln p(z, r_1, \dots, r_N) = -\tfrac{1}{2}(z - \mu_0)^\top \Lambda (z - \mu_0) - \tfrac{1}{2}\sum_{n=1}^{N} (r_n - A_n z)^\top L_n (r_n - A_n z) + \text{const}, \tag{15}$$
where the second-order terms can be decomposed as:
$$-\tfrac{1}{2} z^\top \Big(\Lambda + \sum_n A_n^\top L_n A_n\Big) z - \tfrac{1}{2} \sum_n r_n^\top L_n r_n + \sum_n z^\top A_n^\top L_n r_n.$$
According to [3], collecting these terms gives the precision matrix of the joint distribution, whose covariance matrix can be calculated block-wise as:
$$\operatorname{cov}[z] = \Lambda^{-1}, \quad \operatorname{cov}[z, r_n] = \Lambda^{-1} A_n^\top, \quad \operatorname{cov}[r_n, r_m] = A_n \Lambda^{-1} A_m^\top + \delta_{nm} L_n^{-1}. \tag{16}$$
Next, the linear terms in Equation 15 reduce to $z^\top \Lambda \mu_0$. Then from [3], the mean of the joint distribution is computed by:
$$\mathbb{E}[z] = \mu_0, \quad \mathbb{E}[r_n] = A_n \mu_0. \tag{17}$$
The mean and covariance of the marginal distribution $p(r_1, \dots, r_N)$ can be extracted from Equations 17 and 16. Finally, with $p(z, r_1, \dots, r_N)$ and $p(r_1, \dots, r_N)$, we obtain $p(z \mid r_1, \dots, r_N)$ through Gaussian conditioning, which is also a Gaussian:
$$\Sigma_{\text{post}} = \Big(\Lambda + \sum_n A_n^\top L_n A_n\Big)^{-1}, \quad \mu_{\text{post}} = \Sigma_{\text{post}} \Big(\Lambda \mu_0 + \sum_n A_n^\top L_n r_n\Big).$$
In GBA, the Gaussian distributions are assumed to be factorized; thus $\Lambda^{-1} = \operatorname{diag}(\sigma_z^2)$ and $L_n^{-1} = \operatorname{diag}(\sigma_n^2)$. The transformation $A_n = a(d_n)$, where $d_n$ represents the distance between the target node and context node $n$. Then the posterior can be further weighted element-wise as:
$$\sigma_{\text{post}}^{-2} = \sigma_z^{-2} + \sum_n \frac{a(d_n)^2}{\sigma_n^2}, \quad \mu_{\text{post}} = \sigma_{\text{post}}^2 \odot \Big(\frac{\mu_0}{\sigma_z^2} + \sum_n \frac{a(d_n)\, r_n}{\sigma_n^2}\Big).$$
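Under the factorized assumption, the posterior reduces to a precision-weighted combination of the prior and the latent observations, computable in a few lines of NumPy. This is a minimal sketch (the helper name `gba_posterior` and the array shapes are our assumptions, not the paper's implementation):

```python
import numpy as np

def gba_posterior(mu0, var0, obs_mu, obs_var, weights):
    """Factorized Gaussian Bayesian aggregation (a sketch of GBA).

    mu0, var0: prior mean/variance of the target latent, shape (d,).
    obs_mu, obs_var: latent observations r_n and their variances, shape (N, d).
    weights: distance-based scalars a(d_n), shape (N,), encoding graph structure.
    Returns the posterior mean and variance via Gaussian conditioning.
    """
    w = weights[:, None]
    # Posterior precision: prior precision plus weighted observation precisions.
    post_prec = 1.0 / var0 + np.sum(w ** 2 / obs_var, axis=0)
    post_var = 1.0 / post_prec
    # Posterior mean: precision-weighted combination of prior and observations.
    post_mu = post_var * (mu0 / var0 + np.sum(w * obs_mu / obs_var, axis=0))
    return post_mu, post_var
```

Note that a noisy context node (large `obs_var`) or a distant one (small weight) contributes little to the posterior, which is exactly the robustness property exploited in the missing-data experiments.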

C DERIVATION OF ELBO FOR STGNP
Here, we derive the evidence lower bound (ELBO) for STGNP. For brevity, we still omit the node subscript.

D.2 Training Settings
All parameters are initialized with Xavier normalization [10] and optimized by the Adam optimizer [18] with a learning rate of 10^-3.
We train each model for 150 epochs. At each iteration, we randomly sample N − 3 nodes as contexts to extrapolate the remaining 3 nodes, with the time length T = 24. Note that the number of target nodes has an impact on the performance of the trained models. We conducted experiments to determine the optimal number of target nodes and found that using 3 nodes generally yielded the best performance across all baseline models.
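The per-iteration context/target split described above can be sketched as follows (`split_context_target` is a hypothetical helper name):

```python
import numpy as np

def split_context_target(num_nodes, num_target=3, rng=None):
    """Randomly hold out `num_target` nodes as extrapolation targets,
    using the remaining nodes as context (the N - 3 / 3 split above)."""
    rng = rng or np.random.default_rng()
    perm = rng.permutation(num_nodes)
    return perm[num_target:], perm[:num_target]  # context ids, target ids
```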

D.3 Cross-Domain Evaluation
We first train models on the Beijing dataset and then evaluate their performance on the London dataset. As the London dataset lacks weather information, we remove this attribute from the Beijing data during training. The other training procedures remain the same. Note that we only investigate the performance on PM2.5 concentration and exclude stations BX1 and HR1 due to their large portion of missing values. This is because, for the London dataset, both PM10 and NO2 have a significant amount of missing data (5/7 stations without any signal, and 2/1 stations with missing rates larger than 60%), which makes training unstable and the performance of the models largely dependent on the training and testing data split.

E ADDITIONAL VISUALIZATIONS
We present additional visualization results of STGNP and the baselines on the Beijing dataset. As shown in Figure 10, our model consistently outperforms the others at all time steps. Additionally, for less accurate extrapolations, our model yields large uncertainties, which facilitates decision-making.

Figure 1 :
Figure 1: (a) Context sensors 1-5 are utilized to generate data for a target location, considering exogenous covariates and the graph structure. (b) Extrapolations of our STGNP.

Figure 3 :
Figure 3: The pipeline of the spatio-temporal representation learning network, where we first capture temporal dependencies using the DCConv and then learn spatial relations by the CSGCN. Embed denotes an embedding layer.

Figure 4 :
Figure 4: Graph Bayesian Aggregation. Enc_μ(·) and Enc_σ(·) are neural networks that learn the mean and variance for the prior and latent observation distributions.
• To the best of our knowledge, this is the first work that generalizes NPs to spatio-temporal graph modeling. STGNP captures uncertainties explicitly and can generalize to different functions robustly, a major advantage over NN approaches. Additionally, STGNP is able to learn temporal relations and graph data effectively, which sets it apart from classical NPs models.
• We introduce Graph Bayesian Aggregation, a Bayesian method for aggregating context nodes, which allows the aggregator to model the graph structure and the uncertainties of the context nodes.
• We conduct comprehensive experiments to evaluate the performance and properties of STGNP. Our results demonstrate that STGNP outperforms state-of-the-art baselines by a significant margin and exhibits compelling properties, such as uncertainty estimates, high generalizability, and robustness to noisy data.
We formulate spatio-temporal extrapolation in the NPs framework. We first define the context set C containing nodes with exogenous covariates and observed data {(X_i, Y_i)}_{i=1}^{N_c} ∈ R^{N_c × T × (d_x + d_y)}. Our goal is to learn a posterior predictive distribution p(Y_D | X_D, C, G) that generates Y_D ∈ R^{N_t × T × d_y} over the same time period. The remaining symbols in the figure denote the deterministic representations, the latent variables of a target node, and the representations of a context node; the learnable target token lies in R^{d_0}. The shaded circle, STRL, and GBA denote an observed variable, the spatio-temporal representation learning network, and Graph Bayesian Aggregation, respectively.
Each dataset is sequentially divided into three segments: an 80% training set, a 10% validation set, and a 10% test set. During training and validation, we use N − 3 nodes to extrapolate the remaining 3 randomly selected nodes. We report the extrapolation metrics MAE, MSE, and MAPE on the target stations. Since our STGNP, SGNP, and SGANP contain stochastic modules, we adopt the means of the distributions directly, instead of sampling, to report the results. The time window T is 24 and the adjacency matrix is constructed based on the locations of the stations, normalized by a thresholded Gaussian kernel [37]. It is worth noting that, in the case of NNs, data processing is necessary to handle missing values in the dataset. For this reason, we employ linear interpolation to fill in the missing values for ADAIN and MCAM, while for the other baselines, missing values are left as zero. The hyperparameters are the same on all datasets for STGNP. The deterministic learning stage has 3 layers with a kernel size k = 3 and channel numbers [16, 32, 64]. The stochastic stage is a 3-layer 1 × 1 convolution module and the channels of the latent variables are [16, 32, 64]. The likelihood function is a 3-layer 1 × 1 convolution with 128 channels in each layer. STGNP, ADAIN, MCAM, SGNP, and SGANP are implemented with PyTorch and trained on an NVIDIA A100 GPU. We repeat each experiment 5 times and report the average and variance of the metrics. Our implementation is publicly available.
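The thresholded Gaussian kernel used to build the adjacency matrix from station locations can be sketched as follows (a hypothetical helper; the paper's exact sigma and threshold values are not specified here, so both are assumptions):

```python
import numpy as np

def gaussian_adjacency(coords, sigma=None, threshold=0.1):
    """Adjacency matrix from station coordinates via a thresholded Gaussian
    kernel, a common construction [37]. `sigma` defaults to the standard
    deviation of the pairwise distances when not given."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    sigma = sigma or dist.std()
    adj = np.exp(-dist ** 2 / (2 * sigma ** 2))
    adj[adj < threshold] = 0.0  # sparsify weak, long-range links
    return adj
```

Nearby stations receive weights close to 1, while links below the threshold are pruned to keep the graph sparse.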
Figure 5: Evaluation strategy and dataset division. During testing, we use context nodes to extrapolate target nodes.

Table 1 :
Performances of STGNP and the baselines on two datasets. We denote the metric variance as Δa = 0.1 × a.
after 9 AM on March 29, as highlighted in the red box. This demonstrates the effectiveness of our model in capturing both spatial information from surrounding sensors and temporal dependencies.

Table 2 :
Proportions (%) of data falling in the intervals of 1σ − 3σ, where σ is the standard deviation of the likelihood.
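The interval statistic reported in Table 2 can be computed with a short helper (`interval_coverage` is a hypothetical name; the array shapes are assumptions):

```python
import numpy as np

def interval_coverage(y_true, mu, sigma, ks=(1, 2, 3)):
    """Fraction of ground-truth values within k predicted standard
    deviations, the quantity tabulated (in %) for the 1σ-3σ intervals."""
    return {k: float(np.mean(np.abs(y_true - mu) <= k * sigma)) for k in ks}
```

Proportions close to the Gaussian reference values (about 68%, 95%, and 99.7%) indicate a well-calibrated likelihood.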

Table 3 :
Performances of cross-domain evaluation.