Spatio-temporal Diffusion Point Processes

Spatio-temporal point process (STPP) is a stochastic collection of events, each associated with a time and a location. Due to computational complexities, existing solutions for STPPs assume conditional independence between time and space, modeling the temporal and spatial distributions separately. The failure to model the joint distribution leads to limited capacity in characterizing the spatio-temporally entangled interactions given past events. In this work, we propose a novel parameterization framework for STPPs, which leverages diffusion models to learn complex spatio-temporal joint distributions. We decompose the learning of the target joint distribution into multiple steps, where each step can be faithfully described by a Gaussian distribution. To enhance the learning of each step, an elaborated spatio-temporal co-attention module is proposed to capture the interdependence between the event time and space adaptively. For the first time, we break the restrictions on spatio-temporal dependencies in existing solutions, and enable a flexible and accurate modeling paradigm for STPPs. Extensive experiments from a wide range of fields, such as epidemiology, seismology, crime, and urban mobility, demonstrate that our framework outperforms the state-of-the-art baselines remarkably, with an average improvement of over 50%. Further in-depth analyses validate its ability to capture spatio-temporal interactions, which it can learn adaptively for different scenarios. The datasets and source code are available online: https://github.com/tsinghua-fib-lab/Spatio-temporal-Diffusion-Point-Processes.


INTRODUCTION
Spatio-temporal point process (STPP) is a stochastic collection of points, where each point denotes an event $x = (t, s)$ associated with a time $t$ and a location $s$. STPP is a principled framework for modeling sequences of spatio-temporal events, and has been applied in a wide range of fields, such as earthquakes and aftershocks [3,36], disease spread [34,40], urban mobility [29,54,61,62], and emergencies [58,66].
Spatio-temporal point processes have been widely studied in the literature [2,3,7,13,45,59,67] with rich theoretical foundations [5,14,23]. Due to computational complexities, a general approach for STPPs is to characterize the event time and space with distinct models. Conventional STPP models [7,13,45] mainly capture relatively simple patterns of spatio-temporal dynamics, where the temporal domain is modeled by temporal point process models, such as the Poisson process [23], Hawkes process [17], and self-correcting process [21], and the spatial domain is usually fitted by kernel density estimators (KDE) [56]. With the advance of neural networks, a series of neural architectures have been proposed to improve the fitting accuracy [3,22,65]. However, they still adopt the approach of separate modeling. For example, Chen et al. [3] use neural ODEs and continuous-time normalizing flows (CNFs) to learn the temporal distribution and spatial distribution, respectively. Zhou et al. [65] apply two independent kernel functions for time and space, whose parameters are obtained from neural networks, to build the density function.
However, for STPPs, the time and space where an event occurs are highly dependent and entangled with each other. For example, in seismology, earthquakes are spatio-temporally correlated due to crust movements [55]: an earthquake occurs with a higher probability close in time and space to previous earthquakes. Take urban mobility as another example: people are more likely to commute during the day, while tending to go out for entertainment at night. Therefore, it is crucial to learn models that can address the spatio-temporal joint distribution conditioned on the event history. However, this is non-trivial due to the following two challenges:
attention and temporal attention to capture their fine-grained interactions adaptively, which characterizes the underlying mechanisms of the joint distribution. Table 1 compares the advantages of our framework with existing solutions. DSTPP can learn spatio-temporal joint distributions without any dependence restrictions. As no integrals or Monte Carlo approximations are required, it is flexible and can perform sampling in closed form. It can also be used to model a variety of STPPs, where events are associated with either a vector of real-valued spatial coordinates or a discrete value, e.g., a class label of the location; thus it is broadly applicable in real-world scenarios. We summarize our contributions as follows:
• To the best of our knowledge, we are the first to model STPPs within the diffusion model paradigm. By removing integrals and overcoming structural design limitations of existing solutions, our framework achieves flexible and accurate modeling of STPPs.
• We propose a novel spatio-temporal point process model, DSTPP.
On the one hand, the diffusion-based approach decomposes the complex spatio-temporal joint distribution into tractable distributions. On the other hand, the elaborated co-attention module captures the spatio-temporal interdependence adaptively.
• Extensive experiments demonstrate the superior performance of our approach for modeling STPPs on both synthetic and real-world datasets. Further in-depth analyses validate that our model successfully captures spatio-temporal interactions for different scenarios in an adaptive manner.

PRELIMINARIES

Spatio-temporal Point Process
A spatio-temporal point process is a stochastic process composed of events with time and space that occur over a domain [35]. These spatio-temporal events are described in continuous time with spatial information. The spatial domain of an event can be recorded in different ways. For example, in earthquakes, it is usually recorded as longitude-latitude coordinates in continuous space. It can also be associated with discrete labels, such as the neighborhoods of crime events. Let $x_i = (t_i, s_i)$ denote the $i$-th spatio-temporal event, written as the pair of occurrence time $t_i \in \mathcal{T}$ and location $s_i \in \mathcal{S}$, where $\mathcal{T} \times \mathcal{S} \subseteq \mathbb{R} \times \mathbb{R}^d$. Then a spatio-temporal point process can be defined as a sequence $\mathcal{X} = \{x_1, x_2, \ldots, x_L\}$, where the number of events $L$ is also stochastic. Let $\mathcal{H}_t = \{x_i \mid t_i < t, x_i \in \mathcal{X}\}$ denote the event history before time $t$. Modeling STPPs is concerned with parameterizing the conditional probability density function $p(t, s \mid \mathcal{H}_t)$, which denotes the conditional probability density of the next event happening at time $t$ and space $s$ given the history $\mathcal{H}_t$. Discussion on shortcomings. In existing methods for STPPs, given the event history, space and time are assumed to be conditionally independent [9,28,32,45,65,68] or unilaterally dependent [3,5], i.e., the space depends on the time through $p(s \mid t)$. These dependence restrictions harm the model's ability to capture entangled space-time interactions conditioned on the history. Besides, most approaches require integration when calculating the likelihood, or limit intensity functions to integrable forms, leading to a trade-off between accuracy and efficiency. We compare the shortcomings of existing approaches in Table 1, which motivate us to design a more flexible and effective model.

Denoising Diffusion Probabilistic Models
Diffusion models [19] generate samples by learning a distribution that approximates a data distribution. The distribution is learned through a gradual denoising process that reverses the addition of noise, recovering the actual value starting from Gaussian noise. At each step of the denoising process, the model learns to predict a slightly less noisy value.
Formally, the forward diffusion process gradually corrupts the data $x_0$ with Gaussian noise over $K$ steps. Conversely, the reverse denoising process recovers $x_0$ starting from $x_K$, where $x_K \sim \mathcal{N}(x_K; 0, \mathbf{I})$. It is defined by the following Markov chain with learned Gaussian transitions:

$p_\theta(x_{0:K}) = p(x_K) \prod_{k=1}^{K} p_\theta(x_{k-1} \mid x_k), \quad p_\theta(x_{k-1} \mid x_k) = \mathcal{N}(x_{k-1}; \mu_\theta(x_k, k), \Sigma_\theta(x_k, k)). \quad (2)$

Here $p_\theta(x_{k-1} \mid x_k)$ aims to remove the Gaussian noise added in the forward diffusion process. The parameter $\theta$ can be optimized by minimizing the negative log-likelihood via a variational bound:

$\mathbb{E}\left[-\log p_\theta(x_0)\right] \le \mathbb{E}_q\!\left[-\log \frac{p_\theta(x_{0:K})}{q(x_{1:K} \mid x_0)}\right]. \quad (3)$

(Footnote 2: TPP models can be used for STPPs where the space acts as the marker.)
Ho et al. [19] show that the denoising parameterization can be trained with the simplified objective

$L_{\text{simple}} = \mathbb{E}_{x_0, \epsilon, k}\left[\left\| \epsilon - \epsilon_\theta(x_k, k) \right\|^2\right], \quad \text{where } x_k = \sqrt{\bar{\alpha}_k}\, x_0 + \sqrt{1 - \bar{\alpha}_k}\, \epsilon.$

$\epsilon_\theta$ estimates the Gaussian noise added to the input $x_k$, and is trained with the MSE loss between the real and predicted noise. Therefore, $\epsilon_\theta$ acts as the denoising network that transforms $x_k$ to $x_{k-1}$. Once trained, we can sample $x_{k-1}$ from $p_\theta(x_{k-1} \mid x_k)$ and progressively obtain $x_0$ according to Equation (2).
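As a concrete illustration, the closed-form forward corruption and the simplified noise-matching loss can be sketched in a few lines of NumPy. The linear beta-schedule bounds are common defaults and an assumption here, and `eps_pred` stands in for the output of the denoising network.

```python
import numpy as np

def make_schedule(K, beta_start=1e-4, beta_end=0.02):
    # linear variance schedule beta_1..beta_K and alpha_bar_k = prod_j (1 - beta_j)
    betas = np.linspace(beta_start, beta_end, K)
    alpha_bar = np.cumprod(1.0 - betas)
    return betas, alpha_bar

def q_sample(x0, k, alpha_bar, eps):
    # closed-form forward diffusion: x_k = sqrt(abar_k) * x0 + sqrt(1 - abar_k) * eps
    return np.sqrt(alpha_bar[k]) * x0 + np.sqrt(1.0 - alpha_bar[k]) * eps

def simple_loss(eps_pred, eps):
    # MSE between predicted and true noise (the simplified objective of [19])
    return float(np.mean((eps_pred - eps) ** 2))
```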

SPATIO-TEMPORAL DIFFUSION POINT PROCESSES
Figure 2 illustrates the overall framework of DSTPP, which consists of two key modules: the spatio-temporal self-attention encoder and the spatio-temporal diffusion model. The spatio-temporal encoder learns an effective representation of the event history, which then acts as the condition for the spatio-temporal denoising diffusion process. We first present the spatio-temporal encoder in Section 3.1. Then we formulate the learning of the spatio-temporal joint distribution as a denoising diffusion process, and introduce the diffusion process and the reverse denoising process in Section 3.2. We describe how to train the model and perform sampling in Section 3.3. Finally, we present the detailed architecture of the denoising network parameterization in Section 3.4.

Spatio-temporal Encoder
To model the spatio-temporal dynamics of events and obtain effective sequence representations, we design a self-attention-based spatio-temporal encoder. The input of the encoder consists of events $x = (t, s)$. To obtain a unique representation for each event, we use two embedding layers for time and space separately.
For the space $s \in \mathbb{R}^d$, we utilize a linear embedding layer; for the timestamp, we apply a positional encoding method following [68], which yields a temporal embedding $e^t \in \mathbb{R}^D$, where $D$ is the embedding dimension. For the spatial domain, we use a linear projection to convert continuous or discrete space into an embedding $e^s = W_{emb}\, s$, where $W_{emb}$ contains learnable parameters. We use $W_{emb} \in \mathbb{R}^{D \times d}$ if the space $s$ is defined in the continuous domain $\mathbb{R}^d$, $d \in \{1, 2, 3\}$. We use $W_{emb} \in \mathbb{R}^{D \times V}$ if the spatial information is associated with discrete locations represented by a one-hot ID encoding $s \in \mathbb{R}^V$, where $V$ is the number of discrete locations. In this way, we obtain real-valued vectors $e^s$ for both continuous and discrete spatial domains. For each event $x = (t, s)$, we obtain the spatio-temporal embedding $e^{st}$ by adding the positional encoding $e^t$ and the spatial embedding $e^s$. The embedding of the sequence $\{(t_i, s_i)\}_{i=1}^{L}$ is then specified by $E^{st} = \{e^{st}_1, e^{st}_2, \ldots, e^{st}_L\} \in \mathbb{R}^{L \times D}$, where $e^{st}_i = e^t_i + e^s_i$. In the meantime, we also keep the temporal embedding $E^t = \{e^t_1, e^t_2, \ldots, e^t_L\}$ and the spatial embedding $E^s = \{e^s_1, e^s_2, \ldots, e^s_L\}$, with the goal of capturing the characteristics of each aspect separately. If only the joint spatio-temporal representation were available, the model might fail in cases where the temporal and spatial domains are not entangled. With representations of the different aspects learned, we do not simply sum them together; instead, we concatenate them and let the model leverage the representations adaptively.
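The two embedding layers above can be sketched in NumPy. The sinusoidal form with base 10000 follows the common Transformer-style encoding and is an assumption about the method of [68]; the random `W_emb` is a stand-in for the learned projection.

```python
import numpy as np

def temporal_encoding(t, D):
    # sinusoidal encoding of timestamp t into a D-dim vector (D even)
    i = np.arange(D // 2)
    freqs = t / (10000.0 ** (2.0 * i / D))
    enc = np.empty(D)
    enc[0::2] = np.sin(freqs)   # even dimensions
    enc[1::2] = np.cos(freqs)   # odd dimensions
    return enc

def spatial_embedding(s, W_emb):
    # linear projection: works for continuous coords (d-dim) or one-hot IDs (V-dim)
    return W_emb @ s

D, d = 16, 2
rng = np.random.default_rng(0)
W_emb = 0.1 * rng.standard_normal((D, d))   # hypothetical learned weights
e_t = temporal_encoding(3.5, D)
e_s = spatial_embedding(np.array([0.2, -1.0]), W_emb)
e_st = e_t + e_s                            # per-event spatio-temporal embedding
```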
After the initial spatial embedding and temporal encoding layers, we pass $E^{st}$, $E^t$, and $E^s$ through three self-attention modules. Specifically, the scaled dot-product attention [53] is defined as

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{D}}\right)V,$

where $Q$, $K$, and $V$ represent the queries, keys, and values. In our case, the self-attention operation takes the embedding $E$ as input and converts it into the three matrices by linear projections $Q = EW_Q$, $K = EW_K$, and $V = EW_V$, where $W_Q$, $W_K$, and $W_V$ are the weights of the linear projections. Finally, we use a position-wise feed-forward network to transform the attention output into the hidden representation $h(\cdot)$.
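The scaled dot-product self-attention above amounts to the following sketch, with random projection weights standing in for the learned parameters:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(D)) V
    D = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(D)) @ V

def self_attention(E, Wq, Wk, Wv):
    # project the L x D event embeddings into queries, keys, values
    return attention(E @ Wq, E @ Wk, E @ Wv)

L, D = 5, 16
rng = np.random.default_rng(1)
E = rng.standard_normal((L, D))
Wq, Wk, Wv = (rng.standard_normal((D, D)) for _ in range(3))
H = self_attention(E, Wq, Wk, Wv)            # L x D hidden representation
```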
For the three embeddings $E^{st}$, $E^t$, and $E^s$, which contain information of different aspects, we employ the above self-attention operation to generate the hidden spatial representation $h^s(\cdot)$, the temporal representation $h^t(\cdot)$, and the spatio-temporal representation $h^{st}(\cdot)$. As a result, the hidden representation $h_{i-1}$ in Figure 2 is the collection of these three representations.

Spatio-temporal Diffusion and Denoising Processes
Conditioned on the hidden representation $h_{i-1}$ generated by the encoder, we aim to learn a model of the spatio-temporal joint distribution of the future event. The learning of this distribution is built on the diffusion model [19], and the values of space and time are diffused and denoised for each event. Specifically, for each event $x_i = (t_i, s_i)$ in the sequence, where $t_i$ denotes the time interval since the last event, we model the diffusion process as a Markov process $(x_i^0, x_i^1, \ldots, x_i^K)$ over the spatial and temporal domains, where $K$ is the number of diffusion steps. From $x_i^0$ to $x_i^K$, we add small amounts of Gaussian noise step by step to the space and time values until they are corrupted into pure Gaussian noise. The process of adding noise is similar to the image setting, where noise is applied independently to each pixel [19]. We diffuse separately on the spatial and temporal domains:

$q(x_i^k \mid x_i^{k-1}) = \mathcal{N}\!\left(x_i^k; \sqrt{1 - \beta_k}\, x_i^{k-1}, \beta_k \mathbf{I}\right), \quad q(x_i^k \mid x_i^0) = \mathcal{N}\!\left(x_i^k; \sqrt{\bar{\alpha}_k}\, x_i^0, (1 - \bar{\alpha}_k)\mathbf{I}\right),$

where $\alpha_k = 1 - \beta_k$ and $\bar{\alpha}_k = \prod_{j=1}^{k} \alpha_j$. Conversely, we formulate the reconstruction of the point $x_i = (t_i, s_i)$ as reverse denoising iterations from $x_i^K$ to $x_i^0$ given the event history. In addition to the history representation $h_{i-1}$, the denoising of time and of space each also depends on the value of the other obtained in the previous step. The predicted values of the next step are modeled in a conditionally independent manner, which is formulated as follows:

$p_\theta(t_i^{k-1}, s_i^{k-1} \mid t_i^k, s_i^k, h_{i-1}) = p_\theta(t_i^{k-1} \mid t_i^k, s_i^k, h_{i-1})\; p_\theta(s_i^{k-1} \mid t_i^k, s_i^k, h_{i-1}).$

In this way, we disentangle the modeling of the spatio-temporal joint distribution into conditionally independent modeling, which enables effective and efficient modeling of the observed spatio-temporal distribution. The overall reverse denoising process is formulated as follows:

$p_\theta(x_i^{0:K}) = p(x_i^K) \prod_{k=1}^{K} p_\theta(x_i^{k-1} \mid x_i^k, h_{i-1}).$

For the continuous-space domain, the spatio-temporal distribution can be predicted by Equation (11). For the discrete-space domain, we add a rounding step at the end of the reverse process, $p_\theta(s_i \mid s_i^0)$, to convert the real-valued embedding $s_i^0$ to the discrete location ID $s_i$.

Training and Inference
Training. For a spatio-temporal point process, training optimizes the parameters $\theta$ that maximize the log-likelihood

$\max_\theta \sum_{i=1}^{L} \log p_\theta(t_i, s_i \mid \mathcal{H}_{t_i}),$

where $L$ is the number of events in the sequence. Based on a derivation similar to that in the preliminary section, we train the model with a simplified loss function for the $i$-th event and diffusion step $k$ as follows [19]:

$L_i^k = \mathbb{E}\left[\left\| \epsilon - \epsilon_\theta(x_i^k, k, h_{i-1}) \right\|^2\right], \quad \epsilon \sim \mathcal{N}(0, \mathbf{I}).$

Samples at each diffusion step $k$ for each event are included in the training set. We train the overall framework, consisting of the ST encoder and the ST diffusion model, in an end-to-end manner.
The pseudocode of the training procedure is shown in Algorithm 1.
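One stochastic training step of Algorithm 1 for a single event value can be sketched as follows, under the same assumptions as the earlier sketches (a NumPy noise schedule and a stand-in denoising network `eps_theta`):

```python
import numpy as np

def train_step(x0, h, eps_theta, alpha_bar, rng):
    # one stochastic step of the simplified objective for one event value x0
    # (the time interval or the location), conditioned on the history vector h
    K = len(alpha_bar)
    k = int(rng.integers(K))                      # sample a diffusion step k
    eps = rng.standard_normal(x0.shape)           # sample Gaussian noise
    xk = np.sqrt(alpha_bar[k]) * x0 + np.sqrt(1.0 - alpha_bar[k]) * eps
    loss = np.mean((eps_theta(xk, k, h) - eps) ** 2)  # MSE noise-matching loss
    return float(loss)
```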
Inference. To predict future spatio-temporal events with the trained DSTPP, we first obtain the hidden representation $h_{i-1}$ by applying the spatio-temporal self-attention encoder to the past $i-1$ events. Then, we predict the next event starting from Gaussian noise $t_i^K, s_i^K \sim \mathcal{N}(0, \mathbf{I})$ conditioned on $h_{i-1}$. Specifically, the reconstruction of $x_i^0$ from $x_i^K = (t_i^K, s_i^K)$ is formulated as follows:

$x_i^{k-1} = \frac{1}{\sqrt{\alpha_k}}\left(x_i^k - \frac{\beta_k}{\sqrt{1 - \bar{\alpha}_k}}\, \epsilon_\theta(x_i^k, k, h_{i-1})\right) + \sqrt{\beta_k}\, z, \quad (14)$

where the noise terms $z^t$ and $z^s$ for time and space are both sampled from a standard Gaussian distribution. $\epsilon_\theta$ is the trained reverse denoising network, which takes in the previous denoising result $x_i^k$, the hidden representation of the sequence history $h_{i-1}$, and the diffusion step $k$. Algorithm 2 presents the pseudocode of the sampling procedure.
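The sampling loop of Algorithm 2 repeatedly applies the update above from $k = K$ down to $k = 1$. A sketch under the same assumptions, with $\sigma_k^2 = \beta_k$ and a stand-in noise network:

```python
import numpy as np

def ddpm_sample(eps_theta, h, betas, alpha_bar, dim, rng):
    # reverse denoising from x_K ~ N(0, I) down to x_0, conditioned on h
    x = rng.standard_normal(dim)
    for k in range(len(betas) - 1, -1, -1):
        z = rng.standard_normal(dim) if k > 0 else np.zeros(dim)  # no noise at k=0
        coef = betas[k] / np.sqrt(1.0 - alpha_bar[k])
        mean = (x - coef * eps_theta(x, k, h)) / np.sqrt(1.0 - betas[k])
        x = mean + np.sqrt(betas[k]) * z
    return x
```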

Co-attention Denoising Network
We design a co-attention denoising network to capture the interdependence between the spatial and temporal domains, which facilitates the learning of spatio-temporal joint distributions. Specifically, it performs spatial and temporal attention simultaneously at each denoising step to capture fine-grained interactions. Figure 3 illustrates the detailed network architecture. Each step of the denoising process shares the same structure, which takes in the previously predicted values $t_i^{k+1}$ and $s_i^{k+1}$, and the denoising step $k$ with a positional encoding. Meanwhile, the network also integrates the hidden representation $h_{i-1}$ to achieve conditional denoising.
Temporal attention aims to generate a context vector by attending to certain parts of the temporal input and certain parts of the spatial input, and spatial attention does likewise. We calculate the mutual attention weights, i.e., $a^t$ and $a^s$, for space and time based on the condition $h_{i-1}$ and the current denoising step $k$, where $W^t$, $W^s$, $b^t$, and $b^s$ are learnable parameters. $a^t$ and $a^s$ measure the mutual dependence between time and space, which is influenced by the event history and the current denoising step.
Then we integrate the spatio-temporal condition and $x_i^{k+1}$ by feed-forward neural networks, where each layer performs a linear projection, with learnable weights $W \in \mathbb{R}^{D \times D}$ and biases $b \in \mathbb{R}^{D}$, followed by the ReLU activation $\sigma$. Finally, the outputs of spatial attention and temporal attention yield $\epsilon^t_{i,k}$ and $\epsilon^s_{i,k}$, the predicted noise at step $k$ for the $i$-th event. We can obtain the predicted values $t_i^k$ and $s_i^k$ at step $k$ according to Equation (14). The predicted values $t_i^k$ and $s_i^k$ are then fed into the denoising network again to iteratively move the prediction towards the clean values of space and time. In this way, the interdependence between time and space is captured adaptively and dynamically, facilitating the learning of the spatio-temporal joint distribution.
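The exact parameterization of the mutual weights is given in Figure 3; the following is a hypothetical minimal version for illustration, in which a linear layer over the condition and step embedding is followed by a softmax over the two domains, so that each domain's weights on (time, space) sum to one — matching the behavior analyzed in Section 4.3.

```python
import numpy as np

def coattention_weights(h, k_emb, Wt, bt, Ws, bs):
    # a_t = (weight on time, weight on space) for the temporal branch,
    # a_s likewise for the spatial branch; this parameterization is an assumption
    z = np.concatenate([h, k_emb])
    def softmax2(logits):
        e = np.exp(logits - logits.max())
        return e / e.sum()
    a_t = softmax2(Wt @ z + bt)
    a_s = softmax2(Ws @ z + bs)
    return a_t, a_s
```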

EXPERIMENTS
In this section, we perform experiments to answer the following research questions:
• RQ1: How does the proposed model perform compared with existing baseline approaches?
• RQ2: Is the joint modeling of the spatial and temporal dimensions effective for STPPs, and what does the spatio-temporal interdependence look like during the denoising process?
• RQ3: How does the total number of diffusion steps affect the performance?
• RQ4: How can we gain a deeper understanding of the reverse denoising diffusion process?

Experimental Setup
Datasets. We perform extensive experiments on synthetic and real-world datasets from the STPP literature. All datasets are obtained from open sources and contain up to thousands of spatio-temporal events. Spanning a wide range of fields, we use one synthetic dataset and three continuous-space real-world datasets: earthquakes in Japan, COVID-19 spread, bike sharing in New York City, and a simulated Hawkes Gaussian Mixture Model process [3]. Besides, we use a real-world dataset, Atlanta Crime Data, whose spatial locations are discrete neighborhoods. We briefly introduce them here; further details can be found in Appendix A.
(1) Earthquakes. Earthquakes in Japan with a magnitude of at least 2.5 from 1990 to 2020, recorded by the U.S. Geological Survey.
(2) COVID-19. Publicly released by The New York Times (2020), recording daily infected cases of COVID-19 in New Jersey state. We aggregate the data at the county level.
(3) Citibike. Bike sharing in New York City collected by a bike-sharing service. The start of each trip is considered as an event.
(4) HawkesGMM. This synthetic dataset uses a Gaussian Mixture Model to generate spatial locations. Events are sampled from a multivariate Hawkes process.
(5) Crime. Provided by the Atlanta Police Department, recording robbery crime events. Each event is associated with a time and a neighborhood.

Baselines.
To evaluate the performance of our proposed model, we compare it with commonly used methods and state-of-the-art models. The baselines can be divided into three groups: spatial baselines, temporal baselines, and spatio-temporal baselines. Previous methods commonly model the spatial and temporal domains separately, so spatial baselines and temporal baselines can be combined freely for STPPs. We summarize the three groups as follows:
• Spatial baselines: We use conditional kernel density estimation (Conditional KDE) [4], Continuous Normalizing Flow (CNF) [4], and Time-varying CNF (TVCNF) [4]. All three methods model continuous spatial distributions.
• Temporal baselines: We include commonly used TPP models.

Evaluation Metrics.
We evaluate the performance of models from two perspectives: likelihood and event prediction. We use the negative log-likelihood (NLL) as the metric, evaluating time and space separately. Although the exact likelihood cannot be obtained, we can write the variational lower bound (VLB) according to Equation (3) and use it as the NLL metric instead; thus, the performance on the exact likelihood is even better than the reported bound. The models' predictive ability for time and space is also important in practical applications [37]. Since time intervals are real-valued, we use a common metric, the Root Mean Square Error (RMSE), to evaluate time prediction. The spatial location can be defined in $d$-dimensional space, so we use the Euclidean distance to measure the spatial prediction error. We refer readers to Appendix C.1 for more details of the evaluation metrics.
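Concretely, the two prediction metrics can be computed as in the following straightforward NumPy sketch:

```python
import numpy as np

def rmse(t_true, t_pred):
    # root mean square error over real-valued time intervals
    t_true, t_pred = np.asarray(t_true), np.asarray(t_pred)
    return float(np.sqrt(np.mean((t_true - t_pred) ** 2)))

def mean_euclidean_error(s_true, s_pred):
    # mean Euclidean distance between true and predicted d-dim locations
    s_true, s_pred = np.asarray(s_true), np.asarray(s_pred)
    return float(np.mean(np.linalg.norm(s_true - s_pred, axis=-1)))
```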

Overall performance
(Appendix B provides more details of the used baselines.)
Figure 4: The performance of models on discrete-space datasets for both time and space of the next event.
Table 2 and Table 3 show the overall performance of the models on NLL and prediction, respectively. Figure 4 shows the prediction performance of the models in discrete-space scenarios. From these results, we have the following observations:
• Unreasonable parametric assumptions for point processes severely harm performance. The worst performance of the self-correcting process indicates that its assumption, that the occurrence of past events inhibits the occurrence of future events, does not match reality. On the contrary, the Hawkes process, which assumes that the occurrence of an event increases the probability of future events, outperforms the other classical models (Poisson and Self-correcting), with an obvious reduction of temporal NLL. Nevertheless, the self-exciting assumption can still fail in cases where previous events inhibit subsequent events. Therefore, classical models that rely on specific assumptions cannot cover all situations with different dynamics.
• It is necessary to capture the spatio-temporal interdependence. NSTPP models the dependence of space on time by $p(s \mid t)$,
and its performance on spatial metrics is improved compared with independent modeling (Conditional KDE, CNF, and Time-varying CNF). However, it does not outperform other TPP models in the temporal domain, suggesting that modeling the distribution $p(t \mid \mathcal{H}_t)$ without conditioning on the space $s$ fails to learn the temporal domain sufficiently.
• DSTPP achieves the best performance across multiple datasets. In the continuous-space scenarios regarding NLL, our model performs best on both the temporal and spatial domains. Compared with the second-best model, our model reduces the spatial NLL by over 20% on average. The performance on temporal NLL also shows significant improvement across the various datasets. In terms of predictive power, our model again achieves the best performance, with remarkable improvements over the second-best model. In addition, as Figure 4 shows, DSTPP delivers better predictive performance than other solutions in discrete-space scenarios. The flexible framework, which requires no parametric assumptions or Monte Carlo estimation, enables DSTPP to achieve this superior performance.

Analysis of Spatio-temporal Interdependence
To gain a deeper understanding of the spatio-temporal interdependence in the denoising process, we perform an in-depth analysis of the co-attention weights. Specifically, the analysis is conducted on two representative datasets: Earthquake and Synthetic-Independent, where the Earthquake dataset is highly spatio-temporally entangled, and the Synthetic-Independent dataset is completely spatio-temporally independent. Appendix A provides the generation details of the synthetic dataset. We use these two datasets to validate whether the designed co-attention mechanism can learn different interdependencies between time and space. At each step of the denoising process, we calculate the attention weights of the temporal and spatial dimensions on themselves and on each other. Figure 6 shows how the attention weights change as the denoising proceeds.
As shown in Figure 6(a), at the early stage, the temporal and spatial domains do not assign attention weights to each other, and the attention weights on themselves are close to one. At the final stage (step ≥ 150), the two domains start to assign attention weights to each other. In the end, for the temporal domain, the attention weights on time and space are approximately 0.83 and 0.17; for the spatial domain, the attention weights are close to evenly divided (0.52 and 0.48), suggesting that the spatial domain is more dependent on the temporal domain than vice versa. In the later stage of the denoising iterations, the model learns a distribution closer to the real one; thus, it is reasonable that the spatial and temporal domains assign more attention weight to each other. Figure 6(b) displays a different result: the two domains assign almost no attention weight to each other, indicating that the model has successfully learned the independent relationship. Together, Figures 6(a) and (b) validate the effectiveness of the co-attention mechanism, which can adaptively learn various interaction mechanisms between time and space.

Ablation Studies
Co-attention Mechanism. To examine the effectiveness of the co-attention mechanism, we degrade DSTPP into a base framework, DSTPP-Ind, which models the distributions of space and time independently in the denoising process. Specifically, we replace the conditionally dependent denoising distributions with independent ones, where space and time are not conditionally dependent on each other. Figure 5 shows the performance comparison of DSTPP and DSTPP-Ind in the continuous-space setting. We observe that DSTPP, trained with the joint modeling of time and space, performs consistently better than DSTPP-Ind with independent modeling. These results indicate the necessity of capturing the interdependence between time and space, and meanwhile validate the effectiveness of the spatio-temporal co-attention design. Due to the space limit, we leave further results to Appendix D.

Analysis of Reverse Diffusion Processes
To gain a deeper understanding of the denoising process, we visualize the spatial distribution during the reverse denoising iterations in Figure 7. At the beginning of the denoising process, the spatial distribution is Gaussian noise. With progressive denoising iterations, the distribution deforms gradually and becomes more concentrated. Finally, at the last step, the spatial distribution fits well with the ground-truth distribution. This indicates that DSTPP is able to learn the generative process of the spatial distribution successfully. Besides, the denoising process is not a linear change: the distribution changes during the last 50 steps are more significant than in the previous steps. Combined with the results in Section 4.3, where the interdependence between the spatial and temporal domains is effectively captured in the later stage, it is reasonable that the denoising effect improves significantly during this period.

RELATED WORK
Spatio-temporal Point Processes. Temporal point process models [9,28,32,63,68] can be directly used for STPPs, where the space is treated as the event marker. Kernel density estimation methods are also used to model continuous-space distributions in STPP models [2,3,22,35,45,65]. Most existing solutions follow an intensity-based paradigm, and their main challenge is choosing a good parametric form for the intensity function; there is a trade-off between the modeling capability of the intensity function and the cost of computing the log-likelihood. Some intensity-free models [38,47] have been proposed to tackle this problem; however, the probability density function is either unavailable [38] or still subject to certain model restrictions [47]. Another drawback of existing models is that they can model only either the continuous-space domain or the discrete-space domain, which largely limits their usability in real-world scenarios.
Recently, a line of advances has been developed for the generative modeling of point processes. For example, generative adversarial networks [8,57] are used to generate point processes in a likelihood-free manner. Reinforcement learning approaches [26,52] and variational autoencoders [31,39] have also been explored for the generative modeling of TPPs. Some works use noise-contrastive learning [16,33] instead of MLE. We are the first to learn point processes within the paradigm of diffusion models, which successfully addresses the limitations of existing solutions.

CONCLUSION
In this paper, we propose a novel framework to directly learn spatio-temporal joint distributions with no requirement for independence assumptions or Monte Carlo sampling, which addresses the structural shortcomings of existing solutions. The framework also possesses desirable properties such as easy training and closed-form sampling. Extensive experiments on diverse datasets demonstrate the advantages of our framework over state-of-the-art STPP models. As for future work, it is promising to apply our model to urban systems [25,60] as well as large-scale natural systems, such as climate change and ocean currents, which involve highly complex spatio-temporal data.

Figure 2: The overview of the proposed DSTPP framework.

Figure 3: Network architecture of the spatio-temporal co-attention mechanism. Each step in the denoising process shares the same network structure, with spatio-temporal hidden representations as conditions.

Figure 5: Ablation study on the joint spatio-temporal modeling. DSTPP-Ind denotes the degraded version of DSTPP, where the spatial and temporal domains are independent.

Figure 6: Spatial and temporal attention weights in the denoising iterations for two datasets with different spatio-temporal interdependence. Best viewed in color.

Figure 7: (panels: Earthquakes, COVID-19, HawkesGMM)
Figure 7: Visualization of the spatial distribution at different stages in the denoising process (the first five columns, in blue). The last column, in red, presents the real distribution. Starting from Gaussian noise, our DSTPP model gradually fits the spatial distribution of the ground truth. Best viewed in color.

Table 1: Comparison of the proposed model with other point process approaches regarding important properties.

Table 2: Performance evaluation for negative log-likelihood per event on test data. ↓ means lower is better. Bold denotes the best results and underline denotes the second-best results. (Table columns: Temporal ↓ and Spatial ↓ per dataset.) (1) Spatial baselines and temporal baselines can be combined freely for modeling the spatio-temporal domains.

Table 3: Performance evaluation for predicting both time and space of the next event. We use the Euclidean distance to measure the prediction error of the spatial domain and the RMSE between real and predicted intervals for time prediction.