Learning Dynamic Graphs from All Contextual Information for Accurate Point-of-Interest Visit Forecasting

Forecasting the number of visits to Points-of-Interest (POI) in an urban area is critical for planning and decision-making for various application domains, from urban planning and transportation management to public health and social studies. Although this forecasting problem can be formulated as a multivariate time-series forecasting task, the current approaches cannot fully exploit the ever-changing multi-context correlations among POIs. Therefore, we propose Busyness Graph Neural Network (BysGNN), a temporal graph neural network designed to learn and uncover the underlying multi-context correlations between POIs for accurate visit forecasting. Unlike other approaches where only time-series data is used to learn a dynamic graph, BysGNN utilizes all contextual information and time-series data to learn an accurate dynamic graph representation. By incorporating all contextual, temporal, and spatial signals, we observe a significant improvement in our forecasting accuracy over state-of-the-art forecasting models in our experiments with real-world datasets across the United States.


INTRODUCTION
Point-of-Interest (POI) data is a treasure trove of information, providing geographical locations, entity names, and types of places of interest, such as the Eiffel Tower (landmark) or a Starbucks coffee shop. Predicting the number of visitors to a POI at a specific time offers valuable insights into collective social and mobility behavior [19]. It has numerous practical applications, such as traffic flow analysis, epidemic spread prediction, and travel demand estimation (e.g., for ride-hailing apps). Take the prediction of epidemic spread as an example: during the COVID-19 pandemic, accurate predictions of the number of visits to grocery stores in a particular neighborhood could inform policies on store closures or visitor restrictions to control the spread of the virus (e.g., [5]).
Forecasting POI visiting patterns is a complex task due to the ever-changing nature of human mobility behavior. External factors such as rush hours, seasonal traffic fluctuations, weather, holidays, and planned or unexpected events, whether transient (such as a football game) or long-term (such as the COVID-19 pandemic), all contribute to this unpredictable behavior. One way to improve forecasting accuracy is to exploit the similarities between POIs, which is a non-trivial task. It involves identifying and effectively combining correlations from different signals such as past visiting patterns, geographical locations, semantic similarities (e.g., shared POI attributes such as cuisine types of restaurants), and taxonomic distances (e.g., similar visitation trends between categories such as "restaurant" and "bar"). These signals also tend to change over time, making it challenging to infer similarity between two POIs even when considering correlations based on past POI visit numbers.
While predicting future POI visit numbers can be formulated as a time-series forecasting task, there are several limitations to existing methods. Classical time-series forecasting methods like ARIMA [22] and VAR [21] rely on assumptions of linearity and stationarity, which do not hold in complex real-world scenarios and fail to capture long-term dependencies. Recently, deep learning has achieved impressive results in various tasks, such as image classification [17,18], natural language processing [36], and ensuring privacy and fairness [32,33]. However, typical deep learning models such as RNN-based approaches (e.g., LSTM [14] and GRU [7]), while capturing both short-term and long-term intra-time-series dependencies, cannot exploit relationships across time series. One way to capture these inter-time-series correlations is to conceptualize the problem as a graph, where nodes are POIs and edges (and edge weights) capture their interdependencies, and then apply Graph Neural Networks (GNNs) on the resulting graph. The challenge here is how to build a representative graph.
Toward this end, the GNN approaches can be divided into Static and Dynamic categories. The static GNNs build the graph once based on input feature vectors using predefined similarity measures, often derived from specific domain knowledge [16,31,41].
These models place less emphasis on graph construction, focusing instead on graph convolution techniques and processing methods. For example, DCRNN [16] builds a simple static graph of traffic sensors based on their road-network distances and then passes it to Graph Diffusion Convolution with a sequence-to-sequence architecture. The second category learns a dynamic graph representing the time-varying relationships between variables [4,37,38,43]. For instance, StemGNN models latent correlations between time-series windows to generate a time-varying graph on which it applies Spectral Graph Convolution [15], using Graph Fourier and Discrete Fourier transforms to capture time-series correlations.
For POI visit forecasting, the visit patterns of a single POI change over time for various reasons, which makes dynamic GNNs a suitable solution. However, in certain time windows, say during holidays or COVID, the visit patterns of two semantically (and/or geographically) distant POIs may look similar. Conversely, due to temporary events, e.g., remodeling, two similar POIs may have different visit patterns. Therefore, our goal is to build a comprehensive dynamic graph from all contextual information, rather than relying solely on time-varying correlations (dynamic) or predefined node similarities (static).
Consequently, our proposed Busyness Graph Neural Network (BysGNN) builds a dynamic graph by capturing POIs' spatial correlations, intra-series dependencies in individual visit patterns, inter-series dependencies across visit patterns, semantic similarity, and taxonomic proximity. This is achieved through a robust gated attention mechanism and an effective thresholding strategy. The gated attention mechanism integrates semantic-based and distance-based similarity matrices as a gate to determine the extent to which similarity scores between time series should be considered when generating the graph's adjacency matrix. Subsequently, the thresholding mechanism eliminates noisy relationships to improve the accuracy of the graph representation. Another challenge is that, while semantic similarities are not dynamic, they are task-dependent and unknown a priori. To address this, BysGNN uses a pre-trained language model to obtain initial semantic embeddings based on the textual descriptions of POIs and fine-tunes them on the forecasting task to arrive at accurate semantic similarities as the network trains.
We conducted extensive experiments with real-world datasets and compared BysGNN with some naive baselines and state-of-the-art static and dynamic GNNs. The results demonstrate the superiority of BysGNN in effectively building dynamic graphs incorporating information from various contextual and past visit signals. The experiments show that previous dynamic GNNs that rely solely on visit pattern similarity require many similar POIs to perform well. In contrast, static GNNs perform better with fewer POIs. Interestingly, a naive baseline outperforms static GNNs but falls short compared to dynamic GNNs and BysGNN, underscoring the significance of a dynamic graph structure. Our ablation study validates the positive impact of simultaneously considering the inter-time-series, semantic, spatial, and taxonomic similarities on forecasting accuracy, with semantics showing the most substantial improvement. Finally, we illustrate example cases where other dynamic GNNs consider two POIs to be similar (or different) due to visit pattern similarities (or dissimilarities) in a recent time window, resulting in an inaccurate adjacency matrix. In contrast, BysGNN's gated adjacency matrix shows a high influence from long-term semantic/geographical similarities (or dissimilarities), resulting in accurate forecasting.
The rest of the paper is organized as follows. We review the related work for time-series forecasting in Section 2. Section 3 formally defines the problem of forecasting POI visit numbers. We describe our BysGNN framework in Section 4. Finally, we present our experimental setup, datasets, and results in Section 5 and conclude the paper in Section 6.

RELATED WORK
Time-series modeling has long been a prominent area of academic research, leading to the development of a wide variety of forecasting methods. These methods can be broadly categorized into univariate and multivariate time-series techniques. Univariate techniques focus on analyzing single observations recorded sequentially without considering correlations between different time-series variables [20,27,29]. For instance, the ARIMA family of methods assumes a linear relationship, where predictions are weighted linear sums of past observed values. Salinas et al. propose a forecasting method based on autoregressive recurrent neural networks, which models the probability distribution of the variable in the future [29]. In contrast, multivariate time-series techniques aim to capture interactions and co-movements among a group of variables [1,24,40]. For example, Zerveas et al. present a novel framework based on a transformer encoder that extracts dense vector representations of multivariate time series [42].
Graph Neural Networks (GNNs) [13,30] have emerged as powerful machine learning models for modeling non-Euclidean data represented by graphs. In recent years, the application of GNNs in multivariate time-series forecasting has witnessed significant success across various domains. One notable application is DCRNN [16], which models traffic flow as a diffusion process on a directed graph, effectively capturing spatial dependence between sensors for traffic forecasting tasks. However, such approaches often rely on predefined correlations between components to pre-construct a graph, which remains fixed during training and testing. Another approach, HAGEN [38], introduces a graph convolutional recurrent network that dynamically captures crime correlations between regions and temporal crime dynamics for crime forecasting. Additionally, StemGNN [4] successfully captures inter-series correlations and temporal dependencies jointly for multivariate time-series forecasting.
Dynamic GNNs have been used extensively for epidemic forecasting. For example, COLA-GNN [10] proposes a novel GNN-based framework with a location-aware attention mechanism to capture spatiotemporal dependencies, enabling accurate long-term predictions. STAN [12] is another prediction framework that utilizes graph attention networks to incorporate interactions between similar locations, enhancing its accuracy in pandemic prediction. Moreover, CausalGNN [39] adopts an attention-based approach to learn a combined spatiotemporal and causal latent embedding from disease dynamics and epidemiological context, facilitating precise forecasting of daily new COVID-19 cases.
In contrast to most GNN-based frameworks that rely on recurrent neural networks to capture temporal dependencies, Choi et al. introduce a novel approach [8] that integrates neural controlled differential equations with graph convolution for spatiotemporal forecasting.
Despite the abundance of graph-based modeling approaches in time-series forecasting, these approaches cannot effectively fuse multiple contextual sources. This fusion is the focus of BysGNN, which can thus be considered an orthogonal approach to these models.

PROBLEM FORMULATION
This section provides the preliminaries and a formal definition of the problem of forecasting POI visit numbers.

Definition 3.1 (Multi-Context Correlations). Multi-context correlations refer to the latent relationships among POIs influenced by various contextual factors, such as time of day, day of the week, distance, and events. In POI visit number forecasting, these correlations include spatial (closeness of geographic locations), temporal (the changes in visit patterns over time and the dependencies between visit patterns of different POIs), semantic (similarity of POI attributes, such as POI types), and taxonomic (the general semantic categories of POIs, representing the high-level visiting trend) correlations. Leveraging these multi-context correlations can enhance the accuracy of forecasting frameworks for POI visits.

Definition 3.2 (Busyness Graph). We define a Busyness Graph network G = (V, E), where V is a set of |V| = N nodes, and each node corresponds to a specific POI (e.g., a restaurant) or a category of POIs (e.g., all restaurants). We denote A ∈ R^(N×N) as the adjacency matrix, where A_ij > 0 indicates that there exists an edge connecting node v_i to node v_j, and A_ij indicates the strength of this edge, i.e., the amount of influence that v_i has on the forecasts for v_j. The adjacency matrix A is dynamically updated based on the multi-context correlations and captures the most recent knowledge about the interactions between POI nodes.

Definition 3.3 (POI Visit Forecasting Problem). Given X = (x_1, ..., x_T) ∈ R^(N×T) as the input time series representing the hourly visit numbers to each POI for a time window of T steps, and U = (u_1, ..., u_N) ∈ R^(N×P) as the set of P attributes (e.g., category and name) of each POI, our goal is to generate and utilize the dynamic Busyness Graph G to find Y = (y_1, ..., y_H) ∈ R^(N×H), the future visit numbers for the next H time steps for each POI.
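To make the notation in Definition 3.3 concrete, the following sketch instantiates the input and output shapes with sizes matching our experimental setup (a 24-hour input window, a 6-hour horizon, and 40 POIs as in the small data regime); the variable names are illustrative only:

```python
import numpy as np

# Illustrative sizes: N POIs, T-step input window, P attributes, H-step horizon.
N, T, P, H = 40, 24, 5, 6

X = np.random.rand(N, T)             # input: hourly visit counts per POI
U = np.empty((N, P), dtype=object)   # P textual attributes (name, category, ...) per POI
Y = np.zeros((N, H))                 # target: visit counts for the next H steps
```

The forecasting task is then a mapping from (X, U) to Y, mediated by the dynamically generated Busyness Graph.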

BUSYNESS GRAPH NEURAL NETWORK

Overview
This section describes the BysGNN (Busyness Graph Neural Network) framework to tackle the problem of forecasting POI visiting numbers. For each forecasting horizon, BysGNN first generates a dynamic graph by exploiting all contextual information. The graph edges and their strengths represent the multi-context correlations between POIs. Then BysGNN uses the dynamic graph for accurate forecasting.
The overall architecture of BysGNN is presented in Figure 1 and consists of two main blocks. The first block, referred to as the BysGNN Graph Construction Block (sec 4.2), is responsible for building the dynamic Busyness Graph, which captures the multi-context correlations among POIs. The second block, called the GNN Block (sec 4.3), performs the convolution operation on the Busyness Graph and generates the node embeddings used to make forecasts.
The BysGNN Graph Construction Block starts by feeding the input time series to the Aggregated Data Generator module (sec 4.2.1), which generates new aggregated time-series data based on a predefined measure (POI taxonomy in our case) and adds them to the original input. This step allows the model to learn the taxonomic correlations in the following steps. The augmented time-series data are passed through the Intra-Series Correlation Layer (sec 4.2.2), which learns the temporal dependencies in individual time series (i.e., intra-series correlations) and generates temporal embeddings summarizing the time-series information for the given time window. The temporal embeddings are then passed to the Node Features Generation Layer (sec 4.2.3), which assigns a node to each individual time series and generates feature vectors for each node in the graph. We call the nodes corresponding to aggregated time series "meta-nodes" and the nodes corresponding to individual POIs "POI nodes." This layer first builds semantic embeddings based on POI attributes (such as categories and names) that allow the model to learn semantic similarities between nodes. The next step concatenates the semantic embeddings with the previous temporal embeddings to form the node feature vectors.
The next step involves passing the semantic and temporal embeddings into the Multi-Context Correlation Layer (sec 4.2.4). This layer is crucial as it generates the adjacency matrix of BysGNN's dynamic graph, which reflects the dependencies among POI nodes and meta-nodes across multiple contexts, including inter-series relationships across time series, the spatial proximity of POIs, and semantic similarities between POI nodes and meta-nodes. To ensure reliable inter-series correlation scores, BysGNN incorporates a gating mechanism that combines pairwise spatial and semantic similarity scores as a gate to control the flow of inter-series correlation scores. This mechanism enables the model to effectively use the inter-series similarity scores for forecasting by considering the spatial and semantic context between POIs.
This approach results in a more robust and sparser adjacency matrix, preventing over-smoothing by retaining strong relationships while reducing the impact of weak and noisy relationships. The layer also utilizes a case-amplification technique to threshold the adjacency matrix and remove noise.
After generating the node features and adjacency matrix in the previous layers, BysGNN creates the dynamic Busyness Graph. The Busyness Graph is then passed to the GNN Block for further processing, where a Graph Convolution layer is applied to obtain forecasts based on the final node embeddings.

Aggregated Data Generator

Specifically, given X ∈ R^(N×T) as the original time-series input (with N as the number of time series and T as the window size for each series), and C as the set of all input POI category types, this module generates |C| new time series by aggregating individual POI time series based on their categories, plus one additional time series aggregating all the time-series data together (the global visits time series), and adds them to the original input. As a result, the module output is X' ∈ R^((N+|C|+1)×T). The aggregation is defined as a function F_ag : R^(N'×T) → R^T (summation, in our case).
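The aggregation step can be sketched as follows (a minimal numpy illustration using summation as the aggregation function; the function and variable names are ours, not from our implementation):

```python
import numpy as np

def aggregate_by_category(X, categories):
    """Append one summed series per POI category plus one global series
    to the N x T input, yielding an (N + |C| + 1) x T array.
    categories: length-N list of category labels, one per row of X."""
    cats = sorted(set(categories))
    per_cat = [X[[i for i, c in enumerate(categories) if c == cat]].sum(axis=0)
               for cat in cats]
    global_series = X.sum(axis=0)        # "global visits" time series
    return np.vstack([X] + per_cat + [global_series])

# 4 POIs over a 24-hour window, two categories -> 4 + 2 + 1 = 7 series
X = np.ones((4, 24))
X_aug = aggregate_by_category(X, ["restaurant", "bar", "restaurant", "bar"])
```

Each meta-node series summarizes the high-level visiting trend of its category, which the downstream layers use to learn taxonomic correlations.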

Intra-Series Correlation Layer

This layer is responsible for learning the intra-series dependencies in individual time series and generating embedding vectors summarizing the time-series data for the given window. The layer receives the augmented time-series data X' ∈ R^((N+|C|+1)×T) as input and generates time-series embeddings Z ∈ R^((N+|C|+1)×d), where d represents the embedding dimension. The layer consists of N + |C| + 1 separate GRU weight matrices to learn the temporal intra-series dependencies of each time series independently. Figure 2 illustrates the process of generating these embeddings for each series. It is important to note that even though separate GRU units are used for different time series, the size of these GRU units is relatively small, and the number of parameters and training time will not be significantly different compared to using one big GRU unit.
Each time series x'_i = (x'_1, ..., x'_T) ∈ R^T holds the visiting numbers for a sequence of T time steps. First, BysGNN maps each time-series sequence to a higher-dimensional space by passing the data points in the series through two linear layers, obtaining X̃ = (x̃_1, ..., x̃_T) ∈ R^(T×d). Then, it feeds the resulting sequence (x̃_1, ..., x̃_T) to the i-th GRU unit and obtains the GRU states H_i = (h_1, ..., h_T) ∈ R^(T×d), with s_i = h_T being the output of the GRU unit.
Inspired by [26], instead of using only the GRU's output s_i as the summary of the window sequence for the i-th time series in X', which is prone to be affected by the final observations in the window more than it should, BysGNN utilizes a self-attention mechanism that assigns weights to the GRU states at different timestamps based on the importance of each state, paying more attention to the more important timestamps. Therefore, BysGNN passes all hidden states to a self-attention unit. To calculate the attention score vector a_i, containing the attention scores (a_1, ..., a_T) corresponding to each hidden state, BysGNN first concatenates each vector h_t, which contains the hidden state for timestamp t, with the final GRU state s_i. Then, the concatenated vector passes through a linear layer with W_a as the weight matrix, and a tanh activation is applied. Finally, to ensure that all attention scores remain between 0 and 1 with the attention score vector summing to 1, BysGNN passes the intermediate results through a softmax layer. BysGNN then calculates the weighted average ẑ of the hidden states using the attention score vector a_i. Finally, it adds s_i to ẑ and passes the result to a layer-normalization unit [2] to obtain the final temporal embedding z_i ∈ R^d for the i-th time series. BysGNN puts together the embeddings for all time series to obtain the temporal embedding matrix Z ∈ R^((N+|C|+1)×d).
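The attention-based pooling described above can be sketched as follows (a simplified single-series numpy illustration; the scoring-weight shape and the name `Wa` are our assumptions, and the two initial linear layers are omitted):

```python
import numpy as np

def temporal_embedding(H, Wa, eps=1e-5):
    """Attention-pooled summary of one series' GRU states.
    H: (T, d) hidden states; Wa: (2d,) scoring weights (hypothetical shape).
    Returns the d-dimensional temporal embedding."""
    T, d = H.shape
    s = H[-1]                                                  # final GRU state
    # score each [h_t ; s] pair with a linear layer + tanh
    scores = np.tanh(np.hstack([H, np.tile(s, (T, 1))]) @ Wa)  # (T,)
    e = np.exp(scores - scores.max())
    a = e / e.sum()                      # softmax: scores in [0, 1], sum to 1
    z_hat = a @ H                        # attention-weighted average of states
    out = s + z_hat                      # residual with the final state
    return (out - out.mean()) / np.sqrt(out.var() + eps)   # layer normalization
```

The layer-norm output has (approximately) zero mean and unit variance per embedding, keeping the pooled summaries on a comparable scale across series.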

Node Features Generation Layer

This layer is responsible for building the node features for BysGNN's dynamically generated graph by combining the embeddings for the time series and the semantics of POIs.
To achieve this, given u_i = (u_1, ..., u_P) ∈ R^P as the vector of P attributes for the i-th time series, BysGNN first generates a sentence describing the corresponding POI (or POI category if the time series is an aggregated one). The attributes used include POI names, addresses, working hours, phone numbers, and top and sub-categories. Sentence generation involves predefined templates for POIs and POI categories (for meta-nodes), which are filled automatically with the corresponding attributes. For example, the following sentence describes a POI node (with italic text indicating the specific attributes of that node): "This point of interest is Simon mall located at 5085 Westheimer Rd, Houston, TX, 77056. It is open for business during Monday - Friday: 10:00 - 19:00, Saturday 10:00 - 17:00, and closed on Sunday. It can be contacted by phone at (213)538-XXXX. The location belongs to the top-category Lessors of Real Estate, with the sub-category Malls." Similarly, the following description is generated for a meta-node: "This is the meta-node representing all the points of interest in Houston that belong to the top category Spectator Sports." Additionally, it is worth noting that the generation of sentence descriptions is not limited to our specific POI dataset. If necessary, sentence templates can be easily customized and populated with attributes from alternative POI datasets.
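The template-filling step can be sketched as follows (the function names are ours; the wording mirrors the example sentences above):

```python
def poi_sentence(name, address, hours, phone, top_cat, sub_cat):
    """Fill the POI-node template with the node's attributes."""
    return (f"This point of interest is {name} located at {address}. "
            f"It is open for business during {hours}. "
            f"It can be contacted by phone at {phone}. "
            f"The location belongs to the top-category {top_cat}, "
            f"with the sub-category {sub_cat}.")

def meta_sentence(city, top_cat):
    """Fill the meta-node template for an aggregated category series."""
    return (f"This is the meta-node representing all the points of interest "
            f"in {city} that belong to the top category {top_cat}.")
```

Swapping in a different POI dataset only requires adjusting these templates and the attribute fields they reference.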
Next, an intermediate embedding u'_i is obtained by tokenizing and passing the generated sentence through a pre-trained MPNet language model [35]. Although this language model is optimized to achieve state-of-the-art performance on semantic similarity tasks, it is not specifically trained to represent POI semantics accurately for the POI visit forecasting task. To address this limitation, BysGNN fine-tunes the intermediate embedding using a linear layer during training. The resulting final semantic embedding û_i ∈ R^q is tailored to capture time-series semantics and has a dimensionality of q.
The final feature vector v_i for the i-th time series is obtained by concatenating the temporal embedding z_i ∈ R^d from the output of the Intra-Series Correlation Layer with the semantic embedding û_i ∈ R^q. The Node Features Generation Layer block in Figure 1 shows this process. By combining the individual node feature vectors, the node feature matrix V = (v_1, ..., v_(N+|C|+1)) ∈ R^((N+|C|+1)×(d+q)) is obtained.

Multi-Context Correlation Layer

The Multi-Context Correlation Layer plays a crucial role in BysGNN's dynamic graph generation process. This layer is responsible for generating the adjacency matrix of the dynamic graph structure by capturing the multi-context dependencies among POI nodes and meta-nodes.
To achieve this, the layer takes in the semantic embeddings Û ∈ R^((N+|C|+1)×q) and the time-series embeddings Z ∈ R^((N+|C|+1)×d) from the previous layers, as well as the pairwise Euclidean distance matrix D_s ∈ R^(N×N) between POI nodes from the input. It then generates three similarity matrices: the semantic similarity matrix M_sem, the spatial similarity matrix M_sp, and the attention matrix M_att, each with a dimensionality of R^((N+|C|+1)×(N+|C|+1)).
To create the semantic similarity matrix M_sem, BysGNN calculates the pairwise cosine similarity scores between the semantic embedding vectors in Û. Cosine similarity is used as the similarity metric because it is normalized and effectively captures the degree of alignment between the semantic meanings of embedding vectors.
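The pairwise cosine similarity computation can be sketched as follows (a numpy illustration; the guard against zero-norm rows is our addition):

```python
import numpy as np

def cosine_similarity_matrix(U_hat):
    """Pairwise cosine similarity between the rows of an (n x q)
    semantic-embedding matrix, returning an (n x n) similarity matrix."""
    norms = np.linalg.norm(U_hat, axis=1, keepdims=True)
    Un = U_hat / np.clip(norms, 1e-12, None)   # row-normalize embeddings
    return Un @ Un.T                           # entries in [-1, 1]
```

Because the rows are unit-normalized first, the score depends only on embedding direction, not magnitude, which is why cosine similarity suits comparing semantic meanings.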
To generate the spatial similarity matrix M_sp, BysGNN first passes the pairwise Euclidean distance matrix D_s through a thresholded Gaussian kernel [34] to obtain D'_s ∈ R^(N×N):

D'_s[i, j] = exp(−dist(i, j)² / σ²) if exp(−dist(i, j)² / σ²) ≥ κ, and 0 otherwise,

where dist(i, j) is the Euclidean distance between the i-th and j-th POI nodes, σ is the standard deviation of the distances in D_s, and κ is a predefined threshold for sparsity. This process provides distance-similarity scores between POI nodes. As meta-nodes are not physical locations with geo-coordinates, BysGNN considers the distance-similarity score within meta-nodes and between meta-nodes and POI nodes to be 1. This ensures that the lack of geo-coordinates for meta-nodes does not impact the learned relationships between them and POI nodes. BysGNN builds the distance similarity matrix M_sp ∈ R^((N+|C|+1)×(N+|C|+1)) such that the first N elements of the first N columns are the same as the corresponding elements in D'_s and the rest are set to 1.

To create a similarity matrix for the time-series windows of different nodes, BysGNN utilizes multi-head attention, as outlined in [36], which generates an attention matrix M_att representing pairwise correlation scores between time-series embeddings. The temporal embedding vectors are passed to the multi-head attention unit and M_att is constructed using the following operations:

Q_m = Z W_m^Q,  K_m = Z W_m^K,  V_m = Z W_m^V,
A_m = softmax(Q_m K_m^T / √d),
M_att = (1/M) Σ_{m=1}^{M} A_m.

Here, the matrix Z is used as the matrix of keys K, queries Q, and values V simultaneously. M represents the number of attention heads, while W^Q, W^K, W^V, and W^O are learnable weight matrices. Moreover, d refers to the dimension of each temporal embedding z_i. The operations above yield the attention score matrix M_att.
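The spatial-similarity construction can be sketched as follows, assuming the standard thresholded-Gaussian-kernel form of [34] with entries below the sparsity threshold κ zeroed out (the exact kernel form is our assumption), and padding with 1s for the coordinate-less meta-nodes:

```python
import numpy as np

def spatial_similarity(D, kappa, n_meta):
    """D: (N, N) pairwise Euclidean distances between POI nodes.
    Returns the (N + n_meta) x (N + n_meta) spatial similarity matrix,
    with all meta-node entries fixed to 1."""
    sigma = D.std()                          # std of the distances in D
    Dp = np.exp(-(D ** 2) / (sigma ** 2))    # Gaussian kernel similarity
    Dp[Dp < kappa] = 0.0                     # sparsify weak links
    n = D.shape[0]
    M = np.ones((n + n_meta, n + n_meta))    # meta-node rows/cols set to 1
    M[:n, :n] = Dp
    return M
```

Fixing the meta-node entries to 1 leaves those edges fully controlled by the semantic and inter-series signals, since no geographic evidence exists for them.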
BysGNN then utilizes a gate that combines the weighted average of the semantic similarities (M_sem) and spatial similarities (M_sp) to control the flow of information in the time-series similarities (M_att) and create the un-thresholded adjacency matrix S. Specifically, the gate is formulated as follows:

G = α M_sem + (1 − α) M_sp,

where α is a learnable parameter between 0 and 1 that adjusts the balance between the impact of spatial and semantic similarities based on the specific POI visit forecasting task. The Hadamard product operator (⊙) is then applied to the gate and the time-series similarity matrix, resulting in the un-thresholded adjacency matrix S:

S = G ⊙ M_att.

BysGNN's gating mechanism helps to preserve strong long-term relationships and penalize noisy relationships between distant or semantically dissimilar nodes.
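The gating step can be sketched as follows (matrix names are ours):

```python
import numpy as np

def gated_adjacency(M_sem, M_sp, M_att, alpha):
    """Convex combination of semantic and spatial similarities gates the
    inter-series attention scores via an element-wise (Hadamard) product.
    alpha in [0, 1] is a learnable scalar in BysGNN."""
    gate = alpha * M_sem + (1.0 - alpha) * M_sp
    return gate * M_att
```

An edge survives with high weight only when the time-series attention agrees with at least one long-term signal (semantic or spatial), which is what dampens spurious short-window correlations.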
Finally, to filter out the remaining noisy relationships, BysGNN applies a thresholding step to the adjacency matrix S. This is achieved by first normalizing each row of the matrix and then transforming the values using a case-amplification power function to make it easier to differentiate between small and large values. This significantly reduces the impact of small values relative to larger ones. Next, a predefined threshold value τ is applied to the amplified matrix to obtain a binary mask. The resulting binary mask is then applied to the un-thresholded adjacency matrix S to obtain the final adjacency matrix Ŝ as follows:

Ŝ_ij = S_ij if (S_ij / max(S_i))^ρ > τ, and 0 otherwise,

where max(S_i) is the maximum value in the i-th row of S, ρ is the exponent of the case-amplification function, and τ is the predefined threshold value. BysGNN utilizes Ŝ ∈ R^((N+|C|+1)×(N+|C|+1)) as the adjacency matrix for the generated Busyness Graph. This process is illustrated in the Multi-Context Correlation Layer block of Figure 1.
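The thresholding step can be sketched as follows (a numpy illustration assuming non-negative entries in S; `rho` and `tau` stand for ρ and τ):

```python
import numpy as np

def threshold_adjacency(S, rho, tau):
    """Row-normalize S by its per-row maximum, amplify with exponent rho,
    and zero out entries whose amplified value does not exceed tau."""
    row_max = np.clip(S.max(axis=1, keepdims=True), 1e-12, None)
    mask = (S / row_max) ** rho > tau    # binary mask from amplified values
    return S * mask                      # surviving entries keep original weights
```

With rho > 1 the power function pushes small normalized scores toward zero while leaving scores near 1 almost unchanged, so a single threshold tau cleanly separates strong edges from noise.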

GNN Block
After completing every step in the BysGNN Graph Construction Block (as described in section 4.2), BysGNN constructs the Busyness Graph G(V, Ŝ) for the given time window by combining the node features (V) and the derived adjacency matrix (Ŝ). This dynamically generated graph has already captured the underlying multi-context correlations between POIs.
Next, BysGNN incorporates three Graph Convolutional Network (GCN) layers from [15] for message passing within the Busyness Graph. Using the selected convolutional layers f_GCN, the final node representations V' ∈ R^((N+|C|+1)×d*) are generated as follows:

V' = f_GCN(V, Ŝ),

where d* represents the node embedding dimension. BysGNN then passes the concatenation of V' and the original node features V through a fully connected linear layer to produce the final forecasts.
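For illustration, one message-passing layer in the common symmetrically-normalized (Kipf-and-Welling-style) form; note that [15] is a spectral graph convolution, so this is a simplified stand-in for the flavor of operation, not our exact operator:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution layer sketch.
    A: (n, n) adjacency, H: (n, f) node features, W: (f, f') weights."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    d = A_hat.sum(axis=1)
    D_is = np.diag(1.0 / np.sqrt(d))          # D^(-1/2)
    return np.maximum(D_is @ A_hat @ D_is @ H @ W, 0.0)   # normalize + ReLU
```

Stacking three such layers lets each node's embedding aggregate information from POIs up to three hops away in the Busyness Graph before the final linear forecasting head.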

EXPERIMENTS AND DISCUSSIONS
This section describes our experimental setup and methodology. Details related to the hardware and software setup, evaluation metrics, and hyper-parameters are available in Appendix A.1.

Data Description
To evaluate the accuracy of our model in forecasting hourly visitor numbers, we utilized the POI data and hourly visitation patterns datasets provided by SafeGraph [25], a commercial data provider that compiles its datasets using phone GPS locations and open government data. Our experiments were conducted in five different cities and involved two data regimes (large and small) that spanned the period from January 1, 2019, to February 4, 2020. In the large data regime, we considered hourly visit counts for the top 400 most visited POIs in each city, while the small data regime focused on the top 40 most visited POIs. This allowed us to examine the impact of the number of variables on the performance of our method and other baselines. For more detailed information about the data used in each experiment, refer to Table 4 in the appendix.

Baselines
We compare the performance of BysGNN with three baseline groups: Naive Baselines, Static Graph Neural Networks, and Dynamic Graph Neural Networks. The Naive Baselines include two simple statistical models: Naive Seasonal and Historical Average. The Static GNNs group consists of ConvGRU [31], ConvLSTM [31], DCRNN [16], and A3T-GCN [3] models, which operate on a predefined graph structure based on pairwise Euclidean distances between POIs. The Dynamic GNNs group includes StemGNN [4], a state-of-the-art technique for MTS forecasting that uses an attention mechanism to infer relationships between nodes. Further details on each baseline model can be found in Appendix A.1.1.

Experiment Results
We used input windows of 24 hours to train each GNN-based model to predict the number of visits for each POI for the next 6 hours for the datasets described in Table 4 of the appendix. The forecasting results for each dataset for both large (400 POIs in each city) and small (40 POIs in each city) data regimes are presented in Table 1.
Although our experiments included visitor data from five different cities, Table 1 presents results for only three cities due to space limitations. The results for the remaining two cities exhibit the same trend and are reported in Table 5 of the appendix.
The results of the large data regime presented in Table 1 (values to the left of the vertical line for each field) show that our proposed BysGNN model consistently outperforms all other baselines across all datasets, except for the MAPE value on the Los Angeles dataset. This demonstrates the superiority of our architecture for forecasting seasonal time-series data. Moreover, our Dynamic GNN models (StemGNN and BysGNN) demonstrate significantly lower error values compared to Static GNN models. This underscores the limitation of using a static graph with predefined relationships between variables, even when incorporating a sophisticated GNN Block. Although StemGNN performs the best among the baselines, our BysGNN model outperforms it in almost every instance with an error reduction of up to 12.3%, despite having a less complex GNN Block. This result validates our assumption that a well-designed Graph Construction Block would substantially enhance forecasting performance. Specifically, while StemGNN improved upon Static GNNs by constructing a dynamic graph solely based on the time-series windows, our BysGNN model took it a step further by introducing multi-context correlations that are more resilient to noise and yield a more precise depiction of the underlying relationships between variables.
The values on the right of the vertical line for each field in Table 1 present the results for the small data regime, where there are only 40 POIs in each dataset. BysGNN continues to demonstrate the best overall performance, outperforming all other GNN-based models, including StemGNN, by up to 19% error reduction in all cases except for RMSE for Los Angeles. It is worth noting that all models exhibit worse RMSE and MAE results in this data regime compared to the large data regime. This is because RMSE and MAE depend on the scale of the number of visits, and the average number of visits is significantly higher in the small data regime compared to the previous large data regime (see Table 4). In contrast, MAPE is not affected by the scale of the number of visits and provides a better measure to compare these two regimes. As shown in Table 1, MAPE in static GNNs improves significantly in the small regime compared to the large data regime, indicating that strong predefined assumptions about the relationships between variables work better when the number of variables is smaller. StemGNN, on the other hand, is the only model that performs worse in every case in the small data regime compared to the large data regime. This highlights a major drawback of previous Dynamic GNNs: since they rely solely on the similarity of visit patterns to build a dynamic graph representing relationships, they require a high number of variables (nodes) to uncover meaningful relationships. Consequently, they may fail to infer accurate inter-node relationships based on a limited number of time series, such as in the small data regime. We even observe that, in the case of Houston, a static GNN like DCRNN outperforms StemGNN in terms of MAPE. Conversely, BysGNN shows improved MAPE performance in all cases, underscoring the importance of accounting for multi-context correlations in graph construction.
Another interesting observation is the high effectiveness of the Naive Seasonal model, which outperforms most Static GNN models in both regimes. This is due to the highly seasonal nature of our visitation time-series datasets, where the weekly number of visits to most POIs remains relatively stable. Consequently, the Naive Seasonal model is a reasonable forecasting model that is hard to beat. GNN-based models only have access to the exact visit numbers during the last 24 hours in the input sequence, making it more difficult for them to outperform the Naive Seasonal model. However, StemGNN and BysGNN are able to beat the Naive Seasonal forecasts due to the robustness of their dynamic graphs, which account for the dynamic intra- and inter-time-series correlations at each window. This further highlights the significance of a flexible and expressive graph structure.

Ablation Study
Table 2 shows the evaluation results for each scenario. It clearly shows that all studied components contribute to an improvement in results, confirming our hypothesis that incorporating multi-context correlations to build a more expressive and robust dynamic graph has a significant impact on forecasting quality. Interestingly, we observed that removing semantic embeddings has the largest overall impact on the three combined performance metrics. This aligns with intuition, as similar types of POIs, such as restaurants and coffee shops, tend to experience similar visit patterns in real-world data, and BysGNN can effectively capture and utilize this semantic similarity. Moreover, while considering spatial correlations improves forecasting accuracy, the impact of considering semantics is greater than that of spatial correlations. This is again intuitive, as we would expect similar types of POIs, such as restaurants, to have similar visits regardless of their location in a city (e.g., high visits during lunch hours), while the visitation patterns of close-by POIs might not necessarily be as correlated (e.g., a gas station with a nearby restaurant).
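The role of semantic similarity can be illustrated with a toy sketch. All embedding values and names below are hypothetical, not BysGNN's actual features: the point is only that POIs with related categories receive a higher pairwise similarity score, which a graph-construction block can fold into the learned adjacency.

```python
import numpy as np

# Hypothetical POI category embeddings (illustrative values only).
emb = {
    "restaurant":  np.array([0.9, 0.1, 0.0]),
    "coffee_shop": np.array([0.8, 0.2, 0.1]),
    "gas_station": np.array([0.1, 0.9, 0.2]),
}

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Semantically related POI types (restaurant vs. coffee shop) score higher
# than unrelated ones (restaurant vs. gas station).
sim_related = cosine(emb["restaurant"], emb["coffee_shop"])
sim_unrelated = cosine(emb["restaurant"], emb["gas_station"])
```

In this sketch, `sim_related` is close to 1 while `sim_unrelated` is much smaller, mirroring the intuition that restaurants and coffee shops share visit patterns that gas stations do not.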
Although semantics have the highest impact on the combined evaluation metrics, meta-nodes seem to have the largest impact on RMSE alone. This indicates that capturing the multi-context dependencies between POI nodes and meta-nodes (taxonomic correlations) can significantly enhance the results. This is because, in POI visit datasets, individual POIs often follow a higher-level aggregated visit pattern. For instance, a single school would likely follow the visitation pattern of all schools at a higher level. Hence, meta-node patterns guide more precise learning of POI node patterns, and BysGNN can dynamically learn these relationships during training. Moreover, our hypothesis that learned attention scores based on visit patterns might be noisy and require the gated attention mechanism and a robust thresholding mechanism to diminish the impact of noisy patterns is supported by the substantial impact of adjacency thresholding on performance. Additionally, while the self-attention unit in the Intra-Series Correlation Layer improves the results, it has the least impact on MAPE values. This can be attributed to the fact that GRUs already exhibit a strong capability to capture intra-series dependencies.
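A minimal sketch of such noise suppression, assuming a simple power-based case amplification followed by a hard cutoff. The factor 2.5 and threshold 0.15 mirror the values reported in the appendix, but the function name and exact formulation inside BysGNN's gating/thresholding module are illustrative:

```python
import numpy as np

def threshold_adjacency(scores, amplification=2.5, tau=0.15):
    """Suppress noisy attention scores before using them as an adjacency.

    Hypothetical sketch: amplification shrinks already-small scores much
    faster than large ones, and the cutoff removes the residual weak edges.
    """
    A = np.clip(scores, 0.0, 1.0)
    A = A ** amplification   # case amplification: 0.9**2.5 ~ 0.77, 0.3**2.5 ~ 0.05
    A[A < tau] = 0.0         # hard cutoff prunes residual noisy edges
    return A

scores = np.array([[1.0, 0.9, 0.3],
                   [0.9, 1.0, 0.1],
                   [0.3, 0.1, 1.0]])
A = threshold_adjacency(scores)  # strong links survive, weak ones are zeroed
```

The design intent is that confidently high similarity scores pass through almost unchanged, while borderline scores (which are the most likely to be noise) are driven to zero rather than contributing spurious edges.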

Interpretation of Adjacency Matrix
This section compares BysGNN's dynamically generated graphs with 1) static graphs defined using spatial and semantic similarity and 2) dynamic graphs using attention on time-series windows, for their forecasting performance (Section 5.5.1) and impact on node embeddings (Section 5.5.2). For dynamic graphs, we use the graphs created on Wednesday, Jan 22, 2020, at 10:00:00 AM, representing a typical weekday scenario. To simplify visualization, we show the results from the top 40 most visited POIs in NYC from 9 POI categories, resulting in 10 meta-nodes (one for each of the 9 categories and 1 for all POIs, the Global meta-node). Table 3 shows that BysGNN's dynamically generated graphs from multiple contexts consistently outperformed the other graphs, indicating its effectiveness in capturing relevant temporal relationships while preserving static relationships among nodes. The dynamic graphs derived from temporal attention yielded the second-best performance, highlighting the effectiveness of dynamic GNNs in forecasting. Using spatial similarity performed worse than semantic similarity, especially in RMSE, suggesting that POI proximity alone is not a good indicator for capturing meaningful visiting pattern relationships. The following section will examine specific examples to showcase the node embeddings learned from these graphs.

Impact of Graph Types on Node Embeddings

Figure 3 illustrates the node embeddings obtained after the GNN Block, projected onto a 2D space using the UMAP [23] dimensionality reduction technique for each graph type in Table 3. Each data point represents a node or a meta-node, with two shapes (triangles and dots) representing the spatial embedding clusters (Figure 3a) and five colors indicating the semantic embedding clusters (Figure 3b). The two clusters of spatial embeddings correspond to the two boroughs of the 40 POIs: Manhattan and Queens. For instance, POIs 7 and 26 in Figure 3a, highlighted with purple arrows, represent "Red Mango" (a yogurt shop) and "42nd Street" (an iconic crosstown street), respectively. Despite being in Manhattan within a half-mile of each other, their visit patterns differ significantly, demonstrating that geospatial proximity does not always indicate similar visit patterns between POIs.
The five clusters of semantic embeddings exhibit oversmoothing [6], discarding minor variations in visiting patterns between POIs and potentially leading to similar forecasting results for POIs in the same cluster. For instance, two of the POIs fall into the same semantic cluster due to their shared POI category of "historical sites." However, they have substantially different visit trends, so relying solely on semantic similarities also fails to capture the relationship between their visit patterns. Figures 3c and 3d illustrate the node embeddings from the temporal attention and BysGNN graphs. In both embedding spaces, neighboring embeddings do not have consistent spatial and semantic similarity (neighbors have different colors and shapes). Also, BysGNN's embeddings form two clusters, while the node embeddings from the temporal attention graph do not exhibit clear cluster boundaries.
Consider nodes 33 and 38 (highlighted with red arrows), which are nearby in BysGNN's embedding space while far apart in the embedding space learned from the temporal attention graph. The recent visits window for these POIs is highlighted in yellow in Figure 4a, and the model aims to forecast the visits to the right of this highlighted area. Although the blue and red sequences in the highlighted window share a similar trend, the temporal attention graph assigns distant embeddings to these two nodes. This happens because the temporal attention mechanism is optimized for the entire sequence rather than for the subsequence in the highlighted window. In contrast, BysGNN effectively captures this similarity by considering additional contextual information. As this pair of POIs belongs to the same spatial cluster, BysGNN enhances the similarity score derived from temporal attention for this specific pair, resulting in close node embeddings.
On the other hand, let us focus on nodes 22 and 18, highlighted with blue arrows in Figures 3c and 3d. Here, the temporal attention graph would predict a similar visit pattern for the input visit windows depicted in Figure 4b, for the same reasons mentioned above. However, BysGNN considers additional contexts: these POIs belong to different spatial and semantic clusters. Consequently, BysGNN accurately determines that their future visit patterns should be dissimilar and learns distant node embeddings for them.
These findings underscore the shortcomings of relying exclusively on predefined spatial and semantic relationships, or only on dynamic time-series windows, to capture accurate visit pattern correlations. In contrast, BysGNN provides a robust solution by effectively considering all contexts.
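For concreteness, a purely temporal dynamic graph of the kind discussed above can be sketched as scaled dot-product attention over raw visit windows. This is a simplified stand-in, without the learned query/key projections of models like StemGNN and without the spatial, semantic, and taxonomic signals BysGNN fuses in:

```python
import numpy as np

def temporal_attention_adjacency(X):
    """Row-stochastic adjacency from attention over raw visit windows.

    X: (num_nodes, window) array of recent visit counts. Nodes whose
    windows look alike receive larger mutual edge weights.
    """
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)                  # pairwise window similarity
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=1, keepdims=True)        # softmax over neighbors

X = np.random.default_rng(0).random((5, 24))       # 5 POIs, 24-hour windows
A = temporal_attention_adjacency(X)                # (5, 5), rows sum to 1
```

Because the adjacency here depends only on the window contents, two POIs with coincidentally similar recent windows get a strong edge regardless of how different their locations or categories are, which is exactly the failure mode the multi-context corrections address.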

CONCLUSION
This work presents BysGNN, a dynamic graph neural network specifically designed to uncover multi-context correlations among POIs for accurate visit forecasting. Using various sources of information, including geographic information, visit numbers, semantics, and taxonomic information of POIs, BysGNN learns an accurate dynamic graph representation that is then passed to a simple GNN block for forecasting. Our experiments on real-world datasets across the United States demonstrate the superiority of BysGNN over state-of-the-art forecasting models, including those using highly sophisticated GNN blocks. In future work, we plan to apply BysGNN to other datasets with similar underlying multi-context correlations, such as health data.

A APPENDIX

A.1 Experiment Details
A.1.1 Baselines. We compare the performance of BysGNN with three groups of baselines:
• Naive Baselines: We first compare our model with two simple statistical baselines that provide relatively accurate results when the visitation time series is highly seasonal. (1) Naive Seasonal: We use the number of visits on the same day/time from the previous week as the prediction values for the same day/time of the current week. (2) Historical Average: We take the average of the number of visits for the same day of the week during the previous month as the prediction, similar to Google Maps' popular times graph [11].
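Assuming an hourly series with a weekly season (index conventions here are ours, for illustration), the two naive baselines can be sketched as:

```python
import numpy as np

def naive_seasonal(series, horizon, season=7 * 24):
    """Forecast each step with the value from the same hour one week earlier."""
    n = len(series)
    return np.array([series[n - season + h] for h in range(horizon)])

def historical_average(series, horizon, season=7 * 24, weeks=4):
    """Average the same hour-of-week over the previous `weeks` weeks."""
    n = len(series)
    return np.array([
        np.mean([series[n - w * season + h] for w in range(1, weeks + 1)])
        for h in range(horizon)
    ])

# On a perfectly weekly-periodic hourly series, both baselines are exact,
# which is why they are hard to beat on highly seasonal visit data.
series = np.tile(np.arange(168.0), 5)  # 5 identical weeks
```

The better the weekly periodicity of a POI's visits, the closer these baselines get to zero error, matching the observation in Section 5 that the Naive Seasonal model outperforms many GNN baselines.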

4.2.1 Aggregated Data Generator. This layer generates and adds new aggregated time series based on POI types (categories) to the original time-series data, allowing the model to learn taxonomic correlations. For example, POIs in the Gas Station category might follow a similar visit pattern. At the same time, the visit pattern of POIs in the Restaurant category could also be similar to the aggregated visiting patterns of POIs in the Gas Station category. On top of this, the degree to which the POIs in the same category follow the same pattern as the aggregated pattern of that category differs vastly between different POIs. As a result, the taxonomic correlation that BysGNN defines consists of correlations between patterns at the same aggregation level and correlations between visiting patterns from different levels of aggregation. In the POI visit forecasting problem, BysGNN considers three aggregation levels, corresponding to individual POIs' visit patterns, POI categories' visit patterns, and the Global visits pattern (an additional time series generated by aggregating the visits to all POIs). Adding these aggregated time series allows BysGNN to learn such taxonomic correlations in the next step. Specifically, given X ∈ R^(N×T) as the original time-series input (with N as the number of time series and T as the window size for each series), and C as the set of all input POI category types, this module generates |C| new time series by aggregating individual POI time series based on their categories, plus one additional time series aggregating all the time-series data together (the Global visits time series), and adds them to the original input. As a result, the module output is X′ ∈ R^((N+|C|+1)×T). The aggregation is defined as a function f_agg : R^(N′×T) → R^T (summation, in our case).

Figure 1: BysGNN Overall Framework
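A minimal sketch of this layer, assuming summation as the aggregation function (variable and function names are ours, not BysGNN's actual API):

```python
import numpy as np

def add_aggregated_series(X, categories):
    """Append per-category sums and a Global sum to the (N, T) input,
    yielding the (N + |C| + 1, T) output described above."""
    cats = sorted(set(categories))
    per_cat = np.array([
        X[[i for i, c in enumerate(categories) if c == cat]].sum(axis=0)
        for cat in cats
    ])
    global_series = X.sum(axis=0)              # Global visits meta-node series
    return np.vstack([X, per_cat, global_series])

X = np.arange(12.0).reshape(4, 3)              # 4 POIs, window of length 3
out = add_aggregated_series(X, ["restaurant", "restaurant", "school", "school"])
# out has 4 POI rows + 2 category rows + 1 Global row = 7 rows
```

The appended rows become the meta-nodes whose taxonomic correlations with individual POI nodes are learned in the subsequent layers.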

Figure 2: Intra-Series Correlation Layer

We created five variations of our proposed BysGNN model to understand the effectiveness of different BysGNN components: (1) w.o Semantics: BysGNN without utilizing POI semantics in node features and semantic similarities in the adjacency matrix; (2) w.o Space: BysGNN without utilizing spatial correlations and distances between POIs in the Multi-Context Correlations Layer; (3) w.o Meta-Nodes: BysGNN without the Aggregated Data Generator module; (4) w.o Self-Attention: BysGNN without the proposed self-attention mechanism in the Intra-Series Correlation Layer; (5) w.o Adj-Thresholding: BysGNN without applying thresholding to the output of the gating mechanism for the construction of the adjacency matrix.

Figure 3: Visualization of Node Embeddings for Different Graphs Obtained After the GNN Block. The embeddings are projected to the 2D space using the UMAP dimension reduction technique. Each data point represents a node or meta-node, with semantic embedding clusters depicted by five distinct colors and spatial embedding clusters indicated by two different shapes (triangles and circles). The numbers atop the points correspond to the node indices.

Figure 4: Visits Time Series for Two Pairs of POIs Within and Beyond the Observed Input Window (Highlighted in Yellow)

• RMSE: The square root of the average of the squared differences between the ground truth and the predicted values.

A.1.4 Hyperparameter Configuration. The hyperparameters used in our model were carefully chosen using cross-validation to optimize performance. Here is a summary of the key configurations:
• Dataset Split: We divided the dataset into three parts, allocating 70% for training, 20% for validation, and 10% for testing. The respective time periods for each set were 280 days, 80 days, and 40 days.
• Data Normalization: Z-score normalization was applied to ensure standardized input data.
• Training: The model was trained using the RMSProp optimizer with an initial learning rate of 0.001. The learning rate decayed by a factor of 0.2 every 10 epochs. We trained the model for 40 epochs with a batch size of 32.
• Adjacency Thresholding: A case amplification factor of 2.5 was employed to enhance the performance of the adjacency thresholding module. The threshold value was set to 0.15, determining the cutoff for adjacency values.
• Embedding Dimensions: The temporal embedding dimension was set to 128, and the semantics embedding dimension was set to 168.
• Attention Heads: The Inter-Series Correlation Layer used a Multi-Head Attention mechanism with 8 heads.
• GNN Node Embedding: The node embedding dimension for graph convolution in the GNN Block was set to 32.
• Gaussian Kernel Threshold: The threshold value for the Gaussian kernel was determined as twice the standard deviation of distances between POIs, varying depending on the specific dataset used in each experiment.
All hyperparameters were kept consistent across all experiments except for the Gaussian kernel threshold.
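The thresholded Gaussian kernel referenced above (with its cutoff at twice the standard deviation of pairwise POI distances) can be sketched as follows; the function name and coordinate format are illustrative, not the exact implementation:

```python
import numpy as np

def gaussian_kernel_adjacency(coords):
    """Static adjacency from pairwise Euclidean distances between POIs.

    Edge weights decay with distance via a Gaussian kernel; POI pairs
    farther apart than twice the standard deviation of all pairwise
    distances are pruned entirely.
    """
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))       # (N, N) distance matrix
    sigma = dist.std()
    A = np.exp(-(dist ** 2) / (2 * sigma ** 2))    # Gaussian kernel weights
    A[dist > 2 * sigma] = 0.0                      # prune distant POI pairs
    return A

coords = np.random.default_rng(1).random((6, 2))   # 6 POIs in 2D
A = gaussian_kernel_adjacency(coords)              # symmetric, ones on diagonal
```

Tying the cutoff to the distance distribution of each dataset (rather than a fixed radius) is what makes this the one hyperparameter that varies across experiments.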

Table 2: Ablation Study Results for the Houston Dataset. The percentage of relative error change for each variant compared to the original BysGNN is listed below the actual error value. The highest percentage of error change for each evaluation metric is highlighted in bold.

Table 3: Error Results for Different Adjacency Matrices (Impact of Graph Types on Forecasting Performance)

• Static Graph Neural Networks: These GNN-based models operate on a predefined graph structure and, therefore, require prior knowledge of the graph topology. For our experiments, we construct a static graph based on the pairwise Euclidean distance between different POIs. (1) ConvGRU and (2) ConvLSTM [31]: these models combine a GRU and an LSTM, respectively, with ChebNet [9] to make spatiotemporal forecasts. (3) DCRNN [16]: This model adopts an encoder-decoder framework and proposes a diffusion convolutional layer to capture the spectral and temporal dependencies. (4) A3T-GCN [3]: This model captures the global temporal dynamics and spatial correlations given a graph structure. Moreover, it introduces an attention mechanism to adjust the importance of different time points and boost forecasting accuracy.
• Dynamic Graph Neural Networks: This family of GNN-based models does not require a predefined graph structure but instead uses an attention mechanism to infer relationships between nodes. (1) StemGNN: This method [4] combines the Graph Fourier Transform and Discrete Fourier Transform to learn the correlations among the time series of different nodes. This is a state-of-the-art (SOTA) technique for MTS forecasting.

A.1.2 Hardware and Software Setup. Our experiments were performed on a cluster node equipped with an 18-core Intel i9-9980XE CPU, 125 GB of memory, and two 11 GB NVIDIA GeForce RTX 2080 Ti GPUs. Furthermore, all neural network models are implemented based on PyTorch version 1.13.0 with CUDA 11.7 using Python version 3.10.8. We also implemented the GNN-based baselines (with the exception of StemGNN) using the PyTorch Geometric Temporal library [28].

A.1.3 Evaluation Metrics. Since we modeled the problem of forecasting the number of visits to POIs as a time-series forecasting task, we evaluate our prediction performance by comparing the average of Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and Root Mean Squared Error (RMSE) over the prediction horizon. Smaller values for these metrics indicate better forecasting performance. Here is the description of each metric:
• MAE: The average of the absolute difference between the ground truth and the predicted values.
• MAPE: The percentage equivalent of MAE.
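These metrics can be written out directly. A minimal sketch with made-up values (in our evaluation each metric is averaged over the prediction horizon):

```python
import numpy as np

def mae(y, yhat):
    """Mean Absolute Error: average absolute deviation from the ground truth."""
    return float(np.mean(np.abs(y - yhat)))

def mape(y, yhat):
    """Mean Absolute Percentage Error: scale-free counterpart of MAE."""
    return float(np.mean(np.abs((y - yhat) / y)) * 100.0)

def rmse(y, yhat):
    """Root Mean Squared Error: penalizes large deviations more than MAE."""
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

y = np.array([100.0, 200.0, 400.0])      # ground-truth visit counts (made up)
yhat = np.array([110.0, 180.0, 400.0])   # forecasts (made up)
```

Because MAPE divides each error by the ground-truth count, it is the only one of the three that is comparable across the large and small data regimes, as discussed in Section 5.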