Networked Time Series Imputation via Position-aware Graph Enhanced Variational Autoencoders

Multivariate time series (MTS) imputation is a widely studied problem in recent years. Existing methods can be divided into two main groups, including (1) deep recurrent or generative models that primarily focus on time series features, and (2) graph neural networks (GNNs) based models that utilize the topological information from the inherent graph structure of MTS as relational inductive bias for imputation. Nevertheless, these methods either neglect topological information or assume the graph structure is fixed and accurately known. Thus, they fail to fully utilize the graph dynamics for precise imputation in more challenging MTS data such as networked time series (NTS), where the underlying graph is constantly changing and might have missing edges. In this paper, we propose a novel approach to overcome these limitations. First, we define the problem of imputation over NTS which contains missing values in both node time series features and graph structures. Then, we design a new model named PoGeVon which leverages variational autoencoder (VAE) to predict missing values over both node time series features and graph structures. In particular, we propose a new node position embedding based on random walk with restart (RWR) in the encoder with provable higher expressive power compared with message-passing based graph neural networks (GNNs). We further design a decoder with 3-stage predictions from the perspective of multi-task learning to impute missing values in both time series and graph structures reciprocally. Experiment results demonstrate the effectiveness of our model over baselines.


INTRODUCTION
Multivariate time series (MTS) data are common in many real-world applications, such as stock prediction [13,71], traffic forecasting [43,79,80] and pandemic analysis [31,52]. However, these data are often incomplete and contain missing values due to reasons such as market close or monitoring sensor/system failure. Predicting the missing values, which is referred to as the MTS imputation task, plays an important role in these real-world applications.
Recently, a large number of approaches have emerged for MTS imputation [17] in the literature. To name a few, BRITS [5] is built upon bidirectional recurrent modules, and GAIN [77] is one of the earliest works that use adversarial training for the task. However, many of them ignore the available relational information within the data and are thus less effective at predicting missing values compared to those considering both spatial and temporal information. In order to tackle this problem, some recent works utilize GNNs or other similar algorithms to assist the imputation over MTS data. GRIN [11] adopts a bidirectional recurrent model based on message passing neural networks [22]. It performs a one-step propagation of the hidden representations on the graph to capture the spatial dependencies within the MTS data. SPIN [48] is a follow-up method which solves the error accumulation problem of GRIN in highly sparse data. It introduces a new attention mechanism to capture spatial-temporal information through inter-node and intra-node attentions. By stacking several attention blocks, the model simulates a diffusion process and can handle data with high missing rates. Recently, NET³ [29] generalizes the setting and studies tensor time series data in which the underlying graph contains multiple modes. The authors utilize a tensor graph convolution network (GCN) and a tensor recurrent neural network (RNN) to handle the tensor graphs and time series respectively.
Figure 1: An illustrative example of an interaction network during the COVID-19 pandemic where some patients' infection status might not be available and we have no access to whom these people interact with, which represents a networked time series (NTS) with both missing node features and missing edges.

Despite the strong empirical performance of these methods on the MTS imputation problem, they rely on the assumption that the underlying graph is fixed and accurately known. However, the graph structure of an MTS may constantly change over time in real-world scenarios. Take epidemiological studies as an example: during the evolution of a pandemic, individuals such as human beings or animals may move around, and thus the graph that models the spread of disease is dynamic. In the literature, such time series data are referred to as networked time series (NTS)¹ [29]. Given the nature of NTS data, the missing components can occur in both the node features and the graph structures (see an example in Figure 1), which makes NTS imputation an essentially harder problem compared to MTS imputation.
In this paper, we first formally define the problem of NTS imputation. We point out that the key challenges of this problem are twofold: (1) the graph that lies behind the time series data is evolving constantly and contains missing edges; therefore, algorithms should capture the graph dynamics and at the same time be able to restore the lost structures. (2) The node feature time series also contains missing values, which requires the model to solve a general MTS imputation problem as well. To address these challenges, we formulate NTS imputation as a multi-task learning problem and propose a novel model named PoGeVon based on variational autoencoder (VAE) [35]. Our proposed model consists of two parts: a recurrent encoder with node position embeddings based on random walk with restart (RWR) [60], and a decoder with 3-stage predictions. The global and local structural information obtained from RWR with respect to a set of anchor nodes provides useful node representations. Moreover, the 3-stage prediction module in the decoder is designed to impute missing features in the time series and the graph structures reciprocally: the first-stage prediction fills the missing values for node features and is then used for the imputation over graph structures during the second stage; in return, the predicted graph structures are used in the third stage for node feature imputation. Finally, we replicate the VAE model in PoGeVon to handle the bidirectional dynamics in the NTS data. The main contributions of this paper can be summarized as:

• Problem Definition. To our best knowledge, we are the first to study the joint problem of MTS imputation and graph imputation over networked time series data.
• Novel Algorithm and Analysis. We propose a novel imputation model based on VAE, which consists of an encoder with RWR-based node position embeddings, and a decoder with 3-stage predictions. We provide theoretical analysis of the expressive power of our position embeddings compared with message-passing based temporal GNNs, as well as the benefit of the multi-task learning approach for the NTS imputation problem from the perspective of the information bottleneck.

• Empirical Evaluations. We demonstrate the effectiveness of our method by outperforming powerful baselines for both MTS imputation and link prediction tasks on various real-world datasets.

The rest of the paper is organized as follows. Section 2 defines the imputation problem over NTS data. Section 3 presents the proposed PoGeVon model. Section 4 shows the experiment results. Related works and conclusions are given in Section 5 and Section 6 respectively. Table 1 lists the main symbols and notations used throughout this paper. Calligraphic letters denote tensors or graphs (e.g., X, G), bold uppercase letters are used for matrices (e.g., A), and bold lowercase letters are for vectors (e.g., v). Uppercase letters (e.g., N) are used for scalars, and lowercase letters (e.g., i) are for indices. For matrices, we use A[i, j] to denote the value at the i-th row and the j-th column.

PROBLEM DEFINITION
We first present some necessary preliminaries and then introduce the networked time series imputation problem in this section.

Definition 2.1 (Multivariate Time Series (MTS)). A multivariate time series X ∈ R^{T×N×D} is a sequence of observations {X_1, X_2, ..., X_T}, where each observation X_t ∈ R^{N×D} is a slice of X at time step t that contains N entities with D features.

Problem 1 (NTS Imputation). Given: a partially observed NTS with graph sequence G̃(Ã, X̃) = {G̃_1, G̃_2, ..., G̃_T}; Output: the predicted graph adjacency tensor A and the tensor X of node feature time series.
Note. For clarity, we use node features and node time series interchangeably, and similarly for graph adjacency imputation and missing edge/link prediction.
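To make the problem setting concrete, the following minimal sketch (shapes, names, and missing rates are illustrative assumptions, not the paper's code) builds a partially observed NTS: a node feature tensor, a dynamic adjacency tensor, and a binary observation mask for each.

```python
import numpy as np

# T time steps, N nodes, D features per node (toy sizes).
T, N, D = 4, 5, 3
rng = np.random.default_rng(0)

X = rng.normal(size=(T, N, D))                         # node feature time series, X in R^{T x N x D}
A = (rng.random(size=(T, N, N)) < 0.4).astype(float)   # dynamic adjacency tensor, one graph per step

# Binary observation masks: 1 = observed, 0 = missing.
M = (rng.random(size=(T, N, D)) < 0.75).astype(float)  # feature mask
M_A = (rng.random(size=(T, N, N)) < 0.6).astype(float) # edge mask

# The partially observed tensors X~ and A~ fed to an imputation model.
X_obs = np.where(M == 1, X, np.nan)
A_obs = np.where(M_A == 1, A, np.nan)

print(X_obs.shape, A_obs.shape)  # (4, 5, 3) (4, 5, 5)
```

The NTS imputation task is then to predict the entries of X and A wherever the corresponding mask is 0.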

METHODOLOGY
In this section, we introduce our model, named Position-aware Graph Enhanced Variational Autoencoders (PoGeVon), in detail. In order to predict the missing values in both the node features and the graph structures, we design a novel variational autoencoder (VAE), whose detailed architecture is shown in Figure 3. It consists of an encoder with node position embeddings based on random walk with restart (RWR), and a decoder with 3-stage predictions. We then replicate the VAE to handle bidirectional dynamics. We start in Subsection 3.1 by discussing the multi-task learning setting of the NTS imputation problem, analyzing its mutual benefit and its implications for the encoder/decoder design. Then, we present the details of the proposed encoder (Subsection 3.2) and decoder (Subsection 3.3), followed by the training objective in Subsection 3.4 as well as the complexity analysis in Subsection 3.5.

Multi-Task Learning Framework
Because of the potential mutual benefit of predicting missing node features and edges, it is natural to formulate the NTS imputation as a multi-task learning problem which consists of the imputation task for node time series and the link prediction task for graph structures.
Let us analyze the benefit of modeling NTS imputation as a multi-task learning problem from the perspective of the information bottleneck in unsupervised representation learning [1,59], and formulate the objective of NTS imputation as:

max_Z I(Ã, X̃; Z) − β I(Z; G̃_{t:t+Δt})    (1)

where Z is the latent representation, I(•; •) is the mutual information, G̃_{t:t+Δt} is the data sample which represents a sliding window of NTS data, and β is the Lagrange multiplier. This formulation closely relates to the objective of a β-VAE [1,25]. Here, the second term I(Z; G̃_{t:t+Δt}) in Eq. (1) constrains the amount of identity information of each data sample that can transmit through the latent representation Z. In β-VAE, this is upper bounded by minimizing the Kullback-Leibler divergence β · KL[q_φ(Z|G̃_{t:t+Δt}) ∥ p(Z)] [3]. The first term I(Ã, X̃; Z) in Eq. (1) represents the reconstruction task of VAE, which can be decomposed as [28]:

I(Ã, X̃; Z) = I(Ã; Z) + I(X̃; Z) − I(Ã; X̃; Z)    (2)

where I(Ã, X̃; Z) represents the mutual information between the partially observed NTS G̃ (i.e., the joint distribution of Ã and X̃) and Z, while I(Ã; X̃; Z) is the high-order mutual information [28,49], which measures the shared information among multiple different random variables (i.e., Ã, X̃, and Z). It is worth noting that when Ã and X̃ are independent of each other (even given Z), we have:

I(Ã, X̃; Z) = I(Ã; Z) + I(X̃; Z) = H(Ã) + H(X̃) − H(Ã|Z) − H(X̃|Z)    (3)

where H(•) is the entropy. Compared with Eq. (2), it is clear that I(Ã; X̃; Z) now equals 0. Under such circumstances, i.e., when no correlation exists between the features of any adjacent node pairs, the objective in Eq. (1) reduces to modeling the time series features and the graph structures independently. However, in reality, this is often not the case. Figure 2 demonstrates an illustrative example from the AQ36 dataset [82], in which the NTS imputation problem occurs when monitor stations fail due to system errors and lose data as well as connections with each other. To maximize Eq. (2), we further decompose I(Ã; X̃; Z):

I(Ã; X̃; Z) = I(X̃; Z) − I(X̃; Z|Ã) = I(Ã; Z) − I(Ã; Z|X̃)    (4)

where the second equation holds because of symmetry [75]. Combining Eq. (2) and Eq. (4), we can derive the objective term for the decoder as:

I(Ã, X̃; Z) = ½ [I(Ã; Z) + I(X̃; Z)] + ½ [I(X̃; Z|Ã) + I(Ã; Z|X̃)]    (5)

where the first two terms can be bounded by the objective for the VAE decoder as in [1]. The last two terms represent the objective of a conditional VAE (CVAE), since I(X̃; Z|Ã) = H(X̃|Ã) − H(X̃|Ã, Z). The first term H(X̃|Ã) on the right-hand side can be dropped because it is independent of our model, and maximizing the second term −H(X̃|Ã, Z) is essentially the same as optimizing the decoder of a CVAE with objective max log p(X̃|Ã, Z). A similar analysis applies to I(Ã; Z|X̃). Eq. (5) provides an important insight: we can use Ã and X̃ as the conditions for each other's predictions, since imputation over one of them might be instructive for the other.
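The symmetry used in Eq. (4), i.e., that the high-order mutual information can be expanded starting from either variable, can be verified numerically on an arbitrary discrete joint distribution. The snippet below is an illustrative check of that identity (the helper names are ours, not part of the model):

```python
import numpy as np

rng = np.random.default_rng(1)
p = rng.random((2, 2, 2))          # arbitrary joint pmf over binary (A, X, Z)
p /= p.sum()

def H(pmf):
    """Shannon entropy of a (possibly multi-dimensional) pmf, in nats."""
    q = pmf[pmf > 0]
    return -np.sum(q * np.log(q))

# Marginals needed for the mutual-information terms (axes: 0=A, 1=X, 2=Z).
p_ax, p_az, p_xz = p.sum(2), p.sum(1), p.sum(0)
p_a, p_x, p_z = p.sum((1, 2)), p.sum((0, 2)), p.sum((0, 1))

I_xz = H(p_x) + H(p_z) - H(p_xz)
I_az = H(p_a) + H(p_z) - H(p_az)
# Conditional MI via entropies: I(X;Z|A) = H(X,A) + H(Z,A) - H(A) - H(A,X,Z).
I_xz_given_a = H(p_ax) + H(p_az) - H(p_a) - H(p)
I_az_given_x = H(p_ax) + H(p_xz) - H(p_x) - H(p)

lhs = I_xz - I_xz_given_a   # I(A;X;Z), expanded from X
rhs = I_az - I_az_given_x   # I(A;X;Z), expanded from A
print(abs(lhs - rhs))       # ~0 up to floating-point error
```

Both expansions reduce to the same combination of joint entropies, which is why the two forms in Eq. (4) agree for any distribution.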
To summarize, our analysis reveals that (1) when the features of adjacent nodes are uncorrelated, we can impute the node time series and the graph adjacency independently (Eq. (3)); however, (2) in real applications, node features and graph structure are often correlated (e.g., Figure 2), and in such a scenario, there might be a mutually reinforcing effect between node feature imputation and graph adjacency imputation (Eq. (4)). Our analysis also provides novel and critical clues that can guide the design of the encoder-decoder framework for learning datasets with multi-modality such as NTS. For the encoder, Eq. (5) suggests that the latent representation Z (i.e., the output of the encoder) should encode both the graph adjacency information and the node feature information (i.e., the VAE part of Eq. (5)) as well as the mutual interaction between them (i.e., the CVAE part of Eq. (5)). For the decoder, we will present a three-stage prediction method so that the (imputed) graph structures and the (imputed) node features can be used as each other's conditions respectively (i.e., the CVAE part of Eq. (5)).

Encoder
The encoder aims to encode both the structural and the dynamic information of NTS data. Existing message-passing based GNNs typically only capture the local information from close neighbors. However, long-distance information between nodes is important in NTS data, since the graph is constantly evolving and interactions between nodes can occur at any time step. Therefore, to capture this long-distance global information, we propose using position embeddings based on random walk with restart (RWR) [20,60,73,74].

RWR-based Position Embeddings. For a graph G_t at time step t, the relative position vector for all nodes w.r.t. one anchor node i is computed by RWR as follows:

r_{t,i} = (1 − c) Â_t r_{t,i} + c e_i

where Â_t = (D_t^{-1} A_t)^⊤ is the normalized adjacency matrix of G_t, e_i ∈ R^N is a one-hot vector which only contains a nonzero value at position i, and c is the restart probability. After reaching the stationary distribution, we concatenate the position scores r_{t,i} ∈ R^N of all the anchor nodes as the final position embeddings R_t ∈ R^{N×K}, where N is the number of nodes and K is the number of anchor nodes.
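The RWR iteration above can be sketched with a simple power iteration. The function name and the toy graph below are ours, and the restart convention (r = (1 − c) Â r + c e) follows the description in the text; treat this as a minimal sketch, not the released implementation.

```python
import numpy as np

def rwr_position_embeddings(A, anchors, c=0.15, tol=1e-8, max_iter=1000):
    """Stack the stationary RWR score vectors w.r.t. each anchor node column-wise."""
    N = A.shape[0]
    deg = A.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0                      # guard against isolated nodes
    A_hat = (A / deg).T                      # (D^-1 A)^T, column-stochastic
    R = np.zeros((N, len(anchors)))
    for k, a in enumerate(anchors):
        e = np.zeros(N); e[a] = 1.0
        r = e.copy()
        for _ in range(max_iter):
            r_new = (1 - c) * A_hat @ r + c * e
            if np.abs(r_new - r).sum() < tol:
                r = r_new
                break
            r = r_new
        R[:, k] = r                          # relative position scores w.r.t. anchor a
    return R

# Toy example: a path graph 0-1-2-3 with anchor set {0, 3}.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
R = rwr_position_embeddings(A, anchors=[0, 3], c=0.15)
print(R.shape)  # (4, 2): one score per node per anchor
```

Each column is a probability distribution over nodes, and nodes closer to an anchor receive higher scores, which is what makes the concatenated columns usable as position embeddings.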
We next prove the expressive power of RWR-based position embeddings with the following proposition and theorem.

Proposition 3.1. Random walk with restart (RWR) captures information from close neighbors (local) and long-distance neighbors (global) in graph learning.

Proof. See Appendix. □

The benefit of RWR-based position embeddings in temporal graphs is summarized in Theorem 3.2 from the perspective of message-passing based temporal graph networks (TGN) [55], a general class of GNNs designed for handling temporal graphs. It contains two main components: memory (through RNNs) for capturing the dynamics of each node, and aggregate-and-update (through GNNs) for gathering topological information from neighbors.

Theorem 3.2. Given a temporal graph G, a TGN with RWR-based node position embeddings f_R has more expressive power than a regular TGN f in node representation learning:

D(f_R(u), f_R(v)) ≥ D(f(u), f(v))

where D(•, •) measures the expressiveness by counting the distinguishable node pairs (u, v) in G based on node representations.

Proof. See Appendix. □
Finally, to capture the dynamic information in NTS data, we use a 2-layer gated recurrent unit (GRU) [10] as the encoder to model q_φ(Z|X̃, M, R), where Z is the latent representation and R = {R_1, ..., R_T} is the tensor of node position embeddings. For each R_t, instead of treating all the nodes as anchor nodes, usually only a small subset of K anchor nodes is sufficient to distinguish nodes from each other in practice [78]. The masks M and position embeddings R are concatenated with the input X̃ at each time step before being fed into the GRU.
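At the shape level, the per-step encoder input can be sketched as follows (dimension names and the zero-filling of missing entries are our assumptions for the demo, not details from the paper):

```python
import numpy as np

N, D, K = 5, 3, 2                 # nodes, features, anchor nodes
rng = np.random.default_rng(2)
X_t = rng.normal(size=(N, D))     # (possibly incomplete) node features at step t
M_t = (rng.random((N, D)) < 0.75).astype(float)  # observation mask at step t
R_t = rng.random((N, K))          # RWR position embeddings at step t

# Concatenate features, mask, and position embeddings on the feature axis;
# missing entries are zeroed via the mask before concatenation.
gru_input_t = np.concatenate([X_t * M_t, M_t, R_t], axis=-1)
print(gru_input_t.shape)          # (5, 8): D + D + K features per node
```

The GRU encoder then consumes one such (N, D + D + K) slice per time step.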

Decoder
We design the decoder as a GRU with 3-stage predictions. We use H_t to denote the node embedding matrix at time step t and H to denote the node embedding tensor. Based on the analysis in Section 3.1, we model the complementary relation between the feature imputation p_θ(X̃|Ã, M, Z) and the network imputation p_ψ(Ã|M, R, Z) at different prediction stages in the decoder as follows.
3.3.1 First-stage Feature Prediction. In the first stage, we use a linear layer to generate an initial prediction of the missing values in the time series:

Ŷ_{1,t} = Linear(H_{t−1})

where H_{t−1} is the hidden representation of each node from the previous time step and H_0 ∈ R^{N×d_h} is sampled from a normal distribution, where d_h is the hidden dimension. Similar to [11], we then use a filler operator to replace the missing values in the input X̃_t with Ŷ_{1,t} to get the first-stage output O_t:

O_t = M_t ⊙ X̃_t + (1 − M_t) ⊙ Ŷ_{1,t}    (8)

3.3.2 Second-stage Link Prediction. Our second-stage prediction imputes the missing weighted edges within graphs. O_t is used together with the mask M_t, the position embedding R_t and H_{t−1} to get the embeddings of all nodes at time step t through a linear layer:

H^e_t = Linear(O_t ∥ M_t ∥ R_t ∥ H_{t−1})

where ∥ denotes concatenation. We directly use the hidden states from the previous time step H_{t−1} as the embeddings for those missing nodes, since no new features or graph structures of them are available at time step t. In NTS, observations are usually obtained by irregular sampling, and the imputation problem over them can occur at any future step in real-world problems. Being able to handle such uncertainty and to forecast unseen graph structures/time series at future time steps are two key characteristics of an NTS imputation model. Therefore, in order to capture the dynamics between different timestamps and enhance the expressiveness of PoGeVon, we also encode the time information with learnable Fourier features based on Bochner's theorem [68,69], whose properties are summarized in Proposition 3.3:

f(t) = [cos(w_1 t), sin(w_1 t), ..., cos(w_d t), sin(w_d t)]

where w_1, ..., w_d are learnable parameters.
Proposition 3.3.Time encoding function f(t) is invariant to time rescaling and generalizes to any future unseen timestamps.
Proof. See [33,69]. □

Then, we concatenate the node embeddings with the time encodings through broadcasting as the input of a two-layer multi-layer perceptron (MLP) to predict the missing edges. The next step is to enhance the node embeddings with the updated graph structures. The general class of message-passing neural networks (MPNNs) [22] is used, similar to the aggregate-and-update step in TGN, to capture the graph topological information; the detailed design can be found in the Appendix.

3.3.3 Third-stage Feature Prediction. The third stage makes a fine-grained imputation again over the node feature time series. Aiming to enhance the semantics of the node representations, we apply a self-attention layer [61] to capture cross-node information in our third-stage prediction, which helps to encode richer node interaction information that is not captured in H. The structural representations, the first-stage output O_t, as well as the masks M_t are all concatenated and processed by a self-attention layer with an MLP to get the final imputation representations. Then a two-layer MLP is used for the third-stage prediction. A filler operator similar to Eq. (8) is applied to get the imputation output X^out_t from Ŷ_{2,t}. Finally, a single-layer GRU is used, similar to the memory step in TGN, to update the hidden representations based on the latent node representation Z, the output X^out_t, the mask M_t and the structural representation H^A_t for each node, before moving on to the next time step.

3.3.4 Bidirectional Model. Similar to [11], we extend our VAE model to a bidirectional one by replicating the architecture to handle both the forward and backward sequences. An MLP is used over the output hidden representations from these two VAEs to produce the final imputation output Ŷ:

Ŷ = MLP(H^out_fwd ∥ H^out_bwd)

where H^out is the tensor of imputation representations from the final-stage prediction, and fwd and bwd stand for the forward and backward directions respectively. Algorithm 1 summarizes the detailed workflow of the proposed PoGeVon.
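Two of the decoder ingredients above can be sketched in a few lines: the filler operator of Eq. (8), and a Bochner-style Fourier time encoding. These are our own minimal versions under stated assumptions (fixed frequencies instead of learned ones), not the released implementation.

```python
import numpy as np

def filler(X, Y_hat, M):
    """O_t = M_t * X_t + (1 - M_t) * Y_hat_t (element-wise): keep observed
    entries, fill missing ones with the stage prediction."""
    return M * X + (1 - M) * Y_hat

def time_encoding(t, w):
    """f(t) = [cos(w_1 t), ..., cos(w_d t), sin(w_1 t), ..., sin(w_d t)]."""
    return np.concatenate([np.cos(w * t), np.sin(w * t)])

X = np.array([[1.0, np.nan], [3.0, 4.0]])   # one missing entry
M = np.array([[1.0, 0.0], [1.0, 1.0]])      # 0 marks the missing entry
Y_hat = np.array([[9.0, 2.0], [9.0, 9.0]])  # stage prediction

O = filler(np.nan_to_num(X), Y_hat, M)
print(O)  # observed entries kept; the missing one is filled with 2.0

w = np.array([0.1, 1.0, 10.0])  # frequencies (learnable in the model, fixed here)
print(time_encoding(2.0, w).shape)  # (6,)
```

Because f(t) is built from sinusoids of the raw timestamp, it can be evaluated at any future t, which matches the generalization property claimed in Proposition 3.3.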

Objective and Training
Algorithm 1 (excerpt):
  ...
  Encode X̃_t, M_t, R_t to get Z_t based on Section 3.2.
  ...
  16: Update parameters θ, φ, ψ by optimizing the loss in Eq. (19).
  17: end for
  18: Obtain the predicted tensor X of node feature time series based on Eq. (8) by replacing missing values in X̃ with Ŷ.
  19: Obtain the predicted graph adjacency tensor A based on Eq. (8) by replacing missing values in Ã with A^out.
  20: return the predicted tensor time series X and the predicted tensor of graph adjacency A.

The Evidence Lower Bound (ELBO) objective function of a vanilla conditional VAE [12,14] over missing data imputation can be defined as:

ELBO = E_{q_φ(Z|X̃,M)}[log p_θ(X̃|Z, M)] − KL[q_φ(Z|X̃, M) ∥ p(Z)]    (17)

Our goal is to learn a good generative model of both the observed multivariate node feature time series X̃ and the observed graph adjacency Ã. Thus, we can treat the position embeddings R as an extra condition in addition to the mask M, similar to [26]. This is because M and R are auxiliary covariates, and are given or can be generated through deterministic functions based on X̃ and Ã respectively. Therefore, it is more natural to maximize log p(X̃, Ã|M, R) as our objective, which is summarized in the following lemma.
Lemma 3.4. Under the condition that M and R are jointly independent of the prior p(Z), i.e., p(Z) = p(Z|M, R), the new ELBO objective of the proposed PoGeVon for the NTS imputation problem is:

ELBO_new = E_{q_φ(Z|X̃,Ã,M,R)}[log p_θ(X̃|Z, Ã, M, R) + log p_ψ(Ã|Z, M, R)] − KL[q_φ(Z|X̃, Ã, M, R) ∥ p(Z)]

where ψ denotes the parameters of the link prediction module.
Proof. The derivation of ELBO_new can be formulated as:

log p(X̃, Ã|M, R) ≥ E_{q_φ(Z|X̃,Ã,M,R)}[log p_θ(X̃|Z, Ã, M, R) + log p_ψ(Ã|Z, M, R)] − KL[q_φ(Z|X̃, Ã, M, R) ∥ p(Z|M, R)]

since the node position embedding R can be generated from the observed graph adjacency Ã, and p(Z|M, R) = p(Z) by the independence condition. □

This lemma generalizes the ELBO in Eq. (17) to the multi-task learning setting, which ensures that the learning objective of the proposed PoGeVon is consistent with our analysis in Section 3.1. That is, ELBO_new corresponds to Eq. (1) by modeling the dependencies between the observed node time series X̃ and the observed graph adjacency Ã, similar to Eq. (5).
We use a similar strategy as in [12,51] to maximize ELBO_new by training our model over observed data and inferring missing ones based on p(X|X̃) ≈ ∫ p(X|Z) q(Z|X̃) dZ. We (1) use the mean absolute error (MAE) as the error function for the feature imputation and (2) use the Frobenius norm between the predicted adjacency matrices and the observed adjacency matrices as the link prediction loss. The model is trained by minimizing the following loss function, which is composed of the errors of all three stages:

L = err(Ŷ_{1,t:t+Δt}, X_{t:t+Δt}, M_{t:t+Δt})  [error for the 1st stage prediction]
  + γ ∥(A^out_{t:t+Δt} − Ã_{t:t+Δt}) ⊙ M^A_{t:t+Δt}∥_F  [error for the 2nd stage prediction]
  + err(X^out_{t:t+Δt}, X_{t:t+Δt}, M_{t:t+Δt})  [error for the 3rd stage prediction]
  + β KL[q_φ(Z|X̃, Ã, M, R) ∥ p(Z)]    (19)

where β is the weight for the KL divergence similar to [25] and γ is the weight for the 2nd stage prediction. The element-wise error function err(X_pred, X_label, M) outputs the average error by calculating the inner product between the mask tensor M and |X_label − X_pred|. The loss L is optimized over each sample in the dataset, which is a sliding window (t : t + Δt) of NTS data (i.e., G̃_{t:t+Δt}).
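The element-wise error function above can be sketched as follows. Restricting the average to observed entries is our reading of "training over observed data", so treat this as a minimal sketch rather than the exact training code:

```python
import numpy as np

def masked_mae(X_pred, X_label, M):
    """Average absolute error over entries where the mask M is 1."""
    denom = max(M.sum(), 1.0)   # avoid division by zero for all-missing windows
    return float((M * np.abs(X_label - X_pred)).sum() / denom)

X_label = np.array([[1.0, 2.0], [3.0, 4.0]])
X_pred  = np.array([[1.5, 0.0], [3.0, 2.0]])
M       = np.array([[1.0, 0.0], [1.0, 1.0]])  # the 0-entry is excluded

print(masked_mae(X_pred, X_label, M))  # (0.5 + 0.0 + 2.0) / 3 ~ 0.8333
```

The same masked reduction applies to the first- and third-stage terms of the loss; the second-stage term instead uses a Frobenius norm over the masked adjacency difference.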

Complexity Analysis
The computational complexity of PoGeVon can be analyzed through the following aspects. First, calculating the position embeddings R has complexity O(K · Ē · log(1/ε)) [62], where Ē is the average number of edges and ε is the absolute error bound for the power iteration of RWR. Second, compared with a standard bidirectional VAE based on GRUs, the MPNN increases the complexity by O(Ē) with sparse matrix multiplications at each time step. Third, the self-attention used in the third-stage decoder has complexity O(N²). There are several ways to reduce the overall time complexity. For example, most of the computations can be parallelized. One computational bottleneck lies in the computation of self-attention. Existing techniques for efficient attention [58] can be readily applied in the proposed PoGeVon, such as Linformer [64], which uses low-rank projections to make the cost of the attention mechanism O(N), and Reformer [38], which applies locality-sensitive hashing to reduce the complexity of attention to O(N log N).

EXPERIMENT
We apply the proposed PoGeVon to the networked time series imputation task, and evaluate it in the following aspects: • Q1. ...

• COVID-19: ... [8] to capture the dynamics and generate the graph sequence. Finally, we simulate the missing edges in the NTS imputation problem by masking an edge when one of its end nodes contains missing features. Specifically, the edge weight A_t[i, j] between nodes i and j at time t can be defined as in Eq. (20), where δ is the positive threshold for graph dynamics; we choose δ = 0.3 for the COVID-19 dataset. We randomly mask out 25% of the node features in this dataset, and split the time axis into 70% for training, 10% for validation and 20% for test respectively.

• AQ36: A dataset of AQI values of different air pollutants collected from various monitor stations over 43 cities in China [82]. Following [5,11], we use the reduced version of the dataset which contains 36 nodes (AQ36) and pick the last four months as the test data. To construct the static graph G(A, X), we use the thresholded Gaussian kernel from [57] over the pairwise distances between stations i and j to obtain the edge weights A[i, j]. The graph sequence is constructed using a similar method as Eq. (20) over the normalized time series features, and the threshold δ is set to 0.8. We use the same mask setting as [76], which simulates the true missing data distribution.

• PeMS-BA/PeMS-LA/PeMS-SD: Three datasets containing traffic statistics based on the Caltrans Performance Measurement System (PeMS) [6], which cover the freeway system in major areas of California.

The missing rate of AQ36's time series features is about 13.24%, while for the COVID-19 dataset and all the traffic datasets, the time series features have 25% missing values. Based on Eq. (20), the missing rate of edges is 28.06% for AQ36, 43.23% for COVID-19, and 43.75%/43.74%/43.71% for PeMS-BA/PeMS-LA/PeMS-SD respectively.
To be consistent with the dataset settings in previous works such as GRIN [11], we use the following window lengths to train the models: (i) 14 for the COVID-19 dataset, which corresponds to 2 weeks; (ii) 36 for the AQ36 dataset, which corresponds to 1.5 days; and (iii) 24 for all the traffic datasets, which corresponds to 2 hours of data.

Baselines.
We compare the proposed PoGeVon model with the following baselines for the time series imputation task. All the methods are trained on an NVIDIA Tesla V100 SXM2 GPU.
(1) Mean. Impute with the node-level feature average along the sequence. (2) Matrix Factorization (MF). Matrix factorization of the incomplete matrix with rank 10. (3) MICE [65]. Multiple imputation by chained equations. The algorithm fills the missing values iteratively until convergence. We use the 10 nearest features and set the maximum number of iterations to 100. (4) BRITS [5]. BRITS has similar bidirectional recurrent modules to ours for time series imputation. It learns to impute only based on the time series features and does not consider the spatial information of the underlying graphs. (5) rGAIN [11]. A GAN-based imputation model which is similar to SSGAN [50]. rGAIN can be regarded as an extension of GAIN [77] with a bidirectional encoder and decoder. (6) SAITS [15]. SAITS is a self-attention based method with a weighted combination of two diagonally-masked self-attention blocks, which is trained by a joint optimization approach on imputation and reconstruction.
(7) TimesNet [66]. TimesNet transforms the 1D time series into 2D space and presents the intraperiod- and interperiod-variations simultaneously. Its inception block is able to discover multiple periods and capture temporal 2D-variations from the transformed data. (8) GRIN [11]. GRIN is a state-of-the-art model for MTS imputation with relational information from a static and accurately known graph, which uses an MPNN to build a spatiotemporal recurrent module and solves the problem in a bidirectional way. (9) NET³ [29]. NET³ is a recent work focusing on tensor time series learning and assumes that the tensor graphs are fixed and accurately known.
NTS imputation (i.e., Problem 1) also aims to solve the link prediction problem. We compare the performance of our method with the following baselines: (1) VGAE [37]. The vanilla variational graph autoencoder is the first work that brings VAE to graph learning, and has competitive performance on the link prediction task over static graphs. (2) VGRNN [24]. Variational graph recurrent neural networks extend VGAE to handle temporal information with the help of RNNs, and are a powerful baseline for the link prediction task on dynamic graphs.

Metrics.
We use mean absolute error (MAE), mean squared error (MSE) and mean relative error (MRE) to evaluate the imputation performance of all models over the missing features. For the link prediction task, we use the Frobenius norm as the metric, since all the edges are weighted. All the experiments are run with 5 different random seeds and the results are presented as mean ± standard deviation (std).

Time Series Imputation Task
Empirical results from Table 3 and Table 4 demonstrate that the proposed PoGeVon outperforms all the baselines on the time series missing value prediction task in the NTS imputation problem. In particular, PoGeVon achieves more than 10% improvement on all the datasets compared with the best baselines. Notably, PoGeVon has significant improvements over all the baselines on the COVID-19 dataset, where the other neural network based models except TimesNet perform even worse than traditional time series imputation methods such as MF and MICE. It is worth noting that, although equipped with modules to handle topological information from graphs, GRIN and NET³ are less competitive than PoGeVon when the graph is constantly changing and contains missing edges. On the AQ36 dataset and the PeMS-SD dataset, they perform worse than BRITS and rGAIN, which do not leverage any topological information. PoGeVon outperforms BRITS and rGAIN by at least 12.92% and 10.55% on these two datasets respectively, which further indicates the effectiveness of our method.
Although TimesNet is the strongest baseline on most of the datasets except AQ36, there still exists a large gap between its performance and PoGeVon's, even though it has many more parameters (triple the number of parameters of PoGeVon). The main reason PoGeVon fluctuates (with a large std) on the AQ36 dataset compared with the traffic datasets is that AQ36 has fewer training samples (time steps), which brings more uncertainty for the model and results in larger differences in performance across different random seeds.

VGAE and VGRNN were originally designed for link prediction over unweighted graphs. However, all the graphs are weighted in our NTS imputation settings; thus, we modify these models correspondingly and apply the same Frobenius loss function we use in PoGeVon to train them. All the results are listed in Table 5. Both baselines have relatively worse performance compared to PoGeVon on all the datasets, and even with RNNs, VGRNN only gains a minor improvement over VGAE. This indicates that both VGAE and VGRNN may not be able to handle the link prediction task over weighted dynamic graphs very well.

To evaluate the effectiveness of different components of our proposed method, we compare PoGeVon with the following variants: (1) Replace the RWR node position embeddings with shortest path distance (SPD) based node embeddings by calculating the distance between each node and the anchor nodes. (2) Replace the RWR node position embeddings with the RWPE node position embeddings from [16]. (3) Replace the RWR node position embeddings with the PGNN node position embeddings from [78]. (4) Remove the link prediction module in the 2nd stage prediction. (5) Remove the self-attention module in the 3rd stage prediction by replacing it with a linear layer. The results of the ablation study on the AQ36 dataset are shown in Table 6. As we can see, the proposed method PoGeVon indeed performs the best, which corroborates the necessity of all these components in the model.

Sensitivity Analysis.
We conduct sensitivity analysis to study the effect of increasing the masking rates. We consider the following mask rates: 15%, 25%, 35%, 45%. In order to keep a reasonable edge missing rate, each edge with either end node being masked has a 70% chance of being masked, instead of using the setting from Eq. (20). The results are shown in Figure 4, in which the error bars demonstrate the standard deviation of MAE over 5 runs with different random seeds. The proposed PoGeVon consistently outperforms all the baselines in these settings, which further demonstrates the effectiveness and robustness of our method.

RELATED WORK
In this section, we review related works in two groups: (1) multivariate time series imputation and (2) GNNs with relative position encodings.

Multivariate Time Series Imputation. In addition to traditional methods such as ARIMA [2] and K-Nearest Neighbors (KNN) [7], deep learning models have been widely adopted in recent years to solve the MTS imputation problem. BRITS [5] is one of the most representative methods, using bidirectional RNNs. There also exists a wide range of methods using deep generative models such as generative adversarial nets (GAN) [23] and VAE [35]. GAIN [77] is one of the earliest methods that uses GAN to impute missing data, and later [46] applies GAN to the multivariate time series setting based on 2-stage imputation. E²GAN [47] is an end-to-end GAN that uses a noised compression and reconstruction strategy to generate more reasonable imputed values than previous works. SSGAN [50] proposes a novel GAN-based method to handle missing data in partially labeled time series. VAE is used in GP-VAE [18] to solve the MTS imputation task with a Gaussian process prior.
Other works handle the MTS imputation problem from the perspective of spatial-temporal modeling, which takes advantage of entity relations from the underlying graph. [4] is the first attempt to use a matrix factorization algorithm to recover missing values in MTS data with graph structures. More recently, GNNs have been used to capture the topological information in MTS data. GRIN [11] proposes a novel bidirectional message-passing RNN with a spatial decoder to handle both spatial and temporal information. SPIN [48] uses sparse spatiotemporal attention to capture inter-node and intra-node information for predicting missing values in MTS. NET³ [29] generalizes the problem to tensor time series where multiple modes of relational dependencies exist in the time series; it introduces a tensor GCN [36] to handle the tensor graphs and then proposes a tensor RNN to incorporate the temporal dynamics. One common limitation of all these methods is that they either ignore the topological information from the graph or assume the graph is fixed and accurately known.

GNNs with Relative Position Encodings. The expressive power of message-passing based GNNs has been proved to be bounded by the 1-Weisfeiler-Lehman test (1-WL test) [70]. Many follow-up works improve the expressive power of GNNs beyond the 1-WL test, and position-aware graph neural networks (P-GNNs) [78] is one of them. P-GNN randomly picks sets of anchor nodes and learns a non-linear distance-weighted aggregation scheme over these anchor sets for each node; the resulting relative position encodings are provably more expressive than those of regular GNNs. Distance Encoding [40] uses graph-distance measures between nodes as extra features and proves that it can distinguish node sets in most regular graphs where message-passing based GNNs would fail. [16] proposes a novel module for learnable structural and positional encodings (LSPE) for GNNs and Transformers [61], which generates more expressive node embeddings. Recently, PEG [63] imposes permutation equivariance and stability on position encodings by using separate channels for node features and position features. Compared with these existing methods, our proposed RWR-based position embedding captures more topological information from the entire graph, as our analysis in Section 3.2.1 shows.

CONCLUSION
In this paper, we focus on solving the networked time series imputation problem, which has two main challenges: (1) the graph is dynamic and contains missing edges, and (2) the node feature time series contain missing values. To tackle these challenges, we propose PoGeVon, a novel VAE model utilizing specially designed RWR-based position embeddings in the encoder. For the decoder, we design a 3-stage prediction scheme to impute missing values in both features and structures complementarily. Experiments on a variety of real-world datasets show that PoGeVon consistently outperforms strong baseline methods for the NTS imputation problem.

A.4 Reproducibility
We introduce the detailed parameter settings of our model as well as the baselines in this subsection. For PoGeVon, the restart probability of RWR for position embeddings is set to the commonly used 0.15, and the number of anchor nodes is set to log2(N), where N is the number of nodes. We set the hidden dimension to 64 and the other two hyperparameters to 0.2 and 0.01, respectively.
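As a rough illustration of how RWR-based node position embeddings with anchor nodes can be computed, consider the sketch below. This is our own minimal version under stated assumptions (uniform random anchor selection, fixed power-iteration count, row-normalized transition matrix), not the paper's implementation:

```python
import numpy as np

def rwr_scores(A, seed, c=0.15, iters=100):
    """Random walk with restart from `seed`: r = c·e + (1-c)·Pᵀr,
    where P is the row-normalized adjacency matrix."""
    N = A.shape[0]
    P = A / np.maximum(A.sum(axis=1, keepdims=True), 1e-12)
    r = np.full(N, 1.0 / N)
    e = np.zeros(N)
    e[seed] = 1.0
    for _ in range(iters):
        r = c * e + (1 - c) * P.T @ r
    return r

def rwr_position_embeddings(A, c=0.15, rng=None):
    """Node position embeddings: RWR scores w.r.t. k = log2(N) anchors."""
    rng = rng or np.random.default_rng(0)
    N = A.shape[0]
    k = max(1, int(np.ceil(np.log2(N))))
    anchors = rng.choice(N, size=k, replace=False)
    return np.stack([rwr_scores(A, a, c) for a in anchors], axis=1)  # (N, k)

A = np.ones((8, 8)) - np.eye(8)  # toy complete graph on 8 nodes
emb = rwr_position_embeddings(A)
```

Each column of `emb` is one anchor's RWR distribution, so every node receives a k-dimensional position signature relative to the anchors.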
All the experiments are based on code from an open source library [56] and code provided by the corresponding authors. We modify their implementations for the NTS imputation problem, and the parameter settings are as follows. For the baseline models in time series feature imputation, we follow the settings of previous works [5,11]. For BRITS, we use the same hidden dimension as [5] for the AQ36 dataset; for the traffic datasets, the hidden size is set to 256, aligned with the setting used for the PeMS data in [11]. For rGAIN, we follow the exact same setting as in [11,48], using 64 as the hidden size for AQ36 and 256 for the traffic datasets. For SAITS and TimesNet, we follow the exact parameter settings in their papers [15,66]. For GRIN, we use the same hidden size for AQ36 as [11], while using 80 for the traffic datasets since they are much larger than AQ36. As for the hidden dimension of NET³, we use 128 for the AQ36 dataset and 256 for the traffic datasets. For the COVID-19 dataset, we use the same parameter settings as the traffic datasets for all models.
For VGAE and VGRNN in the link prediction task in NTS imputation, we use hidden sizes of 256/128 for AQ36 and 320/150 for the traffic datasets, respectively.
We train all the models using PyTorch [53] with the Adam optimizer [34]; the learning rates are set to 0.001/0.01 for the time series feature imputation and link prediction baselines respectively, with a cosine annealing scheduler [45] to adjust the values dynamically. The batch sizes are all set to 32, and we use the validation dataset for early stopping.
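For reference, the cosine annealing schedule [45] follows the closed form sketched below (the 100-epoch horizon is an illustrative choice of ours, not the paper's setting):

```python
import math

def cosine_annealing_lr(base_lr, t, T_max, eta_min=0.0):
    """Cosine annealing (Loshchilov & Hutter): decay base_lr toward
    eta_min over T_max steps along a cosine curve."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * t / T_max))

# e.g. the 0.001 feature-imputation learning rate over 100 epochs
lrs = [cosine_annealing_lr(1e-3, t, 100) for t in range(101)]
```

The schedule starts at the base learning rate, decreases slowly at first, fastest mid-training, and flattens out near `eta_min` at the end.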

A.5 Additional Experiments
A.5.1 Visualization of Prediction. The prediction results of selected baselines over PeMS-LA with a 50% missing rate on the test data can be found in Figure 5. It is clear that PoGeVon achieves better prediction results than the other baselines. In particular, GRIN and NET³ sometimes suffer from fluctuations due to the missing edges in the NTS data, which results in poor performance compared to our proposed PoGeVon. Besides, PoGeVon produces finer predictions than all the baselines when there are abrupt changes in data values combined with high missing rates (e.g., data from time steps 60 to 65 in the prediction plot for sensor 34).

The ablation results of replacing the vanilla self-attention module in PoGeVon over the COVID-19 and AQ36 datasets are shown in Tables 7 and 8 respectively. Replacing vanilla self-attention with Linformer self-attention, as well as with the LSH self-attention from Reformer, still achieves better results on COVID-19 than the strongest baseline MICE. Both variants also achieve better or comparable results over AQ36 compared with the strongest baseline BRITS.

One limitation of PoGeVon is the quadratic complexity O(T²) of the self-attention module, where T is the number of time steps. As we have discussed in Section 3.5, we can reduce this complexity to either O(T) by Linformer [64] or O(T log T) by Reformer [38]. Another limitation lies in the potential negative transfer effect, which may occur when negative correlation exists between the time series of adjacent node pairs. Under such circumstances, directly applying the multi-task learning framework in PoGeVon might hurt the performance of NTS imputation. A possible solution is to resort to GNNs designed for graphs with heterophily [9,44,81,83] in the decoder of the proposed PoGeVon. There are several interesting aspects worth future study, including (1) generalizing the proposed PoGeVon to detecting anomalies on NTS, forecasting NTS [30,67] and assisting temporal graph analysis [19,21]; (2) applying it to temporal knowledge graph completion [27] and alignment [72] as well as dynamic recommendations [42].
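The Linformer-style complexity reduction can be sketched as follows: keys and values are projected along the sequence axis from length T down to k before the softmax, so the score matrix is T×k rather than T×T. This is a minimal single-head NumPy illustration of the idea; the projection matrices `E`, `F` and their initialization are our own assumptions, not the paper's implementation:

```python
import numpy as np

def linformer_attention(Q, K, V, E, F):
    """Linformer-style attention: project K and V along the sequence
    axis (length T) down to k, making the score matrix T×k."""
    K_proj = E @ K                                 # (k, d)
    V_proj = F @ V                                 # (k, d)
    scores = Q @ K_proj.T / np.sqrt(Q.shape[1])    # (T, k) instead of (T, T)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ V_proj                        # (T, d)

T, d, k = 64, 16, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))
E = rng.standard_normal((k, T)) / np.sqrt(T)       # sequence-axis projections
F = rng.standard_normal((k, T)) / np.sqrt(T)
out = linformer_attention(Q, K, V, E, F)
```

With fixed k, both memory and time scale linearly in T, matching the O(T) claim above.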

Definition 2.2 (Networked Time Series (NTS)). Networked time series is an extension of multivariate time series, in which a sequence of graphs G(A, X) = {G_1, G_2, ..., G_T} is given, and A models the node interactions over time. Each graph G_t is represented as a weighted adjacency matrix A_t ∈ R^{N×N} together with the node feature matrix X_t ∈ R^{N×D}.

Definition 2.3 (Mask Tensor). A binary mask tensor M = {M_1, M_2, ..., M_T} ∈ R^{T×N×D} serves as the indicator of missing values in MTS data, in which the value M_t[i, j] indicates the availability of feature j of entity i at time step t: M_t[i, j] being 0 or 1 indicates that the corresponding feature is missing or observed, respectively.

Given the nature of NTS data, missing data can occur in two parts: (1) missing values in the node feature time series, and (2) missing edges in the graph structures. The former is similar to missing values in traditional MTS, while the latter is unique to NTS and reflects the underlying dynamics of the graph sequence. Therefore, we can also define a mask tensor for the graph adjacency sequence analogously to Definition 2.3. We formally define partially observed NTS data and the NTS imputation problem as follows:

Definition 2.4 (Partially Observed NTS). A partially observed NTS G(Ã, X̃) = {G̃_1, G̃_2, ..., G̃_T} consists of an observed graph adjacency tensor Ã and an observed node feature tensor X̃. The values Ã_t[i, j] and X̃_t[i, j] are observed only if M^A_t[i, j] = 1 and M^X_t[i, j] = 1, where M^A and M^X are the mask tensors for the graph adjacency structure and node features respectively.
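Definitions 2.3 and 2.4 can be made concrete with a toy example: the partially observed tensors are simply the full tensors with masked entries zeroed out (shapes and values below are illustrative only):

```python
import numpy as np

# Toy NTS with T=2 steps, N=3 nodes, D=1 feature (shapes per Definition 2.2).
X = np.arange(6, dtype=float).reshape(2, 3, 1)
A = np.ones((2, 3, 3))

# Mask tensors (Definition 2.3): 1 = observed, 0 = missing.
M_x = np.ones_like(X)
M_x[1, 2, 0] = 0                          # feature of node 2 missing at t=1
M_a = np.ones_like(A)
M_a[1, 0, 1] = M_a[1, 1, 0] = 0           # edge (0,1) missing at t=1

# Partially observed NTS (Definition 2.4): values visible only where mask = 1.
X_tilde = X * M_x
A_tilde = A * M_a
```

The imputation task is then to recover the zeroed entries of `X_tilde` and `A_tilde` from the observed ones.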

Figure 2 :
Figure 2: An illustrative example of the mutually reinforcing effect between node feature imputation and graph structure imputation, based on 4 monitoring stations in the AQ36 dataset (see Section 4 for details of the dataset). Correlation between three time series (1001, 1002, and 1003, indicated by three red boxes) helps impute the missing edges between them (the two red dashed lines). Meanwhile, the edges between 1001, 1002 and 1004 (the two purple lines) help impute the time series/node features by capturing the lagged correlation between them (the three purple boxes). Best viewed in color.

Figure 3 :
Figure 3: The model architecture of the proposed PoGeVon.

3.3.3
Third-stage Feature Prediction. In the third-stage prediction, we utilize the structural representation H^graph_t. The latent node representations Z, the previous hidden state H_{t-1}, and the structural representation H^graph_t

First and third terms in ELBO_new

Figure 4 :
Figure 4: Sensitivity analysis for time series imputation with different masking rates on the traffic dataset. Lower is better. Best viewed in color.

Figure 5 :
Figure 5: Different models' predictions of traffic flow at sensors 23 and 34 from the PeMS-LA dataset. Best viewed in color.

Table 2 :
Statistics of the datasets. Entity numbers of PeMS* datasets refer to the original number of sensors/stations in the corresponding dataset; only part of them are used to build the graphs.
Similar to [31], we choose infection cases of states as the time series data X and use mobility of people across different states to model the spatial relationship A between them; a Radial Basis Function (RBF) kernel is then applied to the mobility values. For the traffic datasets, we collect 5-minute interval traffic flow data from 3 different districts 4, 7 and 11 between 01/01/2022 and 03/31/2022, which represent the traffic information of the Bay Area, Los Angeles and San Diego respectively. For each dataset, we pick 64 sensors with the largest feature variance, and use their latitude and longitude values to calculate pairwise distances to build the static graph. We only keep edges with distance within a certain threshold: 15 miles for PeMS-BA/PeMS-LA and 10 miles for PeMS-SD. The graph sequence is constructed using a similar method as for the AQ36 dataset, and the threshold is set to 0.8. We use similar masking settings as the COVID-19 dataset.
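A sketch of this distance-thresholded, RBF-weighted graph construction; the bandwidth `sigma`, function names, and toy coordinates are our own illustrative choices, not the paper's exact parameters:

```python
import numpy as np

def rbf_distance_graph(coords, sigma, max_dist):
    """Weighted adjacency from pairwise distances: RBF kernel
    w_ij = exp(-d_ij^2 / (2·sigma^2)), keeping only edges whose
    distance is within `max_dist` (e.g. 15 miles for PeMS-BA/LA)."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    W = np.exp(-d**2 / (2 * sigma**2))
    W[d > max_dist] = 0.0        # drop edges beyond the distance threshold
    np.fill_diagonal(W, 0.0)     # no self-loops
    return W

coords = np.array([[0.0, 0.0], [3.0, 4.0], [30.0, 40.0]])  # toy sensor positions
W = rbf_distance_graph(coords, sigma=10.0, max_dist=15.0)
```

Nearby sensors get weights close to 1, distant pairs decay smoothly, and pairs beyond the threshold are disconnected entirely.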

Table 4 :
Performance comparison over PeMS-BA, PeMS-LA and PeMS-SD datasets. Smaller is better.

Table 5 :
Performance comparison of the link prediction task in NTS imputation. Smaller is better.

Table 6 :
Ablation study of PoGeVon over the AQ36 dataset on time series feature imputation. Smaller is better.

Table 7 :
Ablation study of self-attention in PoGeVon over the COVID-19 dataset on time series feature imputation. Smaller is better.

Table 8 :
Ablation study of self-attention in PoGeVon over the AQ36 dataset on time series feature imputation. Smaller is better.

Ablations on RWR Restart Probability in PoGeVon. We have also conducted ablation studies on the RWR restart probability for the node position embeddings in PoGeVon. In the literature, the restart probability is often set to a small number (e.g., 0.15). A high restart probability makes RWR-based node embeddings near anchor nodes more similar to each other, which increases the local information while diminishing the global information. A low restart probability sometimes results in a sparse matrix because of dead-end nodes in graphs, and the RWR algorithm degenerates to a vanilla PageRank/random walk equation. The experiment results obtained by setting the restart probability to different levels (0.1, 0.2, 0.4 and 0.8) are shown in Table 9. Based on the experiment results, we do observe a performance drop of PoGeVon when choosing a less effective restart probability for the RWR-based node position embeddings. However, PoGeVon demonstrates certain stability and can still outperform the other baseline models.
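The locality effect of the restart probability described above can be verified with a small power-iteration sketch (a toy illustration of the trade-off, not the paper's code):

```python
import numpy as np

def rwr(A, seed, c, iters=200):
    """Power iteration for RWR: r = c·e_seed + (1-c)·Pᵀr,
    with P the row-normalized adjacency matrix."""
    N = A.shape[0]
    P = A / np.maximum(A.sum(axis=1, keepdims=True), 1e-12)
    r = np.full(N, 1.0 / N)
    e = np.zeros(N)
    e[seed] = 1.0
    for _ in range(iters):
        r = c * e + (1 - c) * P.T @ r
    return r

# Path graph on 6 nodes: a higher restart probability keeps more
# probability mass on the seed node (more local information), while a
# lower one spreads the mass across the graph (more global information).
A = np.diag(np.ones(5), 1) + np.diag(np.ones(5), -1)
r_high = rwr(A, seed=0, c=0.8)
r_low = rwr(A, seed=0, c=0.1)
```

On this path graph, `r_high` concentrates most of its mass at node 0, while `r_low` distributes it much more evenly along the path.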

Table 9 :
Ablation study of the RWR restart probability in PoGeVon over the COVID-19 dataset on time series feature imputation. Smaller is better.

Limitations and Future Works. One limitation of the proposed PoGeVon model lies in its quadratic complexity O(T²) due to the self-attention module, where T is the number of time steps.