History-enhanced and Uncertainty-aware Trajectory Recovery via Attentive Neural Network

A considerable amount of mobility data has been accumulated due to the proliferation of location-based services. Nevertheless, compared with mobility data from transportation systems such as the GPS modules in taxis, this kind of data is commonly sparse at the level of individual trajectories, in the sense that users do not access mobile services and contribute their data all the time. Consequently, the sparsity inevitably weakens the practical value of the data, even when it has a high user penetration rate. To solve this problem, we propose a novel attentional neural network-based model, named AttnMove, to densify individual trajectories by recovering unobserved locations at a fine-grained spatial-temporal resolution. To tackle the challenges posed by sparsity, we design various intra- and inter-trajectory attention mechanisms to better model the mobility regularity of users and fully exploit the periodical patterns in long-term history. In addition, to guarantee the robustness of the generated trajectories and avoid harming downstream applications, we also exploit a Bayesian approximate neural network to estimate the uncertainty of each imputation. As a result, locations generated by the model with high uncertainty can be excluded. We evaluate our model on two real-world datasets, and extensive results demonstrate its performance gain over state-of-the-art methods. In-depth analyses of each design choice of our model are conducted to understand their contributions. We also show that, by providing high-quality mobility data, our model can benefit a variety of mobility-oriented downstream applications.


INTRODUCTION
Widely adopted location-based services have accumulated large-scale human mobility data, which has great potential to benefit a wide range of applications, from personalized location recommendation to urban transportation planning [45]. Nevertheless, since users may not allow the service provider to collect their locations continuously, individual trajectory records in such data are extremely sparse and unevenly distributed in time, which inevitably harms the performance of downstream applications, even when the data has a notable user penetration rate and covers a long period of time [12]. For example, with a limited number of trajectory records per day, it is difficult to predict an individual's next location and recommend proper points-of-interest [40]. As for collective mobility behaviour, because personal location records are missing most of the time, it is also hard to produce exact estimates of hourly crowd flow in cities for emergency response [22]. Therefore, it is desirable to rebuild individual trajectories at a fine-grained spatial-temporal granularity by imputing missing or unobserved locations. The most common solution to this problem is to impute the missing values by treating individual trajectories as two-dimensional time series with latitude and longitude at each timestamp [2, 26]. As such, smoothing filters [2, 26] and LSTM-based models [6, 36] have been proposed. Their performance is acceptable when only a small percentage of locations are missing, owing to the limited movement possible within a short time span. However, in highly sparse scenarios, their performance degrades significantly, as they fail to effectively model complex mobility regularity. Another line of study models users' transition regularity among different locations and generates the missing locations according to the highest transition probability from the observed locations [24, 36]. However, this strategy is still insufficient because the observed records in LBS data are unevenly distributed in time, and transition rules are thus incapable of inferring locations that remain unobserved for long stretches.
Luckily, human mobility has some intrinsic and natural characteristics, such as periodicity and repeatability, which can help to better rebuild the trajectory [14, 29]. In this regard, a promising direction is to leverage long-term mobility history, i.e., mobility records prior to the targeted trajectory, considering both that daily movements are spatially and temporally periodic and that collecting long-term data is achievable in LBS, as shown in Fig. 1. In addition, previous work has shown that explicitly utilizing historical trajectories can help with next-location prediction [8], a task similar to ours. All of these observations inspired us to design a history-enhanced recovery model.
However, trajectory recovery is still challenging for the following reasons. First, the high sparsity of both the targeted day's and the historical trajectories hinders us from inferring the missing locations via spatial-temporal constraints, i.e., how far a user can move during the unobserved periods. Because of the high uncertainty between two consecutive records in a sparse trajectory, a framework that can better model the mobility pattern and reduce the number of potentially visited locations is needed. The second challenge is how to distil periodical features from large amounts of historical data effectively, considering that real-world historical data contains a great deal of noise. Researchers have proposed detecting the locations of the home and the workplace from historical trajectories to build the basic periodic pattern. This is straightforward but insufficient because other locations are neglected [7]. Another way is to directly use the most frequently visited location at the targeted time slot across multiple historical trajectories as the imputation [24]. However, historically popular locations may not be the ones missing on a given targeted day, because mobility is influenced by many factors and some locations are only visited occasionally. Thus, deciding how much to rely on history is the third challenge.
Keeping the above challenges in mind, we propose a novel attentive neural network-based mobility recovery model named AttnMove, which is dedicated to handling sparsity and exploiting long-term history. For clarity, we define the targeted day's trajectory as the current trajectory and any trajectory before the targeted day as a historical trajectory. The proposed AttnMove model can be broken down into three key components, which address the main challenges accordingly. First, to capture the mobility pattern and indicate the most likely visited areas for the missing records, we design a current processor with an intra-trajectory attention mechanism to initially fill in the blanks of the current trajectory. We choose an attention mechanism rather than an RNN structure because sparse trajectories have few sequential characteristics, and all observed locations should be considered equally regardless of their visiting order. For example, suppose a user's locations at 6 AM, 9 AM, 3 PM and 6 PM are observed: to recover the location at 9 PM, we should give priority to the location at 6 AM and its adjacent area (i.e., the home area) rather than the one at 6 PM (i.e., commuting time). For the second challenge, we design a history processor with another intra-trajectory attention mechanism to distil periodical features from multiple historical trajectories. By aggregating them, more information from the long-term history can be leveraged for recovery. Finally, to fuse the features from the current and history processors and generate the missing locations, we propose a trajectory recovery module with inter-trajectory attention and location generation attention mechanisms. Specifically, the former yields attention weights to select locations from history based on the current mobility state, and the latter further considers spatial-temporal constraints from observed locations to better rebuild the trajectory.
Additionally, trajectory recovery, serving as a data augmentation technique, is required to be reliable, since confident yet inaccurate generation may harm downstream applications instead of bringing benefits. Nevertheless, deep neural networks are known to be poorly calibrated, i.e., they yield high confidence even for wrong predictions [13, 27]. To mitigate the risk of incorrect location generation, we further propose to quantify epistemic uncertainty via a Bayesian approximation of our AttnMove. Specifically, we aim to learn a distribution over the model parameters, so that we can estimate the expected generation confidence from an ensemble of models instead of a single deterministic one. We adopt the Monte Carlo Dropout method proposed in [10] for its flexibility and efficiency. With the quantified uncertainty, we demonstrate that we can identify and exclude unreliable recoveries from AttnMove's outputs.
Overall, our contributions can be summarised as follows:
• We study the problem of rebuilding individual trajectories at a fine-grained spatial-temporal granularity from sparse mobility data. To the best of our knowledge, we are the first to explicitly exploit long-term history to address this problem.
• We propose a novel model, AttnMove, for missing location recovery. Its core is to utilize attention mechanisms to model mobility regularities from sparse observations and distil periodical patterns from multiple trajectories of different days, and then fuse them to rebuild the trajectory.
• We conduct extensive experiments on two real-life mobility datasets. Results show that our AttnMove model significantly outperforms state-of-the-art baselines, improving recovery accuracy by 4.0%∼7.6%.
• We further extend AttnMove into an uncertainty-aware trajectory recovery model. By estimating the uncertainty of each imputation, we show that more reliable trajectory recovery is enabled.
This paper is an extension of our published conference paper [42]. The main differences are as follows. First, we modify the deterministic model into a probabilistic one, enabling uncertainty quantification to enhance the robustness of the imputation. With the estimated point-wise uncertainty, we show that we can filter out and discard locations with very low confidence. This significantly benefits downstream applications, as it prevents incorrect imputations from distorting subsequent tasks (see Sec. 4.4.4). Second, we elaborate on different sub-module combinations to identify the most efficient and effective model architecture. This also provides meaningful model design insights for related mobility research (see Sec. 3.1, 3.4, and 4.4.1). Last but not least, we reflect on how our machine learning model can be deployed in real industrial applications and evaluate it in a realistic scenario. This highlights the practical value of our study (see Sec. 5).
The rest of this paper is organized as follows. After formulating the problem in Section 2, we introduce our model in detail in Section 3. We then apply our model to two real-world mobility datasets and elaborate on evaluations answering six research questions in Section 4. We discuss the applications and deployment in Section 5, followed by a discussion of related work in Section 6. Finally, we conclude the paper in Section 7.

PRELIMINARIES
In this section, we introduce the definitions and notations used in this paper.
Definition 1 (Trajectory). We define a trajectory as a user's time-ordered location sequence in one day. Let $\mathcal{T}_u^n : l_{u,1}^n \to l_{u,2}^n \to \dots \to l_{u,t}^n \to \dots \to l_{u,T}^n$ denote user $u$'s $n$-th day's trajectory, where $l_{u,t}^n$ is the location at the $t$-th time slot for a given time interval (e.g., every 10 minutes). Note that if the location at time slot $t$ is unobserved, $l_{u,t}^n$ is denoted by null and called a missing location.
Definition 2 (Current and Historical Trajectory). Given a targeted day $n$ and user $u$'s trajectory $\mathcal{T}_u^n$, we define $\mathcal{T}_u^n$ as the user's current trajectory, and the historical trajectories are defined as $u$'s trajectories in the past days, i.e., $\{\mathcal{T}_u^1, \mathcal{T}_u^2, \dots, \mathcal{T}_u^{n-1}\}$.
When most of the locations in the current day's trajectory $\mathcal{T}_u^n$ are missing, exploiting history is beneficial for the recovery. For this reason, we formulate the investigated problem as: Problem (History-Enhanced Trajectory Recovery). Given user $u$'s trajectory $\mathcal{T}_u^n$ with the historical trajectories $\{\mathcal{T}_u^1, \mathcal{T}_u^2, \dots, \mathcal{T}_u^{n-1}\}$, recover the missing locations, i.e., all null entries in $\mathcal{T}_u^n$, to rebuild the current day's complete trajectory.
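To make these definitions concrete, the following minimal Python sketch (identifiers and location ids are hypothetical) shows how one day's trajectory and its history can be represented:

```python
# A trajectory is a fixed-length list of location ids, one per time slot;
# None marks a missing (unobserved) location. With 30-minute slots there
# are 48 per day.
SLOTS_PER_DAY = 48
trajectory = [None] * SLOTS_PER_DAY
trajectory[14] = 1203   # observed block id during the 7:00-7:30 slot
trajectory[18] = 877    # observed block id during the 9:00-9:30 slot
history = [trajectory]  # historical days are collected the same way
```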

ATTNMOVE: MODEL DESIGN
To solve the above-defined problem, we devise a novel model, AttnMove, which first processes the historical and current day's trajectories separately, and then integrates their features to comprehensively determine the locations to be recovered. The architecture of AttnMove is illustrated in Fig. 2. In AttnMove, to project sparse location and time representations into dense vectors that are more expressive and computable, a trajectory embedding module is designed as a prerequisite component for the other modules. Then, to extract periodical patterns, multiple historical trajectories are fed into a history processor to be aggregated. In parallel with the history processor is a current processor, designed to enhance the spatial-temporal dependence and better model mobility regularities. Finally, to fuse the historical and current trajectories from the above modules and generate locations as recovery, a trajectory recovery module is proposed as the final component.
In the following, we elaborate on the details of these four modules. For clarity, we summarize the notations in Table 1 and illustrate the key idea behind the four attention mechanisms in Fig. 3.

Trajectory Embedding Module
To represent the spatial-temporal dependency, we jointly embed the time and location into dense representations as the input of other modules. Let $\mathbf{e}_l \in \mathbb{R}^d$ be the embedding vector for each location $l \in \mathcal{L}$. We also set up an additional embedding vector for the missing location in the trajectories. All location embedding vectors are denoted as a matrix $\mathbf{E} \in \mathbb{R}^{|\mathcal{L}+1| \times d}$. $\mathbf{e}_t$ is the embedding vector for time slot $t$ and $\mathbf{e}_u$ denotes the embedding for user $u$, both having the same dimension as $\mathbf{e}_l$. We evaluate several combinations of them to achieve the best performance, as summarized in Table 2. For graph-based location embedding, we construct a graph with locations as nodes and the distance between two locations as edge weights; LINE [33], a second-order graph representation method, is applied to learn the embedding. Note that, in this way, geographic proximity can be captured within the location embedding. The second way to obtain $\mathbf{e}_l$ is to learn a trainable dense vector directly, which is randomly initialized and jointly updated with the whole model. For each time slot $t$, following the Transformer [35], we first try a trigonometric-function-based embedding:
$$\mathbf{e}_t(2i) = \sin\big(t/10000^{2i/d}\big), \qquad \mathbf{e}_t(2i+1) = \cos\big(t/10000^{2i/d}\big),$$
where $i$ denotes the $i$-th dimension. The time embedding vectors have the same dimension as the location embedding. As with locations, we also consider fully learnable dense vectors for time. Finally, we sum the time and location embedding vectors into a single one, denoted by $\mathbf{x}_t$, with or without the trainable user embedding. Consequently, the joint embedding, as dense vectors, eases the follow-up computation in the attention modules.
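As an illustration of this module, the sketch below computes the trigonometric time embedding and the summed location-time embedding (the table size follows our Beijing setting of 10,655 blocks plus one missing token; the other names are placeholders):

```python
import numpy as np

def time_embedding(t, d_model):
    """Trigonometric time-slot embedding as in the formula above:
    even dimensions use sin, odd dimensions use cos."""
    emb = np.zeros(d_model)
    for i in range(d_model // 2):
        angle = t / (10000 ** (2 * i / d_model))
        emb[2 * i] = np.sin(angle)
        emb[2 * i + 1] = np.cos(angle)
    return emb

d = 128                                    # embedding size used in our experiments
loc_table = np.random.randn(10655 + 1, d)  # +1 row for the "missing" token

def joint_embedding(loc_id, t):
    # Joint representation: sum of location and time vectors
    # (the optional user embedding would be added the same way).
    return loc_table[loc_id] + time_embedding(t, d)
```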

History Processor
When exploiting the historical trajectories, the information provided by a single historical trajectory with one day's observed locations is insufficient to capture periodical characteristics, as each day's records are sparse. To utilize multiple trajectories jointly, we design a history aggregator as follows:
$$\mathcal{T}_u^h = \mathcal{T}_u^1 \oplus \mathcal{T}_u^2 \oplus \dots \oplus \mathcal{T}_u^{n-1},$$
where $\oplus$ denotes extracting the location with the highest visiting frequency at the corresponding time slot (superscript $h$ denotes history, while $c$ denotes the current trajectory). Then, $\mathcal{T}_u^h$ is embedded with each time slot represented by $\mathbf{x}_t^h$, according to (1). As such, $\mathcal{T}_u^h$ is less sparse than any single historical trajectory. However, since the user may never have generated a record at a certain time slot on any previous day, some slots may still have no most-frequent location in the historical trajectories. In this case, we leverage the continuity property of trajectories to generate a denser aggregated trajectory in the representation space, using the attention mechanism described next.
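Before detailing that mechanism, the following sketch illustrates the frequency-based aggregation step $\oplus$ (a minimal version, assuming historical days are equal-length lists with None marking missing slots):

```python
from collections import Counter

MISSING = None  # placeholder for unobserved slots

def aggregate_history(histories):
    """Merge several historical day-trajectories by keeping, for every
    time slot, the most frequently visited location across days; slots
    with no record on any day stay MISSING."""
    n_slots = len(histories[0])
    merged = []
    for t in range(n_slots):
        votes = Counter(day[t] for day in histories if day[t] is not MISSING)
        merged.append(votes.most_common(1)[0][0] if votes else MISSING)
    return merged

# Example: three sparse days merged into one denser pseudo-trajectory.
days = [[3, None, 7], [3, 5, None], [None, 5, 7]]
print(aggregate_history(days))  # [3, 5, 7]
```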
Historical Intra-trajectory Attention. Intuitively, a missing location can be inferred from its time interval and distance to the other observed locations; for a simple example, a location missing midway between two observations could be estimated in the embedding space by interpolating between their embeddings. However, the actual scenario is much more complex, as people do not move uniformly. Therefore, we use a multi-head attentional network to model the spatial-temporal relations among trajectory points. First, we convert the discrete trajectory $\mathcal{T}_u^h$ into the continuous representation space using the trajectory embedding module (Sec. 3.1); a missing location is represented by the dedicated missing-location embedding defined there. Without loss of generality, we formulate the correlation between time slots $t$ and $s$ under head $h$ as
$$\alpha_{t,s}^{(h)} = \frac{\exp\big(\langle \mathbf{W}_Q^{1(h)} \mathbf{x}_t^h,\ \mathbf{W}_K^{1(h)} \mathbf{x}_s^h \rangle\big)}{\sum_{s'} \exp\big(\langle \mathbf{W}_Q^{1(h)} \mathbf{x}_t^h,\ \mathbf{W}_K^{1(h)} \mathbf{x}_{s'}^h \rangle\big)},$$
where $\mathbf{W}_Q^{1(h)}, \mathbf{W}_K^{1(h)} \in \mathbb{R}^{d' \times d}$ are transformation matrices and $\langle \cdot, \cdot \rangle$ is the inner product function. Next, we generate the representation of time slot $t$ under each head by combining the locations at all time slots, weighted by the coefficients $\alpha_{t,s}^{(h)}$:
$$\mathbf{z}_t^{(h)} = \sum_s \alpha_{t,s}^{(h)} \mathbf{W}_V^{1(h)} \mathbf{x}_s^h,$$
where $\mathbf{W}_V^{1(h)} \in \mathbb{R}^{d' \times d}$ is also a transformation matrix. Furthermore, different heads model different spatial-temporal dependencies, and we collect them as
$$\mathbf{z}_t = \big\Vert_{h=1}^{H}\ \mathbf{z}_t^{(h)},$$
where $\Vert$ is the concatenation operator and $H$ is the total number of heads. To preserve the representation of the raw locations, we add a standard residual connection:
$$\mathbf{h}_t^h = \mathrm{ReLU}\big(\mathbf{W}^1 \mathbf{z}_t + \mathbf{x}_t^h\big),$$
where $\mathbf{W}^1 \in \mathbb{R}^{d \times H d'}$ is a projection matrix in case of dimension mismatch and $\mathrm{ReLU}(x) = \max(0, x)$ is the non-linear activation function.
With the historical intra-trajectory attention layer, the representation of the historical trajectory is updated into a more expressive form $\mathbf{h}_t^h$, as shown in Fig. 3. We can stack multiple such layers, feeding the output of one layer as the input of the next; by doing so, we can learn more complex mobility features from the historical trajectories.
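For concreteness, the following NumPy sketch implements one such multi-head attention layer with a residual connection (weight shapes and names are illustrative; our actual implementation uses TensorFlow):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def intra_trajectory_attention(X, Wq, Wk, Wv, Wo):
    """One multi-head intra-trajectory attention layer over slot embeddings
    X (n_slots x d). Wq, Wk, Wv are lists of per-head (d x d') matrices;
    Wo maps the (H*d')-dim concatenation back to d. A residual connection
    and ReLU follow, mirroring the equations above."""
    heads = []
    for Wq_h, Wk_h, Wv_h in zip(Wq, Wk, Wv):
        Q, K, V = X @ Wq_h, X @ Wk_h, X @ Wv_h   # (n, d') each
        A = softmax(Q @ K.T)                     # attention coefficients
        heads.append(A @ V)                      # per-head representation
    Z = np.concatenate(heads, axis=-1) @ Wo      # (n, d)
    return np.maximum(0.0, Z + X)                # residual + ReLU
```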

Current Processor
When recovering the trajectory of user $u$, it is necessary to consider the locations visited before and after the missing time slots, which enclose the geographical area of the missing locations. Since locations can be missing for several consecutive time slots, this spatial constraint is weak. Therefore, we conduct intra-trajectory attention on the current trajectory $\mathcal{T}_u^n$ to capture the current day's mobility pattern: Current Intra-trajectory Attention. The first step is to embed $\mathcal{T}_u^n$ into a dense representation via the aforementioned trajectory embedding module. Next, we apply an attention mechanism with the same network structure as in the history processor, with the history representation replaced by the current trajectory representation; the relevant projection matrices are denoted $\mathbf{W}_Q^2, \mathbf{W}_K^2, \mathbf{W}_V^2, \mathbf{W}^2$, respectively. We also stack this layer multiple times to fully capture the spatial-temporal correlation. After updating, we obtain an enhanced current trajectory representation $\mathbf{h}_t^c$.

Trajectory Recovery Module
After extracting the historical and current features, the question is whether to depend on interpolation from the currently observed locations or to rely on the candidates generated from the historical trajectories. Intuitively, a good solution is to compare the current mobility status with the historical one and to combine the historical information with the current interpolation results according to their similarity. To achieve this, we propose the following mechanism: Inter-trajectory Attention. We define the similarity between the current and historical trajectories as the correlation between the enhanced representations at the corresponding time slots, i.e., between $\mathbf{h}_t^c$ and $\mathbf{h}_s^h$ ($\forall t, s = 1, 2, \dots, |T|$). We then combine the history candidates according to this similarity, followed by a residual connection to retain the raw interpolation results; the fused trajectory representation $\mathbf{f}_t$ is generated from $\mathbf{h}_t^c$ and $\mathbf{h}_s^h$ analogously to the intra-trajectory attention, with projection matrices $\mathbf{W}_Q^{3(h)}, \mathbf{W}_K^{3(h)}, \mathbf{W}_V^{3(h)}, \mathbf{W}^3$. With the fused trajectory, which contains both the historical mobility information and the current spatial-temporal dependence, we are ready to recover the missing locations. We use the following mechanism to generate the representation of each missing location, and then use it to identify the specific location.
Location Generation Attention. To generate the final representation, denoted by $\mathbf{s}_t$, we define a temporal similarity between the current trajectory embeddings and the fused trajectory representations, derived in the same form as the inter-trajectory attention with the corresponding inputs substituted. Then, $\mathbf{s}_t$ for time slot $t$ is a combination of the fused representations according to this similarity; the projection matrices are denoted by $\mathbf{W}_Q^4, \mathbf{W}_K^4, \mathbf{W}_V^4, \mathbf{W}^4$. Once we obtain $\mathbf{s}_t$, we are able to compute the probability that user $u$ visits location $l$ at time slot $t$. We propose two ways to calculate the probabilities: • The first is Similarity-based: we compare the embedding $\mathbf{s}_t$ with all location embedding vectors one by one and measure their similarity via the inner product,
$$\mathbf{p}_t = \mathrm{softmax}\big(\langle \mathbf{s}_t, \mathbf{e}_l \rangle,\ \forall l \in \mathcal{L}\big),$$
where $\langle \cdot, \cdot \rangle$ is the inner product function and $\mathbf{p}_t \in \mathbb{R}^{|\mathcal{L}|}$ denotes the normalized probabilities of all locations being visited at time slot $t$.
• The second is Classifier-based: we use a fully connected layer with a softmax activation function to map $\mathbf{s}_t$ into a distribution vector whose size equals the number of locations. Comparing the two output methods, the similarity-based one is parameter-free but computationally more costly. For both, in practice, the location with the maximum probability is identified as the missing location (the output).
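As an illustration, the sketch below computes the similarity-based output distribution and the recovered location (all arrays are randomly generated placeholders):

```python
import numpy as np

def similarity_probs(s_t, loc_embeddings):
    """Similarity-based output: inner product of the generated slot
    representation s_t with every location embedding, normalized by
    softmax. A classifier-based variant would instead apply a learned
    dense layer with softmax to s_t."""
    logits = loc_embeddings @ s_t            # (|L|,) inner products
    e = np.exp(logits - logits.max())
    return e / e.sum()

probs = similarity_probs(np.random.randn(128), np.random.randn(10655, 128))
recovered = int(np.argmax(probs))            # location with max probability
```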

Training
Overall, the parameters of AttnMove include the projection matrices and the location embedding matrix, denoted by $\Theta = \{\mathbf{E}, \mathbf{W}_Q^i, \mathbf{W}_K^i, \mathbf{W}_V^i, \mathbf{W}^i,\ i = 1, 2, 3, 4\}$. To train the model, we use cross entropy as the loss function:
$$\mathcal{L}(\Theta) = -\sum_{t \in \mathcal{T}_M} \langle \mathbf{y}_t, \log \mathbf{p}_t \rangle + \lambda \lVert \Theta \rVert^2,$$
where $\langle \cdot, \cdot \rangle$ is the inner product, $\mathbf{y}_t$ is the one-hot representation of user $u$'s location at the $t$-th time slot of the $n$-th day, $\mathcal{T}_M$ denotes the missing time slots, and $\lambda$ is a parameter controlling the strength of the regularization. The training procedure is illustrated in Algorithm 1; optimization is performed by stochastic gradient descent over shuffled mini-batches with the Adam optimizer [20]. Our model is implemented in Python and TensorFlow [1], and trained on a Linux server with a TITAN Xp GPU (12 GB memory) and an Intel(R) Xeon(R) CPU @ 2.20GHz.
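A minimal sketch of one training step is given below, assuming a Keras-style model that outputs per-slot probability vectors; the function and hyper-parameter names are illustrative rather than our exact implementation:

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)

def train_step(model, batch, mask, targets, reg_weight=1e-5):
    """Cross entropy over the masked time slots only (mask is a float 0/1
    tensor over slots), plus L2 regularization, optimized with Adam."""
    with tf.GradientTape() as tape:
        probs = model(batch, training=True)   # (B, T, |L|) per-slot probabilities
        ce = tf.keras.losses.sparse_categorical_crossentropy(targets, probs)
        loss = tf.reduce_sum(ce * mask) / tf.reduce_sum(mask)
        loss += reg_weight * tf.add_n(
            [tf.nn.l2_loss(v) for v in model.trainable_variables])
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```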

Uncertainty Estimation
Our model densifies trajectories via data augmentation. As in other data generation studies [34, 43], generation uncertainty is important for system robustness, since incorrectly added data can cause unexpected damage to downstream applications. However, deep learning models are usually black boxes and lack explanation, leading to unreliable and sometimes overconfident outputs [31]. Both the training data source and the model selection can introduce uncertainty. Specifically, in our study, GPS sampling errors or trajectory patterns unseen during model training will increase the uncertainty of the generation. Hence, it is necessary to measure and report uncertainty as auxiliary information alongside each output location. In practice, when the uncertainty is higher than a given threshold, we can exclude the generation to guarantee the overall accuracy of the recovered trajectory.
To quantify the uncertainty of model outputs, Bayesian neural networks and approximate Bayesian neural networks have been proposed [31], where the predictive distribution of the model output is formulated as
$$p(y^* \mid x^*, \mathcal{T}, \mathcal{T}^*) = \int p(y^* \mid x^*, \theta)\, p(\theta \mid \mathcal{T}, \mathcal{T}^*)\, d\theta,$$
where $\mathcal{T}$ and $\mathcal{T}^*$ are the training input and recovery trajectory sets, $x^*$ is the query trajectory, $y^*$ denotes the corresponding model prediction, and $\theta$ represents the model parameters. However, it is intractable to obtain the posterior distribution $p(\theta \mid \mathcal{T}, \mathcal{T}^*)$. Recently, Gal [9] proved that a gradient-based optimization procedure on a dropout neural network is equivalent to a specific variational approximation of a Bayesian neural network. Following [9], uncertainty can be estimated by averaging stochastic feed-forward Monte Carlo (MC) samples during inference. We apply and evaluate this idea in our system: dropout is kept active during inference, each testing trajectory with missing locations is fed into the trained model $S$ times, and the mean of the predictions is taken as the final output.
Practically, the expectation of $\mathbf{p}_t$, i.e., the probability for user $u$'s location at the $t$-th time slot of the $n$-th day, is called the predictive mean of the model over $S$ MC iterations:
$$\bar{\mathbf{p}}_t = \frac{1}{S} \sum_{s=1}^{S} \mathbf{p}_t^{(s)},$$
where $\mathbf{p}_t^{(s)}$ can be similarity-based or classifier-based, as introduced in Sec. 3.4. This mean value also represents the confidence of the model. In addition, the predictive entropy $\mathcal{H}_t = -\sum_l \bar{p}_{t,l} \log \bar{p}_{t,l}$, i.e., the entropy of the predicted probability over all locations, has been used as a measurement of uncertainty [9].
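The inference-time procedure can be sketched as follows (a minimal version, assuming a Keras-style model whose dropout layers honour training=True):

```python
import numpy as np

def mc_dropout_recover(model, trajectory, n_samples=30):
    """Monte Carlo Dropout sketch: keep dropout active at inference,
    repeat the forward pass n_samples times, and average the per-slot
    probability vectors. Returns the recovered locations together with
    per-slot predictive entropy and confidence."""
    samples = np.stack([np.asarray(model(trajectory, training=True))
                        for _ in range(n_samples)])     # (S, T, |L|)
    p_mean = samples.mean(axis=0)                       # predictive mean
    entropy = -(p_mean * np.log(p_mean + 1e-12)).sum(axis=-1)
    confidence = p_mean.max(axis=-1)                    # per-slot confidence
    return p_mean.argmax(axis=-1), entropy, confidence
```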

PERFORMANCE EVALUATION
In this section, we conduct extensive experiments to evaluate our model. We start by introducing the datasets and baselines, followed by the experimental setup, and then elaborate on the results.

Datasets
We experiment with two real-world datasets; our code, together with the second dataset, has been released publicly:
• Tencent: This data is collected from Tencent, the most popular social network and location-based service vendor, in Beijing, China, from June 1st to 30th, 2018. It records the GPS location of users whenever they request the localization service in the applications.
• Geolife: This open dataset was collected in the Microsoft Research Asia Geolife project from 182 users between April 2007 and August 2012, all over the world. Each trajectory is represented by a sequence of time-stamped points containing latitude, longitude, and altitude [46].
Pre-processing: To represent locations, we crawl the road network of Beijing from an online map and divide the area into 10,655 blocks, each regarded as a distinct location. We show these block boundaries in Fig. 4. Following [7], we set the time interval to 30 minutes for both datasets. For model training and testing, on Tencent we filter out trajectories with fewer than 34 recorded time slots (i.e., 70% of one day) and users with fewer than 5 days' trajectories; on Geolife we filter out trajectories with fewer than 12 recorded time slots and users with fewer than 5 days' trajectories. The final detailed statistics are summarized in Table 3.

Baselines
We compare our AttnMove with 8 representative baselines. The first four are non-deep-learning models and the last four are state-of-the-art deep learning models:
• Top [24]: A simple counting-based method. The most popular locations of each user in the training set are used as the recovery.
• History [23]: The most frequently visited location of each time slot in the historical trajectories is used for recovery.
• Linear [16]: A rule-based method that recovers locations by assuming users move in a straight line at uniform speed.
• RF [23]: A feature-based machine learning method. The entropy and radius of each trajectory, the missing time slot, and the locations before and after the missing time slot are extracted as features to train a random forest classifier for recovery.
• LSTM [24]: It models the forward sequential transitions of mobility with a recurrent neural network and uses the prediction for the next time slot as the recovered location.
• BiLSTM [44]: It extends LSTM with a bi-directional recurrent neural network to consider the spatial constraints given by all observed locations.
• DeepMove [8]: Besides modelling sequential transitions, DeepMove incorporates historical trajectories via attention for next-location prediction. Its prediction result is used for recovery.
• Bi-STDDP [40]: The latest method, which jointly models user preference and the spatial-temporal dependence given the two locations visited before and after the targeted time slot to identify the missing location.
Apart from these baselines, to evaluate the effectiveness of our method of exploiting history, we also compare AttnMove with its simplified version AttnMove-H, in which the history processor and the inter-trajectory attention layer are removed.

Experimental Settings
To evaluate the performance, we mask some time slots as ground truth to be recovered. Since about 20% of locations are missing in the raw datasets, we randomly mask 30 and 10 time slots per day for Tencent and Geolife, respectively. It is worth noting that the original trajectories are sparse, i.e., some time slots contain no location records; when generating masks, we therefore avoid selecting those slots, guaranteeing that the ground-truth locations are known.
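The masking procedure can be sketched as follows (a minimal version; None marks missing slots as before):

```python
import random

def mask_observed_slots(trajectory, n_mask, missing=None):
    """Sample n_mask slots to hide, drawing only from slots that actually
    have a recorded location so the ground truth is known;
    originally-missing slots are never selected."""
    observed = [t for t, loc in enumerate(trajectory) if loc is not missing]
    masked_slots = random.sample(observed, n_mask)
    masked = list(trajectory)
    for t in masked_slots:
        masked[t] = missing
    return masked, {t: trajectory[t] for t in masked_slots}
```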
We sort each user's trajectories by time and, starting from the fourth day (to guarantee that each trajectory has at least three days of history), take the first 70% as the training set, the following 10% as the validation set, and the remaining 20% as the test set. Linear, Top and History are per-user models, while the other models are shared across users.

Table 4. Recall comparison of different component combinations, where the first field (Graph/Train) is the method for location embedding, the second (Trig/Train) denotes the time embedding, the third (With/Without) is for the user embedding, and the last (Similarity/Classification) distinguishes the two output methods.

Our model is implemented in TensorFlow, and training is run on a Linux server with a TITAN Xp GPU and an Intel(R) Xeon(R) CPU @ 2.20GHz. To avoid over-fitting and to enable uncertainty estimation, the fully connected layers in the model are trained with dropout (dropping rate = 0.2). At inference time, dropout remains active to obtain the mean and standard deviation for uncertainty estimation.
We employ the widely used metrics Recall and Mean Average Precision (MAP). Recall is 1 if the ground-truth location is recovered with the maximum probability, and 0 otherwise; the final Recall is the average over all instances. MAP is a global evaluation for ranking tasks, so we use it to evaluate the quality of the whole ranked list over all locations. The larger these two metrics, the better the performance. We also use the Distance metric, i.e., the geographical distance between the centre of the recovered location and that of the ground truth. The smaller the Distance, the better the performance.
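For reference, minimal per-slot implementations of the three metrics are sketched below (the reported numbers are averages over all masked instances; the Earth radius and helper names are our own choices):

```python
import numpy as np

def recall_at_1(pred_loc, true_loc):
    """Recall as defined here: 1 if the top-probability location matches."""
    return float(pred_loc == true_loc)

def average_precision(probs, true_loc):
    """With a single ground-truth location per slot, this reduces to the
    reciprocal rank of the truth in the ranked list; averaging over all
    masked slots yields MAP."""
    rank = int((probs > probs[true_loc]).sum()) + 1
    return 1.0 / rank

def distance_km(p, q):
    """Great-circle distance between block centres, (lat, lon) in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (*p, *q))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))
```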

Experiment Results
Through the two datasets and the above-defined evaluation metrics, we aim to answer the following research questions:
• RQ1: Which combination of components in our model leads to the best performance?
• RQ2: How does our proposed AttnMove model perform compared with state-of-the-art methods?
• RQ3: How does each attention component contribute to the overall performance?
• RQ4: What role does the estimated uncertainty play in our framework?
• RQ5: How robust is AttnMove w.r.t. different trajectory sparsity and spatial-temporal contexts?
• RQ6: Is AttnMove sensitive to its parameters?

Identify the best combination (RQ1). To answer the first research question, we conduct experiments based on the different sub-module choices proposed in Sec. 3.1 and 3.4 and compare Recall on the two datasets in Table 4, from which the following conclusions can be drawn:
• When comparing Group 1 to 2, or Group 3 to 4, we observe that trainable time embedding does not improve accuracy on either dataset compared with the simple trigonometric function. A plausible explanation is that in our studied problem, trajectories are extremely sparse, so it is intractable to learn any sequential regularity, and the learned time embedding cannot capture semantics that help recovery. We therefore choose the trigonometric function for the time embedding.
• When we look at Groups 1 and 3, or compare Groups 2 and 4, a significant performance gain can be observed when the location embedding is learned with the model instead of being fixed by the graph representation.
Although sequential regularity cannot be modelled, geographic proximity can still be captured, and the learned location embedding is more expressive than the LINE representation. To further validate this, we cluster the regions using their embeddings as features via k-means with Euclidean distance. Fig. 5 shows their geographic distribution with the clustering results: Fig. 5(a) and Fig. 5(b) present the correlation between the embedding we learned and the graph embedding. We observe that adjacent locations generally share the same colour, indicating that they are also close in the embedding space; this demonstrates that spatial correlation has been modelled, while not only spatial proximity is learned. In conclusion, better and more semantics-aware imputation can be achieved by trainable location embedding.
• For user embedding, from Groups 4 and 5, we conclude that it is not necessary. Since we use the history as an enhanced input, which captures informative and personalized characteristics, it is reasonable that an additional user embedding vector cannot bring further improvement.
• For the output method, from Groups 5 and 6, we observe an insignificant difference. Hence, considering the memory cost of the classification-based method, we use the similarity-based one for the experiments in the following sections.
In summary, trainable location embedding and fixed trigonometric time embedding as input, together with similarity-based probability as output, form the most effective and efficient combination for our model.

Comparison to baselines. (RQ2).
We report the overall performance in Table 5 and make the following observations: (1) AttnMove outperforms all baseline methods in the majority of cases, which shows the effectiveness of our proposed framework and its exploitation of historical information. Specifically, the Recall and MAP of AttnMove outperform the best baselines by 4.0%∼7.6% on the two datasets. One plausible reason for the larger Distance of AttnMove compared to Linear may be that Geolife is collected from continuously moving devices rather than LBS users. Although Linear, with its assumption of straight movement, cannot accurately hit the correct location, it recovers within the right area and thus achieves a smaller distance to some extent. As Geolife is small-scale, we believe AttnMove could surpass Linear on all metrics with more training instances. (2) The utilization of historical information significantly improves the performance. Comparing AttnMove-H with AttnMove, Recall and MAP decline by 3.1%∼4.2%, and Distance errors rise by more than 100 meters. In addition, AttnMove outperforms DeepMove, which also uses history, demonstrating that AttnMove can extract more useful information from history through the proposed history processor and inter-trajectory attention mechanism.
Besides the quantitative analysis, we also conduct visualization studies. We visualize the attention weights of each head in each layer for different users and present the average attention weights of all trajectories in the final current intra-trajectory attention layer in Fig. 6. Overall, the diagonal entries are highlighted, indicating that the target location depends most strongly on the locations of adjacent time slots. The bright region in the bottom-left corner indicates that, apart from adjacent times, the location at each night-time slot is related to the locations of the whole night. This means that our deep attentive network can learn reasonable physical rules of human mobility. Moreover, Fig. 6(a) is clearly distinguished from Fig. 6(b), indicating that the attention heads in our model capture different semantics: Fig. 6(a) shows high correlation among working hours, while Fig. 6(b) reflects the whole day's general mobility pattern. Therefore, multi-head attention is of great importance for modelling semantic characteristics in our model.

Ablation Analysis (RQ3).
We analyze the contribution of each trajectory attention mechanism. We create ablations by removing them one by one, i.e., directly using the embedding of the corresponding time slot instead of a weighted combination. We report the results on the Tencent dataset in Table 6. As expected, AttnMove outperforms all the ablations, indicating that each attention mechanism improves recovery performance. Specifically, removing the current intra-trajectory attention causes a significant performance decline. This suggests that the intra-trajectory attention effectively strengthens the spatial constraints for the missing locations, leading to more accurate imputation. Among the attention mechanisms, location generation attention provides the least notable performance gain; hence, when model size and efficiency are a concern, it is feasible to remove this attention. Besides, although the inter-trajectory attention does not significantly boost Recall and MAP, it is effective in reducing Distance. This validates our hypothesis that historical trajectories contain useful information for missing location recovery.
It is worth noting that the datasets we experimented on cover various mobility patterns, some of which may be non-periodical. When we remove the historical intra-trajectory attention, our model mainly leverages the internal mobility constraints for recovery. Specifically, the resulting recall of 0.7002 is very close to the recall of 0.7037 yielded by the state-of-the-art baseline Bi-STDDP (see Table 5). This comparison verifies that even when no periodical history information can be used, our AttnMove remains competitive. On the other hand, it indicates that AttnMove is a general model that can be deployed without excluding any trajectories.

Uncertainty Understanding (RQ4)
By passing each trajectory through the trained model 30 times, we obtain the mean probability of the recovered locations. Thus, in addition to returning the location with the highest recovery probability, we can also provide point-wise uncertainty. For the masked locations in the test trajectories, we report the distributions of predictive entropy and confidence in Fig. 7. We can observe that false recoveries tend to have higher predictive entropy, indicating higher model uncertainty; this can be caused by data noise or unseen trajectory patterns. Besides, two separate peaks can be observed in the entropy distribution: a central value of 0.5 for true recoveries versus 0.6 for false ones. This suggests that we can set a threshold for confident trajectory recovery: if the entropy of a predicted location is higher than the threshold, the generation is not recommended for use in downstream applications, as it would act as a disturbance. The same conclusion can be obtained using the mean softmax probability (i.e., the model confidence) as a measurement of uncertainty, with smaller confidence indicating higher uncertainty.
Based on these findings, we further examine the performance gain when excluding less confident location generations. To achieve this, we set the threshold for entropy to 0.0, 0.2, 0.4, 0.5, 0.6, and 0.65, and for confidence to 0.5, 0.6, 0.7, 0.8, and 0.9, respectively. Note that a threshold of 0.0 for entropy and 0.5 for confidence means no generation will be removed. For entropy, we exclude generations with entropy higher than the threshold, while for confidence, generations with lower confidence are excluded. We show the performance in Fig. 8. MAP rises significantly as the threshold increases. Considering the amount of generations needed to effectively densify the trajectory, an entropy threshold of around 0.5 or a confidence threshold of 0.8 is a good trade-off.
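Operationally, this filtering is a simple post-processing rule over the MC Dropout outputs; a minimal sketch with the trade-off thresholds suggested above (function name is illustrative):

```python
def filter_by_uncertainty(preds, entropies, confidences,
                          max_entropy=0.5, min_confidence=0.8):
    """Keep only recoveries whose predictive entropy is at most
    max_entropy and whose confidence is at least min_confidence;
    the defaults are the trade-off values suggested by our experiments."""
    return [p for p, h, c in zip(preds, entropies, confidences)
            if h <= max_entropy and c >= min_confidence]
```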

Robustness Analysis (RQ5).
We also conduct experiments to evaluate the robustness of AttnMove when applied in different scenarios. First, we study the recovery performance w.r.t. the missing ratio, i.e., the percentage of missing locations. The results are presented in Table 7. As the missing ratio rises from 50% to almost 100%, correct recovery becomes harder, while our model maintains a significant gap (i.e., more than 3%) over the state-of-the-art baseline. Second, we present the performance at different times of day and for trajectories with different numbers of visited locations. As we can observe from Fig. 9, during the daytime, when people are more likely to move, or for trajectories where numerous locations are visited, correct recovery is more difficult and Recall declines. Nevertheless, our Recall always outperforms the baselines by more than 5%. These results demonstrate the robustness of our proposed model in different scenarios for the general population.

Parameter Analysis (RQ6). We investigate the sensitivity to the embedding dimension, the number of heads, and the number of layers, which determine the representation ability of the model. Fig. 10(a) presents the performance for different embedding dimensions. We can observe that as the dimension increases, the performance gradually improves, and beyond 128 the performance becomes stable; this is why we select an embedding size of 128. We then conduct a grid search over the numbers of layers and heads; Fig. 10(b) partly shows the results. We find that a larger number of layers generally achieves better performance, while the impact of the number of heads is not significant. Considering that more layers make the model more expressive but require more computational cost, to compromise between performance and efficiency, we fix the number of layers and heads at 4 and 8, respectively.

DEPLOYMENT AND APPLICATION
Overall, AttnMove is promising for recovering missing records in mobility datasets, and the enhanced mobility data can benefit a wide spectrum of practical applications. Large-scale mobility data generated by LBS at Tencent has great potential to benefit various applications. However, individuals' trajectories are unexpectedly sparse: if each day is divided into 1-hour time slots, only 32.5% of slots are recorded with locations on average, and for 30-minute, 10-minute, and 1-minute time slots, the number is 24.5%, 15.3%, and 2.5%, respectively. This sparsity weakens the usage of such trajectories for practical problems, including dynamic population sensing, POI recommendation, etc., in terms of accuracy and robustness. To this end, AttnMove works as an essential pre-processing module between the raw data and downstream tasks, as shown in Fig. 11. We take travel mode detection as an example to illustrate how AttnMove is deployed and what its effects are. Application Task. The travel mode detection task is to uncover the transportation mode of individuals when the recorded location changes. With few records per day, the initial solution is to search all feasible paths between each pair of locations and choose one path that satisfies the travel time constraints. However, detecting the travel mode from such a sparse trajectory carries great uncertainty given the limited information [38]. With AttnMove recovering more of the locations the user visited, the uncertainty can be significantly reduced, as the recovered locations impose more constraints on the travel route.
Deployment. As shown in Fig. 11, when employing AttnMove, a trajectory filter is placed at the very beginning of the framework. The filter divides trajectories into dense and sparse ones, controlled by a threshold chosen according to the downstream application. For travel mode detection, trajectories with more than 80% of time slots recorded (at 10-minute granularity) per day are considered dense, while the rest are sparse. After obtaining the dense trajectories, we split the data into training and validation sets and mask 80% of the records as ground truth for AttnMove model learning, as introduced in Section 4.2.1. Then, the trained model is applied to both the dense and sparse trajectories to recover all missing records. After enhancement, each location pair, together with the recovered locations, is used to detect the travel mode.
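The trajectory filter itself reduces to a per-day density check; a minimal sketch (the slot count assumes 10-minute granularity, and names are illustrative):

```python
def split_by_density(trajectories, n_slots=144, threshold=0.8, missing=None):
    """Route each day-trajectory by its fraction of recorded slots:
    with 10-minute slots (144 per day), days above the threshold go to
    model training (dense); the rest are queued for recovery (sparse)."""
    dense, sparse = [], []
    for traj in trajectories:
        observed = sum(loc is not missing for loc in traj)
        (dense if observed / n_slots > threshold else sparse).append(traj)
    return dense, sparse
```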
It is also worth noting that although most of the above-mentioned location-based services are real-time, our model does not need to be deployed online or in real-time. As a data augmentation technique, it can be run to infer the missing locations of the current day's trajectory when the location service is not in frequent use, e.g., at midnight. During inference, recovering one trajectory costs around 30 ms, and the procedure can be run in parallel for multiple users. Hence, the model is efficient enough to prepare the data for the next day's utilization.
Effects. As AttnMove achieves a MAP of over 80% on the real-world dataset, the available location records increase by 4.3 fold: the missing ratio is 84.3% on average, so originally only 15.7% of records are available, and correctly recovering 80% of the missing 84.3% raises the ratio from 15.7% to 83.1%. Validated against bus smart card data and online map navigation data, AttnMove is proved to improve the accuracy of travel mode detection by 40%∼50% compared to detection without enhancement. It also significantly narrows the variance of different metrics, as the uncertainty is reduced by the additional recovered locations.

RELATED WORK

Trajectory Recovery
Recovering missing values in time series has been extensively investigated, and deep learning models have achieved promising performance [25]. However, such methods are insufficient for mobility data due to their inability to model spatial-temporal dependence and users' historical behaviours. Apart from these recovery methods, some works on mobility prediction can also be adopted for recovery. Feng et al. incorporated the periodicity of mobility learned from history with the location transition regularity modelled by a recurrent neural network (RNN) to predict the next location [8]. However, its performance declines in our setting because it cannot model the spatial-temporal dependence of a sparse trajectory. As such, models specifically for mobility data recovery have been studied. Li et al. used trajectory-specific features such as entropy and radius for spatial-temporal dependence modelling [23], but failed to exploit history. By representing a user's long-term history as a tensor with three dimensions, day-of-the-month, time-of-the-day, and location-of-the-time, a tensor factorization-based method was proposed [7]. However, it requires the tensor to be low-rank and thus cannot model the randomness and complexity of mobility. Recently, Bi-STDDP was designed to represent a user's history by a vector and combine it with the spatial-temporal correlations for recovery [40]. However, the expressiveness of the vector is limited, as it cannot reflect the dynamic importance of history. Overall, neither the general time series recovery methods, the trajectory prediction methods, nor the existing mobility data recovery methods can tackle the intrinsic challenges of the mobility recovery problem. In contrast, we propose an attentive neural network-based model, which better models the spatial-temporal dependence of sparse trajectories and exploits history more efficiently.

Attentive Neural Network
The recent development of attention models has established new state-of-the-art results in a wide range of applications. Attention was first proposed in the context of neural machine translation [4] and has been proven effective in a variety of tasks such as question answering [32], text summarization [28], and recommender systems [15, 30]. Vaswani et al. [35] further proposed multi-head self-attention, known as the Transformer, to model complicated dependencies between words for machine translation. It made great progress in sequence modelling through a fully attention-based architecture, which dispenses with RNNs yet outperforms RNN-based models. Closely related to our methodology, [17] proposed leveraging the Transformer for sentence imputation, verifying that the attention mechanism can effectively capture semantics from a few tokens. Researchers have also shown consistent performance gains by combining attention with RNNs for mobility modelling, such as location prediction [8] and personalized route recommendation [37], where attention compensates for the RNN's limitations in capturing long-term temporal dependence. Different from prior works, we are the first to adopt a fully attentive neural network, following the Transformer, to tackle the mobility data recovery problem with both intra- and inter-trajectory attention.

Uncertainty Quantification
It is known that deep neural networks are poor at providing uncertainty estimates and tend to produce overconfident predictions [3]. Therefore, uncertainty quantification (UQ) is receiving growing interest in deep learning [19]. There are roughly two types of UQ methods in deep learning. One leverages frequentist thinking and focuses on robustness: perturbations are applied to the inference procedure, in the initialisation [11] or in the datasets [21, 41]. However, this type of method usually incurs very high computational or memory cost. The other type is Bayesian, aiming to model posterior beliefs over the network parameters given the data [18]. Nevertheless, the posterior is intractable to derive for the vast number of parameters of modern deep neural networks. To ease the computation, variational inference [5] and Monte Carlo Dropout (MCDropout) [10] were proposed as approximations.
Deep learning has shown increasing promise for spatial-temporal modelling. However, most prior works focus on point estimates without quantifying the uncertainty of the predictions. In high-stakes domains, including mobile sensing and urban computing, the ability to generate probabilistic forecasts with confidence intervals is critical for risk assessment and decision making. Yet, this is under-studied for spatial-temporal data. Wu et al. overviewed and empirically studied several uncertainty quantification baselines for continual spatial-temporal forecasting [39], showing that even simple uncertainty quantification methods outperform deterministic models that only generate point estimates. Compared to their work, we propose to incorporate MCDropout into our Transformer-based trajectory recovery framework to achieve uncertainty-aware missing location generation.

CONCLUSION AND FUTURE WORK
In this paper, we investigate the problem of trajectory recovery for mobile users with sparse location records. To fill in the missing locations at a given time interval, we propose an attentive neural network-based model, AttnMove, which can model mobility patterns from sparse records and efficiently exploit history for recovery. We evaluate our model on two real-world datasets. The results demonstrate significant improvements (4.0%∼7.6%) over state-of-the-art methods. Our work paves the way for a wide range of downstream mobility-related applications, such as travel mode detection and population density estimation.
One limitation of our study is that we only used random masks for experimental evaluation. Since our study aims to address the issue of sparse GPS records, the locations that are intrinsically missing are the ones we ultimately expect to recover. However, due to the lack of suitable data, i.e., no available mobility dataset contains both complete and naturally incomplete GPS records at the same time, we cannot train and test our model on locations that are realistically missing. For a practical evaluation and a fair comparison to baselines, we used random masking to simulate the characteristics of intrinsic sparsity and tested different missing ratios to evaluate our model holistically. We plan to conduct a more realistic evaluation as future work when more data becomes available.
Another direction of future work will be to examine the mobility recovery performance of different demographic groups.In this study, we have demonstrated that our proposed method outperforms a wide range of baselines for the general population, but it is unclear how much the history-based strategy can benefit various users, such as citizens and tourists.

Fig. 1. An example illustrating the characteristics of mobility data, where one day's location records from an individual user are sparse, but it is possible to recover the user's fine-grained trajectory by aggregating one week's records.

Fig. 3. The architecture of the proposed trajectory attention mechanisms under one head, where the new representation is a combination of the embeddings of the value time slots, conditioned on the attention weights.

Fig. 4. Illustration of the spatial resolution of the trajectories we aim to recover. The city is Beijing, where solid lines denote block boundaries and the area of each block is 0.265 km² on average.


Fig. 5. Visualization of location representation: (a) location embedding with colour denoting cluster; (b) correlation with geographic distribution. More than 60% of location embeddings have a high similarity (0.85) to the graph embedding.

Fig. 6. Attention weight distribution in the final decoder layer, where the colour denotes the weight and brighter means larger.

Table 6. Impact of attention mechanisms on the Tencent dataset, where ↓ denotes the performance decline compared with the complete AttnMove.

Fig. 7. (a) CDF of Entropy. (b) PDF of Entropy. (c) PDF of Confidence.

Fig. 8. (a) Excluded by Entropy. (b) Excluded by Confidence.

Fig. 9. Model performance in different scenarios, where the solid lines denote Recall and the shades show the probability distribution.

Fig. 10. Performance of AttnMove with varying embedding size, number of attention heads, and number of model layers.


Table 1. Notations used in model design.

Table 2. Summary of embedding strategies.

Table 3. Basic statistics of the mobility datasets.

Table 5. Overall performance comparison. The best result in each column is in bold, while the second best is underlined.