Explainable Trajectory Representation through Dictionary Learning

Trajectory representation learning on a network enhances our understanding of vehicular traffic patterns and benefits numerous downstream applications. Existing approaches using classic machine learning or deep learning embed trajectories as dense vectors, which lack interpretability and are inefficient to store and analyze in downstream tasks. In this paper, an explainable trajectory representation learning framework through dictionary learning is proposed. Given a collection of trajectories on a network, it extracts a compact dictionary of commonly used subpaths called "pathlets", which optimally reconstruct each trajectory by simple concatenation. The resulting representation is naturally sparse and encodes strong spatial semantics. Theoretical analysis of our proposed algorithm is conducted to provide a probabilistic bound on the estimation error of the optimal dictionary. A hierarchical dictionary learning scheme is also proposed to ensure the algorithm's scalability on large networks, leading to a multi-scale trajectory representation. Our framework is evaluated on two large-scale real-world taxi datasets. Compared to previous work, the dictionary learned by our method is more compact and achieves a higher reconstruction rate for new trajectories. We also demonstrate the promising performance of this method in downstream tasks including trip time prediction and data compression.


INTRODUCTION
The development of information technology and the widespread use of mobile devices have produced a large amount of GPS trajectory data. Raw trajectory data typically appears as variable-length ordered sequences, which cannot be directly input into common data mining algorithms. Trajectory representation learning, i.e., transforming a trajectory into an embedding vector, can standardize trajectory data, extract valuable information from redundant original data, and benefit various downstream tasks including trajectory compression and trip time estimation [1].
Recently, various deep learning based models for trajectory representation learning have been developed. For example, Yang et al. [2] introduced a self-attention based model (T3S) that automatically adjusts the importance of spatial and structural information for different similarity measures, and showed its effectiveness for trajectory similarity computation. In addition, the authors of [3] proposed a trajectory encoder-decoder network based on a graph attention mechanism to obtain trajectory embeddings, evaluated on a vehicle trajectory prediction task. Before the emergence of these deep learning based methods, researchers also explored this field using traditional algorithms, including [4], where the authors introduce a pipelined algorithm that extracts frequent underlying paths called corridors from trajectories and evaluates them using a Minimum Description Length (MDL) score. Besides that, Zou et al. [5] extracted mid-level features from trajectories for clustering using a cluster-specific Latent Dirichlet Allocation model. However, the representations generated by previous methods are usually dense vectors whose dimensions lack semantic meaning. As a result, it is difficult to interpret the learned representations.

In this paper, we introduce an explainable trajectory representation method based on dictionary learning for trajectories on a network. The network is usually a road map for vehicle trajectories or a grid network for unstructured trajectories, onto which trajectories can be projected using map matching [6]. The basic idea is demonstrated in Figure 1. Given a collection of trajectories on a network, our method extracts a compact dictionary of commonly used subpaths called "pathlets". Each trajectory can then be reconstructed by concatenating pathlets from the dictionary, similar to constructing a sentence by assembling a group of words. The resulting trajectory representation is a sparse binary vector, where each dimension corresponds to a pathlet in the learned dictionary and each binary variable indicates whether the corresponding pathlet is used to reconstruct the trajectory. This design is motivated by the observation that people's travel behavior exhibits remarkable regularity, enabling us to reconstruct the majority of trajectories using a small set of movement patterns.
The pathlet representation of trajectories was first explored by Chen et al. [7], who formulated the pathlet learning problem as a combinatorial optimization problem. Solved approximately using dynamic programming, the original formulation is costly to compute and lacks theoretical guarantees. We propose an algorithm based on a novel dictionary learning formulation that provides better optimality and scalability for large trajectory datasets. Specifically, our objective function simultaneously minimizes the size of the pathlet dictionary and the average number of pathlets required to reconstruct each trajectory. We propose an efficient solution to this integer programming problem by first solving its relaxed version and then finding an integer solution using randomized rounding. To ensure scalability to large-scale road networks, we further propose a hierarchical representation scheme that computes pathlets of different granularity in a multi-scale spatial partition of the map. The algorithm is evaluated on two real-world taxi datasets, and some frequent mobility patterns are visualized. We also demonstrate the promising performance of this method in downstream tasks. For example, our method outperforms neural-network based methods by 4.7% in prediction accuracy on trip time prediction.

PRELIMINARY
Terminology. Given a trajectory dataset 𝒯 and a roadmap formed as a directed graph G = (V, E), a trajectory t ∈ 𝒯 is defined as a sequence of edges on G. For each t, a path p on G is a candidate pathlet if p is a subpath of t. We denote the set of all candidate pathlets traversed by 𝒯 as P.
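To make the terminology concrete, here is a minimal sketch of candidate-pathlet enumeration, with trajectories represented as tuples of edge IDs (the function name and data layout are illustrative, not from the paper):

```python
def candidate_pathlets(trajectories):
    """Enumerate all contiguous subpaths (candidate pathlets) of each trajectory."""
    candidates = set()
    for traj in trajectories:
        n = len(traj)
        for i in range(n):
            for j in range(i + 1, n + 1):
                candidates.add(tuple(traj[i:j]))  # subpath as a tuple of edge IDs
    return candidates

# Two toy trajectories over edge IDs
trajs = [("e1", "e2", "e3"), ("e2", "e3")]
cands = candidate_pathlets(trajs)
```

Note that only contiguous subpaths qualify: ("e1", "e3") is not a candidate pathlet of ("e1", "e2", "e3").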
Given a pathlet dictionary D and a trajectory t, φ(D, t) is a subset of D such that t can be represented by concatenating the pathlets p ∈ φ(D, t). This process is denoted by t = concat(φ(D, t)). Furthermore, the representation cost c(D, t) refers to the minimal number of elements required to represent t, defined as c(D, t) = min |φ(D, t)|.

Problem definition. The goal is to find an optimal dictionary D that minimizes the following two objectives at the same time: 1) the size of the dictionary, since a smaller dictionary contains less redundant information and is therefore more desirable; 2) the average number of elements required to reconstruct the trajectories. We use a hyperparameter λ to control the trade-off between these two objectives. Therefore, similar to [7], in this paper the pathlet dictionary learning problem is defined as:

min_D |D| + (λ / |𝒯|) · Σ_{t ∈ 𝒯} c(D, t).
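The representation cost c(D, t) can be computed with a small dynamic program over trajectory prefixes, analogous to minimal word segmentation, and plugged into the objective |D| + λ times the average cost. A sketch (function names and data layout are hypothetical):

```python
def representation_cost(dictionary, traj):
    """Minimum number of dictionary pathlets whose concatenation equals traj.
    Returns None if traj cannot be reconstructed. DP over prefix lengths."""
    n = len(traj)
    INF = float("inf")
    best = [0] + [INF] * n  # best[k]: fewest pathlets covering traj[:k]
    for k in range(1, n + 1):
        for i in range(k):
            if best[i] != INF and tuple(traj[i:k]) in dictionary:
                best[k] = min(best[k], best[i] + 1)
    return None if best[n] == INF else best[n]

def objective(dictionary, trajectories, lam):
    """Dictionary size plus lambda times the average representation cost."""
    costs = [representation_cost(dictionary, t) for t in trajectories]
    if any(c is None for c in costs):
        return float("inf")  # some trajectory cannot be reconstructed
    return len(dictionary) + lam * sum(costs) / len(trajectories)

# Toy dictionary of pathlets (edge-ID tuples)
D = {("e1",), ("e2",), ("e3",), ("e1", "e2")}
```

Here ("e1", "e2", "e3") is reconstructed with two pathlets, ("e1", "e2") followed by ("e3",).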

METHODOLOGY

3.1 Problem Formulation
To formulate the problem defined above in matrix notation, we introduce three binary matrices: T ∈ {0,1}^{|𝒯|×|E|} recording which edges each trajectory passes, P ∈ {0,1}^{|P|×|E|} recording which edges each candidate pathlet passes, and a decision matrix S ∈ {0,1}^{|P|×|𝒯|} with S_{i,j} = 1 if pathlet p_i is used to represent trajectory t_j. The quantity max(S_{i,:}) refers to the maximum value of the i-th row of S, which equals 1 if any trajectory utilizes p_i to represent itself. In other words, max(S_{i,:}) = 1 means that candidate pathlet p_i is selected as an element of the dictionary. We reuse the notation D to represent the matrix form of the dictionary, which is the submatrix formed by the selected rows of P, i.e., D = P[{i | max(S_{i,:}) = 1}, :], and therefore |D| = Σ_{i=1}^{|P|} max(S_{i,:}). The constraint SᵀP = T corresponds to the requirement that each trajectory be reconstructed using pathlets. In this optimization problem, the dictionary and the assignment relationship are optimized at the same time. It is worth noting that the pathlet learning problem described above is NP-hard in most cases. Therefore, an effective algorithm is required to obtain good approximate solutions.
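A toy construction of the cover matrices T and P described above (illustrative helper, with edges and trajectories given as Python tuples):

```python
import numpy as np

def incidence_matrices(trajectories, pathlets, edges):
    """Binary cover matrices: T[i,j] = 1 iff trajectory i passes edge j;
    P[i,j] = 1 iff candidate pathlet i passes edge j."""
    eidx = {e: j for j, e in enumerate(edges)}
    T = np.zeros((len(trajectories), len(edges)), dtype=int)
    for i, t in enumerate(trajectories):
        for e in t:
            T[i, eidx[e]] = 1
    P = np.zeros((len(pathlets), len(edges)), dtype=int)
    for i, p in enumerate(pathlets):
        for e in p:
            P[i, eidx[e]] = 1
    return T, P

edges = ["e1", "e2"]
trajs = [("e1", "e2"), ("e1",)]
pathlets = [("e1",), ("e2",), ("e1", "e2")]
T, P = incidence_matrices(trajs, pathlets, edges)
```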

Pathlet Dictionary Learning with Randomized Rounding
The proposed algorithm consists of two main steps. First, we relax the binary constraint, which transforms the original optimization problem into a convex optimization problem, so the globally optimal fractional solution S* can be found by projected gradient descent. Then a randomized rounding step is carried out to obtain the final binary solution S̃. The whole procedure is shown in the pseudocode of Algorithm 1.
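A toy-sized sketch of the two steps. The quadratic penalty standing in for the constraint SᵀP = T, the step sizes, and the function names are all our own assumptions (the paper's Algorithm 1 is not reproduced here), and the rounding step checks coverage with ≥ rather than exact equality for simplicity:

```python
import numpy as np

def solve_relaxed(T, P, lam=0.1, mu=10.0, steps=3000, lr=0.005):
    """Projected (sub)gradient descent on a relaxed, penalized objective:
    sum_i max_j S[i,j] + (lam/n_traj) * sum(S) + mu * ||S^T P - T||_F^2,
    clipping S to [0, 1] after every step (the projection)."""
    n_path, n_traj = P.shape[0], T.shape[0]
    S = np.full((n_path, n_traj), 0.5)
    for _ in range(steps):
        g = np.zeros_like(S)
        g[np.arange(n_path), S.argmax(axis=1)] = 1.0  # subgradient of row-max term
        g += lam / n_traj                              # representation-cost term
        resid = S.T @ P - T                            # coverage residual
        g += 2.0 * mu * (P @ resid.T)                  # gradient of the penalty
        S = np.clip(S - lr * g, 0.0, 1.0)              # projection onto [0, 1]
    return S

def randomized_round(S_star, delta, T, P, rng, max_tries=500):
    """Set each entry to 1 with probability min(1, delta * S*[i,j]); retry
    until the rounded S covers every edge of every trajectory."""
    for _ in range(max_tries):
        S = (rng.random(S_star.shape) < np.minimum(1.0, delta * S_star)).astype(int)
        if np.all(S.T @ P >= T):
            return S
    return None

# Toy instance: 2 trajectories, 2 edges, 3 candidate pathlets
T = np.array([[1, 1], [1, 0]])          # t1 = (e1, e2), t2 = (e1)
P = np.array([[1, 0], [0, 1], [1, 1]])  # p1 = (e1), p2 = (e2), p3 = (e1, e2)
rng = np.random.default_rng(0)
S_star = solve_relaxed(T, P)
S_rounded = randomized_round(S_star, 3.0, T, P, rng)
```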
Algorithm 1 clips the result after each gradient step to ensure 0 ≤ S ≤ 1.

Probabilistic Bound. Given the constant matrices (T, P) and hyperparameters (λ, δ), we claim that the final solution S̃ satisfies

Pr[ S̃ᵀP = T and cost(S̃) ≤ 2(δ+1)·cost(S*) ] ≥ 1/2 − |𝒯|·e^{−δ}.

In practice, |P| can be quite large. We pre-filter out less frequently used candidates to alleviate the computational burden; please refer to Appendix B for details.
This inequality means that the probability of finding a low-cost solution that covers all trajectories at the same time is lower bounded by a positive constant. Therefore, we can repeat the randomized rounding process to generate a series of candidates {S̃_1, S̃_2, ...} until a satisfactory solution is found. The proof can be found in Appendix A.

Hierarchical Pathlet Learning
The candidate pathlet space consists of all subpaths of trajectories in the dataset; its size is usually huge for real-world datasets, which makes solving the problem time-consuming. On the other hand, multi-scale pathlet dictionaries and trajectory representations can help people gain a deeper understanding of traffic characteristics. To enhance the scalability of the original algorithm, we introduce a hierarchical method called "pathlet of pathlets" that reduces the computational complexity and generates multi-scale trajectory representations.
Specifically, we first partition the roadmap into different levels of granularity using axis-aligned binary space partitioning. Starting from the bottom of the partition tree, we compute the k-th level pathlet dictionary D_k as the union of the dictionaries computed in all k-th level cells. Next, we use the k-th level pathlet representation of each trajectory as the input and compute the (k−1)-th level pathlet dictionary. This iterative process can be repeated to generate multi-scale pathlets that capture movement patterns at different spatial scales.
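The axis-aligned binary space partitioning step can be sketched as a recursive median split (illustrative only; the paper does not specify the exact splitting rule, so splitting the wider axis at the median is an assumption):

```python
def bsp_partition(points, depth):
    """Axis-aligned binary space partitioning: split along the wider axis
    at the median, recursively, producing up to 2**depth leaf cells."""
    if depth == 0 or len(points) <= 1:
        return [points]
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    axis = 0 if (max(xs) - min(xs)) >= (max(ys) - min(ys)) else 1
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return bsp_partition(pts[:mid], depth - 1) + bsp_partition(pts[mid:], depth - 1)

# 8 toy map locations, partitioned into 4 cells (depth 2)
pts = [(x, y) for x in range(4) for y in range(2)]
cells = bsp_partition(pts, 2)
```

Each leaf cell would then host one local run of the pathlet learning algorithm, whose dictionaries are unioned into D_k.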

Representing New Trajectories
Once we obtain a set of dictionaries at multiple scales, we can use them together to represent new trajectories. We define a unified dictionary matrix D′ by concatenating the dictionary matrices so that each column of D′ corresponds to one pathlet; the size of the unified dictionary is therefore equal to the number of columns of D′. Any new trajectory can be mapped to the new representation space using D′. Specifically, the representation vector is obtained by solving:

min_x Σ_i x_i  subject to  D′x = e,  x ∈ {0, 1}^{|D′|},

where e is the binary vector recording the edges covered by the new trajectory, and x denotes the representation vector that we aim to solve for. This problem can be viewed as a simplified version of the original problem because the dictionary is now fixed. We solve it using the same strategy described before: first compute the optimal fractional solution x* using gradient descent and then round it to obtain the final binary solution.

Comparison with previous work. Our research largely follows the problem formulation described in [7], but we adopt a different formulation and solution method. In that paper, the authors first relaxed max(S_{i,:}) to the individual variables S_{i,j} and then solved the problem using dynamic programming, which is simple and effective. However, this relaxation results in a redundant dictionary, leaving us room for improvement, especially when λ is small.
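For intuition, a brute-force version of this reduced problem on a tiny fixed dictionary (columns of D′ given as binary edge vectors; coverage is checked with ≥ instead of exact equality, and exhaustive search stands in for the gradient-plus-rounding solver, so this is an illustration only):

```python
from itertools import combinations

def represent(D_cols, e):
    """Smallest subset of dictionary columns whose union covers edge vector e.
    Brute force over subsets by increasing size; fine only for tiny dictionaries."""
    n = len(D_cols)
    for k in range(n + 1):
        for idx in combinations(range(n), k):
            covered = [0] * len(e)
            for i in idx:
                covered = [c or d for c, d in zip(covered, D_cols[i])]
            if all(c >= t for c, t in zip(covered, e)):
                return idx  # indices of the chosen pathlets
    return None

# Columns of a toy D': p1 covers e1, p2 covers e2, p3 covers both
D_cols = [[1, 0], [0, 1], [1, 1]]
```

Representing the trajectory covering both edges picks the single pathlet p3 rather than p1 and p2.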

EXPERIMENTS

4.1 Numerical Performance
In this experiment, the hyperparameters λ and δ are set to 0.1 and (1/4)·ln(2|𝒯|) respectively, and we randomly sample only 3 times using the strategy described before. As shown in Table 1, our approach generates a more compact and effective dictionary compared to the dynamic programming method, reducing the dictionary size by 43.01% and 36.36% respectively on the two datasets, while also achieving a lower representation cost. At the same time, the cover ratio is observed to be very close to 1. Here δ is set to (1/4)·ln(2|𝒯|) instead of the ln(2|𝒯|) required by the bound because, in the experiments, the method still produced a feasible low-cost solution within 3 random sampling cycles, which further validates the effectiveness of the previously derived probability bound.
4.1.2 Reconstruction Using a Multi-scale Dictionary. The hierarchical framework enables us to learn multi-level pathlet dictionaries on arbitrarily large maps and datasets with limited computational resources. In this section, we validate this claim by comparing, on test data, the dictionary learned directly on the whole map (denoted D_global) with the dictionaries generated by the hierarchical framework. Specifically, we randomly selected 10,000 trajectories from the Futian district as the training set to learn the dictionary and tested it on another 10,000 trajectories. In Table 2, D_2 denotes the dictionaries learned on regions of the second layer, and D_1 + D_2 refers to the multi-scale dictionary. The performance of D_global can be considered a ground truth of sorts, although it comes with significant computational cost. We observe that the reconstruction cost is much lower with the multi-scale dictionary than with D_2 alone. The performance of the multi-scale dictionary is close to that of D_global while consuming only 54% of the GPU memory required to train D_global, with computation time reduced by 20%.

Visualization of Pathlet Dictionary
Some frequent pathlets are visualized in Figure 3 to intuitively verify whether the algorithm finds common mobility patterns. For example, Figure 3(c) shows a pathlet corresponding to turning left on an overpass, and Figure 3(e) depicts Praça Mouzinho de Albuquerque, one of the famous attractions in Porto. These pathlets carry semantic meaning consistent with everyday experience and reveal common mobility patterns shared by numerous trajectories.

Application in Trip Time Prediction.
To demonstrate the effectiveness and usability of the representation vector, we utilize a simple GBDT model to predict trip time, whose input is the combination of the trajectory embedding vector and a time encoding vector. We use the mean absolute error (MAE) between the predicted result and the ground truth (in seconds) as the metric to train the GBDT model. The performance of all evaluated models is summarized in Table 3. It can be observed that our proposed algorithm ensures explainability of the results without compromising accuracy. One possible reason why our method outperforms the others is that our vectors are naturally sparse, which makes the model more robust on the test set and easier to train. This demonstrates the simplicity and effectiveness of our method, as well as its broad application prospects.

Application in Data Compression. Learning a dictionary and reconstructing trajectories using elements from this dictionary can also be considered a form of data compression. In [4], the authors described an evaluation method based on the Minimum Description Length (MDL) principle to measure compression performance:

score = (L(C) + L(D|C)) / L(D),

where L(·) refers to the size of a data collection in bits, D and C denote the dataset and the corridor set (a concept similar to pathlets), and D|C refers to the representation of the original trajectories using corridors. Compared to the previous method's score of 0.27 reported in [4], our method achieves a score of 0.21. One possible reason is that our objective function and the MDL score are consistent, whereas the LDA-based method in [4] does not optimize the MDL score explicitly. This experiment indicates that transforming trajectories into pathlet form can effectively compress the data, facilitating easier storage and transmission.
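A schematic version of such an MDL-style score, under the simplifying assumption that every edge ID and every pathlet reference costs a fixed number of bits (the exact bit accounting of [4] is not reproduced here; names are illustrative):

```python
import math

def mdl_score(trajectories, dictionary, encodings):
    """Schematic MDL ratio: (bits for the dictionary + bits for the encoded
    trajectories) divided by bits for the raw trajectories. Each raw symbol
    costs log2(#edges) bits and each pathlet reference log2(#pathlets) bits."""
    edges = {e for t in trajectories for e in t}
    bits_edge = math.log2(max(2, len(edges)))
    bits_ref = math.log2(max(2, len(dictionary)))
    L_D = sum(len(t) for t in trajectories) * bits_edge          # raw data
    L_C = sum(len(p) for p in dictionary) * bits_edge            # dictionary
    L_D_given_C = sum(len(enc) for enc in encodings) * bits_ref  # encoded data
    return (L_C + L_D_given_C) / L_D

# Four identical 4-edge trajectories, all encoded by one 4-edge pathlet
trajs = [("e1", "e2", "e3", "e4")] * 4
dictionary = [("e1", "e2", "e3", "e4")]
encodings = [(0,)] * 4
score = mdl_score(trajs, dictionary, encodings)
```

A score below 1 indicates that the dictionary plus the encoded trajectories are smaller than the raw data.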

CONCLUSION AND FUTURE WORK
In this study, we reformulated the problem of learning pathlets from a collection of trajectories and solved it using a novel dictionary learning based method, resulting in a hierarchical and explainable representation of trajectories with a theoretical probability bound. We tested our algorithm on two large-scale datasets. The output pathlet dictionary provides deeper insight into mobility patterns, and we demonstrated how pathlets can benefit downstream tasks such as trip time estimation and trajectory compression. In future work, we will adapt our algorithm to represent trajectories in other domains, improve the numerical optimization, and further advance the theoretical analysis.
A PROOF OF THE PROBABILISTIC BOUND

Step 1 (feasibility). Consider the coverage constraint associated with trajectory t_j and edge e_k. If T_{j,k} = 0, the constraint is satisfied automatically. For each element with T_{j,k} = 1, the probability that the constraint is not satisfied, i.e., that edge e_k is not covered for t_j, is Pr(Σ_i P_{i,k} S̃_{i,j} = 0). Since the fractional solution satisfies the relaxed constraint and, for x ≥ 0, 1 − min(x, 1) ≤ e^{−x}, this probability is at most e^{−δ}.

Step 2 (cost). Based on Lemma 1, the expected dictionary-size term of the rounded solution is bounded in terms of the fractional one; combining it with the expected representation-cost term gives E[cost(S̃)] < (δ+1)·cost(S*).

Step 3. By Markov's inequality, Pr[cost(S̃) > 2(δ+1)·cost(S*)] < 1/2. Call it a bad event if some coverage constraint is violated or cost(S̃) > 2(δ+1)·cost(S*). By the union bound, the probability that a bad event happens is less than 1/2 + |𝒯|·e^{−δ}. Thus, if δ ≥ ln(2|𝒯|), with positive probability no bad event happens and the cost of the final solution S̃ is at most 2(δ+1)·cost(S*).

B EXPERIMENT SETUP

B.1 Dataset
The following describes the trajectory datasets used in our study; key statistics of the datasets are summarized in Table 4.
Shenzhen. Zhang et al. [8] released this dataset, containing approximately 510k dense trajectories generated by 14k taxi cabs in Shenzhen, China; it can be downloaded at [13].
Porto. This dataset contains trajectories of 442 taxis operating in the city of Porto, Portugal [9]. Each taxi reports its location every 15 s. The dataset was used for the Trajectory Prediction Challenge @ ECML/PKDD 2015. Figure 4 displays the spatial distribution of the two datasets; there is a significant spatial imbalance in the distribution of trajectories, so in our experiments we focus on densely populated areas. To implement our hierarchical algorithm, we first need to partition the map into smaller regions. Specifically, for the Porto dataset, we select a 15.3 km × 13.5 km area in the city center and divide it into six regions. Similarly, for the Shenzhen dataset, we choose the city center area encompassing the Nanshan, Futian, Bao'an, and Luohu districts and divide it into 32 grids.
For these two datasets, we remove trajectories with fewer than 20 GPS sample points and use the method proposed in [14] to convert each trajectory into a sequence of edges on the roadmap. The matrices T and P are then generated as described in Section 3.1.

B.2 Evaluation Protocols and Platform
We randomly sampled 30% of the trajectories as the test set and use the remaining 70% as the training set. We evaluate the quality of a resulting pathlet dictionary from the following aspects:
• Size of the pathlet dictionary, i.e., the number of pathlets in the dictionary, which characterizes its compactness.
• Representation cost, i.e., the average number of pathlets used to reconstruct a trajectory, which measures the efficiency of using the dictionary to explain trajectories.
• Coverage ratio, which measures whether the dictionary covers the possible trajectories as comprehensively as possible.
Our method is implemented in Python and trained using an Nvidia A40 GPU. All experiments are run on Ubuntu 20.04 with an Intel Xeon Gold 6330 CPU.
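Given the binary matrices from Section 3.1, the three metrics can be read off directly from an assignment matrix S; a sketch (hypothetical helper name, coverage measured over trajectory edges):

```python
import numpy as np

def dictionary_metrics(S, T, P):
    """Metrics from the binary assignment matrix S (|P| x |T|):
    dictionary size, average representation cost, and coverage ratio."""
    size = int(S.max(axis=1).sum())            # pathlets used by any trajectory
    rep_cost = S.sum(axis=0).mean()            # avg pathlets per trajectory
    covered = np.minimum(S.T @ P, 1)           # edges covered via chosen pathlets
    coverage = (covered * T).sum() / T.sum()   # fraction of trajectory edges covered
    return size, rep_cost, coverage

# Toy instance: t1 uses p3 = (e1, e2); t2 uses p1 = (e1)
T = np.array([[1, 1], [1, 0]])
P = np.array([[1, 0], [0, 1], [1, 1]])
S = np.array([[0, 1], [0, 0], [1, 0]])
size, rep_cost, coverage = dictionary_metrics(S, T, P)
```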

B.3 Pre-filtering Method
In real-world scenarios, the number of candidate pathlets |P| is quite large, meaning that the matrices P and S are huge. Consequently, this poses a significant challenge for both computation and storage. At the same time, the pathlets we aim to identify are mobility patterns shared among multiple trajectories. Therefore, we can proactively filter out infrequent candidate pathlets without significantly impacting the results. Specifically, for each candidate pathlet p_i, we traverse the trajectory dataset to count the number of trajectories that pass through p_i, denoted c_i. A threshold τ is then set, and only candidate pathlets whose count exceeds the threshold are retained, yielding the filtered candidate set P′ = {p_i | p_i ∈ P, c_i > τ}.
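A direct sketch of this counting-and-thresholding step (illustrative names; the subpath test is a naive scan):

```python
from collections import Counter

def prefilter(trajectories, candidates, tau=3):
    """Keep only candidate pathlets traversed by more than tau trajectories."""
    counts = Counter()
    for p in candidates:
        for traj in trajectories:
            # p is traversed by traj if it appears as a contiguous subpath
            n, m = len(traj), len(p)
            if any(tuple(traj[i:i + m]) == p for i in range(n - m + 1)):
                counts[p] += 1
    return {p for p in candidates if counts[p] > tau}

# Four trajectories share (e1, e2); (e3,) appears only once and is dropped
trajs = [("e1", "e2")] * 4 + [("e3",)]
kept = prefilter(trajs, {("e1", "e2"), ("e3",)}, tau=3)
```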
In our implementation, τ was set to 3. To evaluate the effect of pre-filtering, we randomly selected 10,000 trajectories from the Futian district; the results are shown in Table 5. The filtering operation significantly reduces GPU memory usage and computation time while barely affecting the loss.

B.4 Implementation Details for Trip Time Prediction Task
Travel time prediction is a regression task aimed at forecasting the duration of a trip; the result is often strongly correlated with both the chosen route and the departure time. Our approach for encoding the departure time is inspired by the positional encoding mechanism proposed in [15]. Specifically, given the departure time t, we build the time encoding vector v_t from sine and cosine functions of t.hour and t.minute (the hour and minute of t) and concatenate it with the trajectory representation vector as input for the GBDT model. We use the mean absolute error (MAE) between the predicted result and the ground truth (in seconds) as the training metric. The workflow is illustrated in Figure 5. The key parameters of the GBDT are set as follows: maximum tree depth 5; number of estimators 100.
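A sketch of such a sine/cosine time encoding; the dimensionality and the particular frequencies (minute-of-day at a few scales) are our own assumptions, since the paper's exact formula is not reproduced here:

```python
import math

def time_encoding(hour, minute, dim=4):
    """Sine/cosine encoding of the departure time, in the spirit of
    positional encoding: minute-of-day at geometrically shrinking periods."""
    t = hour * 60 + minute                      # minute of day in [0, 1440)
    enc = []
    for k in range(dim // 2):
        period = 1440 / (2 ** k)                # full day, half-day, ...
        enc.append(math.sin(2 * math.pi * t / period))
        enc.append(math.cos(2 * math.pi * t / period))
    return enc

enc_midnight = time_encoding(0, 0)
enc_noon = time_encoding(12, 0)
```

The resulting vector would simply be concatenated with the sparse pathlet representation before being fed to the GBDT.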
To ensure a fair comparison, we also generated a short-trip version of the original Porto dataset following the sampling method described in [12]. We tested the algorithms on both datasets; the metrics include MAE, MAPE, RMSE, and RMSLE.

C SUPPLEMENTARY EXPERIMENTAL RESULTS

C.1 Effect of 𝜆
We ran the algorithm under different values of λ. It can be observed from Figure 6 that as λ increases, the average number of pathlets needed to reconstruct a trajectory decreases while the size of the dictionary increases; that is, the algorithm prefers a more compact dictionary when λ is smaller.

C.2 Visualization of Hierarchical Pathlets
After obtaining the local pathlet dictionaries, we can generate high-level pathlets from the previous results using the method described in Section 3.3. Figure 7 shows pathlets at different levels, with the three rows corresponding to the three levels. Since higher-level pathlets are generated by concatenating lower-level pathlets, long-distance movement patterns can be mined from the higher-level pathlets.

C.3 Partial Reconstruction
The pathlet dictionaries in the previous evaluations are complete dictionaries that can reconstruct every trajectory. However, if we accept that a small portion of trajectories is not reconstructed, the size of the pathlet dictionary can be reduced significantly. As shown in Figure 8, the uncover ratio (the proportion of edges that cannot be covered using pathlets from the dictionary) decreases rapidly and drops below 5% when only the 50% most frequent pathlets are preserved, meaning the majority of trajectories can still be reconstructed with half of the dictionary. On the other hand, each trajectory needs more pathlets to reconstruct itself when only part of the dictionary is preserved, revealing a trade-off between redundancy and efficiency.

In [7], the authors noted the difficulty of solving the problem on large-scale datasets. To address this challenge, they modified the objective function so that the problem can be solved independently for each trajectory, which significantly reduces the complexity of the solving process. Specifically, they transformed the original problem into a set of per-trajectory subproblems; the primary distinction is the substitution of the max(S_{i,:}) variable with the individual variables S_{i,j}. However, for a specific pathlet p ∈ P, it is quite common that some trajectories passing through p do not use p to represent themselves. Consequently, there is a considerable disparity between the solution obtained by this approach and the optimal one, especially when λ is small, since the dictionary size is then a crucial factor in overall quality.

Figure 1 :
Figure 1: Illustration of pathlet learning: a pathlet dictionary is learned from the dataset, and each trajectory can be represented by concatenating pathlets chosen from this dictionary.

Figure 2 :
Figure 2: Illustration of the hierarchical pathlet representation. Here R_{i,j} refers to the j-th region of the i-th layer.

4.1.1
The Performance Comparison with Previous Work. The proposed method is evaluated on two datasets, collected separately in Shenzhen [8] and Porto [9]. Due to the space limit, details of the experiment setup can be found in Appendix B.

Figure 4 :
Figure 4: The visualization of the real-world datasets and trajectories in L1 grid.

Figure 5 :
Figure 5: Trip time estimation using representation vector.

Figure 8 :
Figure 8: The representation cost and uncover ratio when varying the size of dictionary.
To record the cover relationships among trajectories, edges, and candidate pathlets, we introduce three matrices T, P, and S, respectively. Matrix T has dimensions |𝒯| × |E|, where each element T_{i,j} equals 1 when the i-th trajectory passes through the j-th edge and 0 otherwise. Matrix P, with size |P| × |E|, is constructed in the same way for the relationship between all candidate pathlets and edges. Similarly, matrix S is a |P| × |𝒯| decision matrix, where each entry S_{i,j} = 1 if pathlet p_i is used to represent trajectory t_j, and S_{i,j} = 0 otherwise.

Table 1 :
The performance comparison with previous work.

Table 2 :
The performance using different dictionaries. GPU memory here refers to the GPU memory needed for training rather than the storage of the dictionary.

Table 3 :
The performance comparison with previous work.
Lemma 1. Each max(S̃_{i,:}) is a Bernoulli random variable, i.e., the maximum of the entries in the i-th row of S̃.

Table 5 :
The effect of adopting pre-filtering.