A Data Augmentation Algorithm for Trajectory Data

The growing prevalence of location-based devices has resulted in an abundance of location data from various tracking vendors. Nevertheless, readily accessible, extensive, and publicly available datasets for research purposes remain scarce, primarily due to privacy concerns and ownership constraints. Expansive datasets are needed to advance machine learning techniques in this domain, and their absence currently represents a substantial hindrance to research progress. Data augmentation is emerging as a popular technique to mitigate this issue in several domains. However, applying state-of-the-art techniques as-is proves challenging when dealing with trajectory data due to the intricate spatio-temporal dependencies inherent to such data. In this work, we propose a novel strategy for augmenting trajectory data that applies a geographical perturbation on trajectory points along a trajectory. Such a perturbation results in controlled changes in the raw trajectory and, consequently, causes changes in the trajectory feature space. We test our strategy on three trajectory datasets and show a performance improvement of approximately 20% when contrasted with the baseline. We believe this strategy will pave the way for a more comprehensive framework for trajectory data augmentation that can be used in fields where little labeled trajectory data is available for training machine learning models.


INTRODUCTION
In recent years, driven by the widespread adoption of location-based devices, we have witnessed a burgeoning interest in research concerning the analysis of movement data [10,15,17]. Nonetheless, the abundant availability of such data from tracking product vendors is juxtaposed with the scarcity of publicly accessible real datasets for research purposes. This challenge, compounded by the necessity for extensive labeled databases to train machine learning and deep learning techniques, underscores a pivotal obstacle in the progression of research within this field [4,5,14].
Data augmentation has emerged as a powerful technique in machine learning, strengthening model robustness while mitigating overfitting and underfitting issues by generating diverse synthetic data that can be used to overcome the scarcity of labeled data. Data augmentation is the process of generating synthetic data by applying transformations to the original examples [13]. It not only increases the efficiency of machine learning models but has also been shown to make them more robust [9].
In the literature, we have seen different techniques for data augmentation in different domains, such as image geometric transformations (e.g., data jittering, cropping, flipping, distortion, and rotation) [7], Fourier transformations (e.g., fast Fourier transform and Gaussian noise injection) [8], time-series augmentation (e.g., time warping, slicing, permutation, and interpolation) [8], Generative Adversarial Networks (GANs) [1], and Encoder-Decoder Networks [11]. Despite its success in other domains, data augmentation's potential remains largely untapped in mobility data analysis, primarily due to the intricate nature and unique format of trajectory data. A trajectory is a spatio-temporal data object consisting of temporally and spatially spaced points associated with location information, as detected by location devices.
Data augmentation techniques developed in other domains, like geometric augmentation methods for image processing, cannot be directly applied to trajectories since they would barely impact the features extracted from a trajectory. Similarly, Fourier transformations, predominantly employed in image and wave analysis, find limited use outside these domains and may need a careful redesign to be used in trajectory analysis. A notable parallel lies in the resemblance between time series and trajectory data, given that trajectories also comprise temporally spaced points. This similarity encourages the adaptation of augmentation techniques from the time-series domain to the trajectory analysis domain.
Our proposal is based on the concept of adding noise to data before feature extraction. In the case of trajectory data, simply adding random noise to trajectory point features may create attributes with inconsistent values, such as an increase in the object's speed accompanied by a decrease in its acceleration. To avoid this, our approach introduces a random geographical variation into the original trajectory data. This is accomplished by adjusting the geographical coordinates of selected trajectory points within a predetermined circular region: each selected point is shifted either to a location inside the circle or to a location on its perimeter. Because the noise is bounded by this spatial threshold and the features are extracted afterwards, we aim to prevent such inconsistencies from arising.
In summary, this work's contributions are the following:
• We propose two geographical noise strategies that randomly move trajectory points inside a geographical circle (in-circle) or on the boundary of a geographical circle (on-circle).
• We propose an algorithm that randomly selects trajectory points and applies geographical noise using one of the two techniques (i.e., in-circle or on-circle) to generate consistent augmented trajectories.
• We test and evaluate our proposed method on three datasets, showing that the technique creates augmented trajectories that can improve the performance of machine learning methods.

METHODOLOGY
The objective of trajectory data augmentation is to enhance a given set of trajectories by introducing geographical noise, thereby generating new synthetic trajectories that offer increased diversity for data mining methods. We augment the data by moving trajectory points within a limited geographical neighborhood, thus creating consistent data. The proposed methodology is based on choosing a circular spatial buffer around each selected trajectory point and then moving the point according to one of two strategies: (1) inside the circle or (2) on its border, as detailed below.

Geographical Noise Strategies
Our first strategy to apply geographical noise to trajectory points is to create a geographical circle around each selected point and randomly move the point to a location inside this geographical boundary (i.e., in-circle geographical noise). An example of this technique is shown in Figure 1, where two trajectory points are selected and moved inside their respective circles (red dots).
Our second strategy to apply geographical noise to trajectory points is to create a geographical circle around each selected point and randomly move the point to a location on the border of this circle (i.e., on-circle geographical noise).
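The two strategies can be illustrated with a minimal sketch. It is not the implementation used in our experiments: the function name `perturb_point`, the radius expressed in metres (rather than as a radius percentage), the spherical Earth radius, and the local flat-Earth conversion from metres to degrees are all assumptions made for illustration.

```python
import math
import random

EARTH_RADIUS_M = 6_371_000.0  # assumption: spherical Earth approximation


def perturb_point(lat, lon, radius_m, strategy="in-circle", rng=random):
    """Move (lat, lon) to a random location inside or on a circle of radius_m metres."""
    if strategy == "in-circle":
        # The square root makes the samples uniform over the disc's area
        # instead of concentrating them near the centre.
        dist = radius_m * math.sqrt(rng.random())
    elif strategy == "on-circle":
        dist = radius_m  # always exactly on the perimeter
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    bearing = rng.uniform(0.0, 2.0 * math.pi)
    # Local flat-Earth approximation: reasonable for small radii away from the poles.
    dlat = dist * math.cos(bearing) / EARTH_RADIUS_M
    dlon = dist * math.sin(bearing) / (EARTH_RADIUS_M * math.cos(math.radians(lat)))
    return lat + math.degrees(dlat), lon + math.degrees(dlon)
```

Only the coordinates change; the timestamp of the point is left untouched, so any feature computed afterwards from positions and times stays internally consistent.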

Trajectory Data Augmentation Algorithm
This section details Algorithm 1, which performs the data augmentation on raw trajectory data. The algorithm takes as input several parameters: (i) $p_{aug}$, representing the percentage of trajectories to be augmented; (ii) $n_{gen}$, indicating the number of trajectories to be generated for each selected trajectory; (iii) $p_{noise}$, which specifies the portion of trajectory points to undergo geographical noise; (iv) a chosen geographical noise strategy, offering two distinct options: in-circle or on-circle; and (v) $r_{circle}$, denoting the radius percentage to be used in the noise strategy.
Upon execution, the algorithm proceeds as follows. We initialize an empty set, $aug$, which will be used to store the augmented trajectories (line 1). We select a subset of trajectories (line 2) from the input dataset to be augmented, with the subset size determined by the input parameter $p_{aug}$. For each trajectory $t$ within the chosen subset, we conduct $n_{gen}$ iterations to create the specified number of new trajectories based on the selected one (lines 3 to 11). We first duplicate the currently selected trajectory $t$, storing it as $t_{cpy}$ (line 5). Next, we randomly choose a percentage of trajectory points from $t_{cpy}$ according to the parameter $p_{noise}$ (line 6). Then, we apply the designated geographical noise strategy (in-circle or on-circle) to the selected points, generating a set of modified trajectory points $new_{pts}$ (line 7). We then replace the trajectory points in $t_{cpy}$ with the modified points $new_{pts}$ (line 8). Finally, we add the adapted trajectory $t_{cpy}$ to the set of augmented trajectories, $aug$ (line 9). After processing all trajectories in the selected subset, the algorithm concludes by returning the set of augmented trajectories, $aug$ (line 12).
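The loop described above can be sketched as follows. This is a hedged illustration rather than a transcription of Algorithm 1: trajectories are assumed to be Python lists of `(x, y, timestamp)` tuples, the noise strategy is passed in as a callable `noise_fn`, the circle radius is bound inside that callable instead of being a separate radius-percentage input, and `planar_on_circle` is a hypothetical planar stand-in for the geographical noise.

```python
import copy
import math
import random


def augment_trajectories(trajectories, p_aug, n_gen, p_noise, noise_fn, seed=None):
    """Return the list of newly generated trajectories (originals are not modified)."""
    rng = random.Random(seed)
    aug = []                                                # augmented trajectories
    n_selected = round(p_aug * len(trajectories))
    selected = rng.sample(trajectories, n_selected)         # subset to augment
    for t in selected:
        for _ in range(n_gen):                              # n_gen copies per trajectory
            t_cpy = copy.deepcopy(t)                        # duplicate the trajectory
            n_points = round(p_noise * len(t_cpy))
            chosen = rng.sample(range(len(t_cpy)), n_points)  # points to perturb
            for i in chosen:
                t_cpy[i] = noise_fn(t_cpy[i], rng)          # replace with the noisy point
            aug.append(t_cpy)
    return aug


def planar_on_circle(point, rng, radius=1.0):
    """Hypothetical stand-in noise: move a planar point onto a circle around itself."""
    x, y, ts = point
    angle = rng.uniform(0.0, 2.0 * math.pi)
    return (x + radius * math.cos(angle), y + radius * math.sin(angle), ts)
```

With 10 input trajectories, `p_aug=0.3`, and `n_gen=20`, the sketch selects 3 trajectories and returns 60 augmented ones, each with the same number of points and the same timestamps as its source.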

Datasets and Feature Extraction
In our experiments, we utilized three distinct datasets to assess the performance of our proposed methods. The first dataset, named Geolife, is a subset of the comprehensive Geolife dataset [18], comprising 36 trajectories for a total of 355,181 trajectory points. These trajectories encompass various transportation modes, including airplanes, boats, subways, and taxis. The second dataset, referred to as Traffic¹, consists of 125 trajectories, encompassing 44,905 trajectory points, specifically involving large and standard vehicles. The third dataset, called Birds, comprises 58 trajectories, incorporating a total of 528,488 trajectory points attributed to geese [6], gulls [16], and vultures [12].
In our experimental setup, we examined the combined impact of the geographical changes introduced by our proposed algorithm on the performance of trajectory classification models, as detailed in Section 3.3.
In machine learning on spatio-temporal data, a segment-based representation is usually preferred, in which each tuple in the dataset represents an entire trajectory or a subset of a trajectory and contains its statistical description. We calculate kinematic trajectory features from the entire trajectory, including metrics such as average speed and percentiles of direction variation, using PTRAIL [2,3]. In total, PTRAIL extracts 72 trajectory features from the trajectory itself. It is important to note that these trajectory features are utilized in both facets of our experiments: as attributes to calculate Euclidean distances and as the feature vector for training the machine learning models. Due to limited space, details about these 72 features can be found in the original documentation of PTRAIL.
¹ https://zen-traffic-data.net/english/outline/dataset.html
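To make the segment-based representation concrete, the sketch below derives a few speed statistics for a whole trajectory. It does not use PTRAIL and does not reproduce its 72 features; the function names, the `(lat, lon, t_seconds)` point layout, and the haversine distance on a spherical Earth are assumptions chosen for illustration.

```python
import math
import statistics

EARTH_RADIUS_M = 6_371_000.0  # assumption: spherical Earth approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) pairs given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def segment_speeds(points):
    """Speeds (m/s) between consecutive (lat, lon, t_seconds) points sorted by time."""
    speeds = []
    for (lat1, lon1, t1), (lat2, lon2, t2) in zip(points, points[1:]):
        dt = t2 - t1
        if dt > 0:  # skip repeated timestamps to avoid division by zero
            speeds.append(haversine_m(lat1, lon1, lat2, lon2) / dt)
    return speeds


def speed_summary(points):
    """One row of a segment-based dataset: statistics describing the whole trajectory."""
    speeds = segment_speeds(points)
    return {
        "mean_speed": statistics.mean(speeds),
        "median_speed": statistics.median(speeds),
        "max_speed": max(speeds),
    }
```

Because speed is recomputed from positions and timestamps, moving a single point changes the speeds of both adjacent segments together, which is the kind of consistent change in the feature space that perturbing raw coordinates is meant to produce.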

Feature Vector Average Distance Analysis
To evaluate our geographical noise strategies, we calculate the average Euclidean distance between the features of the original trajectories selected for modification and the features of the new trajectories produced by our geographical noise strategies. To gain initial insight into the effectiveness of these strategies, we randomly chose 30% of the trajectories for testing. We ensure that there are no significant overlaps between the trajectory points in both experiments by using a 50% radius for the geographical circle (denoted by $r_{circle}$ in Algorithm 1). Subsequently, we generate 20 augmented trajectories ($n_{gen}$ in Algorithm 1) for each randomly selected trajectory, aiming at evaluating the average impact on the feature vector of each of our proposed strategies. We modify 20%, 40%, and 60% of trajectory points with the geographical noise strategies ($p_{noise}$ in Algorithm 1) to analyze changes in the feature vector. We repeat this experiment 20 times using the decimals of π taken four by four (e.g., 1415, 9265, etc.) as seeds to ensure variability in the random selection of trajectories and in the geographical noise application. The results are reported in Table 1.
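The seeding scheme and the distance measure can be sketched as below. The helper names are hypothetical, and several details are assumptions: seeds with a leading zero are read as plain integers (e.g., `0781` becomes 781), the feature vectors are compared without any rescaling, and the population standard deviation is used.

```python
import math
import statistics

# First 80 decimals of pi, enough for 20 four-digit seeds.
PI_DECIMALS = (
    "1415926535" "8979323846" "2643383279" "5028841971"
    "6939937510" "5820974944" "5923078164" "0628620899"
)


def pi_seeds(n_seeds=20, width=4):
    """Seeds read off the decimals of pi in consecutive groups of `width` digits."""
    if n_seeds * width > len(PI_DECIMALS):
        raise ValueError("not enough stored decimals of pi")
    return [int(PI_DECIMALS[i * width:(i + 1) * width]) for i in range(n_seeds)]


def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_summary(pairs):
    """Mean and standard deviation over (original, augmented) feature-vector pairs."""
    distances = [euclidean(orig, aug) for orig, aug in pairs]
    return statistics.mean(distances), statistics.pstdev(distances)
```

Here `pi_seeds()` starts with 1415 and 9265, matching the examples given above, and `distance_summary` would be called once per combination of dataset, strategy, and noise percentage.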
Several conclusions and insights can be extracted from the results on the average Euclidean distances and standard deviations between original data points and points with injected noise. The average Euclidean distances generally tend to increase as the percentage of trajectory points selected is increased from 20% to 60%, suggesting that adding noise to more trajectory points results in a greater distance (i.e., larger differences in the trajectory features) from the original trajectory. The choice of injection method (on-circle vs. in-circle) also has a noticeable impact on the average distances. In many cases, the on-circle method tends to result in larger distances than the in-circle method, indicating that injecting noise on the outer boundary of the circle around the original data points results in more significant perturbations. One may choose the on-circle or in-circle strategy depending on the specific application and goals. The on-circle method may be suitable when more spread-out noise, with a potentially significant impact on the trajectory features, is desired. In contrast, the in-circle approach may be preferred when noise closer to the original data points is desired. It is also noticeable that each dataset exhibits its own characteristics regarding how noise affects the average distances. For instance, the Geolife dataset generally shows larger distances than the other datasets, indicating that noise significantly impacts this dataset.
The Geolife dataset, particularly with the on-circle method, shows higher standard deviations at 20% noise, indicating more variability in how noise affects this dataset at that specific noise level and injection method. Standard deviations for both the on-circle and in-circle methods are relatively consistent across different noise levels for the Traffic dataset. The highest standard deviation in this dataset is 0.4933 for the on-circle method at 20% noise. The Birds dataset generally exhibits lower standard deviations than the Geolife and Traffic datasets. This suggests that noise has a more consistent impact on the Birds dataset. This dataset's standard deviations are relatively low, indicating that the distances between original and noisy data points are relatively consistent. Even at 60% noise, the standard deviations remain relatively low (0.2233 for on-circle and 0.2477 for in-circle).

Classification Performance Analysis
This experiment evaluates the impact of introducing augmented trajectories in a trajectory classification problem. We establish a baseline for the classification performance by splitting the original dataset into 80% for training and 20% for testing, repeating this split 20 times using the decimals of π taken four by four as seeds, and averaging the resulting performances. Next, we test the impact of introducing augmented trajectories into the training data. From the training set, we randomly select 30% of the trajectories (designated as $p_{aug}$ in Algorithm 1) to apply our proposed strategies. We test our algorithm with a 50% radius ($r_{circle}$ in Algorithm 1) for the geographical circle in both experiments to ensure we do not have significant overlaps between trajectory points. We generate 20 augmented trajectories ($n_{gen}$ in Algorithm 1) for each selected trajectory. We again test the values of 20%, 40%, and 60% of trajectory points to be modified ($p_{noise}$ in Algorithm 1) by our geographical noise strategies. To evaluate the impact of the newly created augmented trajectories, we compare the F-scores of supervised machine learning models trained with and without augmented trajectories in the training set. We use three supervised machine learning models named Random Forest, Extra Trees, and XGBoost. The results of the experiments are detailed in Table 2.
Using the ExtraTreesClassifier on the Geolife dataset, in most cases, F-scores are below the baseline, indicating that the augmentation techniques (on-circle and in-circle) generally result in no improvement over the baseline data with no augmentation. The only exception is the on-circle strategy applied to 20% of the points, leading to a marginal gain of 0.29%. For the GradientBoostingClassifier, the F-scores are more consistently above the baseline, with gains ranging from 2.27% to 4.72%, suggesting that the augmentation techniques improve performance for this model. Similarly, the F-scores for the RandomForestClassifier are consistently above the baseline, with gains ranging from 1.33% to 1.67%, indicating that the strategies generally improve model performance on the Geolife dataset.
The F-scores are all below the baseline for the ExtraTreesClassifier on the Traffic dataset, indicating that the modification techniques lead to no improvement for this model on this dataset. The F-scores for the GradientBoostingClassifier show improvements for all percentages using the on-circle strategy, ranging from 1.59% to 2.86%. For the RandomForestClassifier, the F-scores are consistently above the baseline, suggesting that the techniques result in improvements for this model on the Traffic dataset, with a maximal gain of 2.58%. Therefore, for the Traffic dataset, the GradientBoostingClassifier with the on-circle method and the RandomForestClassifier with any strategy consistently benefit from the modification techniques, while the GradientBoostingClassifier shows mixed results across strategies.
In the Birds dataset, the ExtraTreesClassifier shows F-scores substantially above the baseline in all cases, with gains ranging from 7.85% to 19.61%. The GradientBoostingClassifier produced F-scores below the baseline, suggesting that the modification techniques do not improve performance for this model on the Birds dataset. The F-scores are mostly above the baseline for the RandomForestClassifier, indicating improvements, except when 60% of trajectory points are modified by the in-circle strategy. For the Birds dataset, the ExtraTreesClassifier and RandomForestClassifier show improvements with the augmentation techniques, while the GradientBoostingClassifier struggles to benefit from them.
In summary, the analysis in relation to the baseline suggests that the impact of the data augmentation techniques varies depending on the dataset and the machine learning model, and that our proposed algorithm can achieve performance improvements. For the Geolife and Traffic datasets, the techniques often lead to improvements when using the GradientBoostingClassifier and RandomForestClassifier. For the Birds dataset, the ExtraTreesClassifier and RandomForestClassifier benefited the most from the techniques, while the GradientBoostingClassifier showed no improvement.

CONCLUSIONS AND FUTURE WORKS
In this study, we introduced a new algorithm for enhancing trajectory data by incorporating geographical noise, employing two distinct strategies: in-circle and on-circle. Our fundamental assertion is that the augmentation of trajectory data should precede feature extraction in trajectory classification tasks, since applying random noise directly to the trajectory features may lead to data inconsistencies. Hence, our algorithm is devised to randomly select trajectory data points and introduce geographical noise to these points, thereby generating augmented trajectories. Feature extraction is then conducted on the augmented trajectories, ensuring data consistency and reliability. In our experiments, we evaluated how the incorporation of augmented trajectories affects the performance of machine learning algorithms in trajectory classification tasks. We observe that introducing augmented trajectories via our algorithm can positively impact the performance of machine learning models in many instances, thus paving the way for more robust strategies in trajectory data augmentation.
As future work, we envision this algorithm as the cornerstone for a comprehensive and robust framework for augmenting trajectory datasets. While our initial results have demonstrated promising outcomes even with a completely random selection of trajectories and trajectory points for noise injection, we recognize the importance of systematically controlling this noise to enhance its applicability in classification problems. Furthermore, we believe that developing new techniques for ranking trajectories and trajectory points presents an exciting possibility for augmenting trajectories.

Figure 1: In-Circle Geographical Noise Example

Table 1: Average Euclidean distances for different percentages of trajectory points selected to be geographically modified by our strategies. In the table, bold values indicate the smallest value in each column. For this experiment, the lower the values, the better.

Table 2: Baseline and average F-scores for different percentages of trajectory points selected to be geographically modified by our strategies.