Reconsidering utility: unveiling the limitations of synthetic mobility data generation algorithms in real-life scenarios

In recent years, there has been a surge in the development of models for the generation of synthetic mobility data. These models aim to facilitate the sharing of data while safeguarding privacy, all while ensuring high utility and flexibility regarding potential applications. However, current utility evaluation methods fail to fully account for real-life requirements. We evaluate the utility of five state-of-the-art synthesis approaches, each with and without the incorporation of differential privacy (DP) guarantees, in terms of real-world applicability. Specifically, we focus on so-called trip data that encode fine granular urban movements such as GPS-tracked taxi rides. Such data prove particularly valuable for downstream tasks at the road network level. Thus, our initial step involves appropriately map matching the synthetic data and subsequently comparing the resulting trips with those generated by the routing algorithm implemented in OpenStreetMap, which serves as an efficient and privacy-friendly baseline. Out of the five evaluated models, one fails to produce data within reasonable computation time and another generates too many jumps to meet the requirements for map matching. The remaining three models succeed to a certain degree in maintaining spatial distribution, one even with DP guarantees. However, all models struggle to produce meaningful sequences of geo-locations with reasonable trip lengths and to model traffic flow at intersections accurately. It is important to note that trip data encompasses various relevant characteristics beyond spatial distribution, such as temporal information, all of which are discarded by these models. Consequently, our results imply that current synthesis models fall short in their promise of high utility and flexibility.


INTRODUCTION
The field of synthetic mobility data generation has experienced rapid growth, primarily fueled by privacy concerns that hinder the release of sensitive personal movement data. In essence, the respective algorithms learn statistical distributions from raw data and employ them to generate synthetic data that emulate a similar distribution, while (supposedly) safeguarding sensitive personal information. Within the domain of mobility data, a variety of algorithmic approaches exist that can be categorized by the specific type of mobility they aim to model. A type of mobility frequently considered in the literature is the 'trip', which consists of a fine-granular route connecting an origin and a destination, such as a taxi ride or a GPS-tracked bicycle tour. This paper focuses on such trips, in contrast to mobility data that comprise, per person, a sequence of stay locations over a longer period of time, such as check-ins at restaurants or other points of interest.
One primary objective of trip data generation algorithms is to produce 'plausible' trips. The respective evaluations typically involve comparing aggregate statistics between raw and synthetic data, such as spatial distributions, where space is commonly discretized using a grid of a selected resolution. This operationalization often falls short of real-life requirements: depending on the underlying (coarse) grid, the resulting synthetic trips usually do not follow the road network and instead jump over buildings and rivers in an unrealistic manner [21].
However, the benefit of trip data, unlike other types of mobility data, is its fine granularity, which allows it to be mapped to a road network. This enables analyses such as calculating average speeds or traffic volumes per road segment, which are valuable for applications like routing and urban planning. Furthermore, an accurate representation of traffic flow at intersections facilitates optimized traffic light management [23]. Bicycle trips, collected by smartphone apps like SimRa [13], enable the identification of road traffic hazard zones.
Therefore, we evaluate state-of-the-art synthesis algorithms by initially matching their generated trips to the road network. Subsequently, we compare the matched trips with those generated by the routing engine in OpenStreetMap, which is a simple, efficient, and privacy-sensitive way to produce road-level movement data based on origin-destination (OD) pairs. We argue that synthetic data generation algorithms need to provide a higher utility than common routing engines to justifiably provide added value.
In summary, we address the following research questions:
• (RQ1) What constitutes high utility of trip data and how can it be measured?
• (RQ2) What level of utility do state-of-the-art synthesis models reach on these measures in comparison to a routing baseline?
• (RQ3) Can a satisfactory utility level still be provided given (differential) privacy guarantees?
The remainder of this paper is structured as follows: We commence by presenting an overview of the five evaluated state-of-the-art synthesis algorithms. In Section 3, we address RQ1 and propose a formulation of utility for trip data and suitable approaches for its measurement. Section 4 focuses on the experimental setup, while the results are presented in Section 5. Finally, we conclude by discussing our findings and possible future research.

SYNTHESIS ALGORITHMS
In recent years, a series of models for generating synthetic mobility data have emerged, aiming to facilitate the sharing of fine-granular datasets in a privacy-preserving manner. These algorithms learn statistical distributions from a raw dataset and generate a synthetic counterpart based on the learned distributions. Nonetheless, without additional privacy measures, there is no guarantee that the synthesis algorithm does not inadvertently reproduce real trips or reveal sensitive information through the preserved statistical distributions [17]. To address this concern, many models incorporate additional privacy mechanisms, often in the form of Differential Privacy (DP) guarantees.
DP provides mathematical guarantees for preserving individual privacy [7]. The core idea behind DP is that the output of an algorithm remains largely unaffected if the records of a single individual are either removed or added. This mechanism effectively limits the influence of a single individual on the overall analysis outcome, thus preventing the reconstruction of their specific data. A typical approach to achieving DP for numeric functions consists of adding calibrated noise drawn from a Laplace distribution to the function's output. In the context of synthesis algorithms, noise is commonly added to the underlying distributions, such as that of origins or OD pairs, before synthetic samples are generated.
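To make the Laplace mechanism concrete, the following minimal sketch perturbs a histogram of origin counts and renormalizes it into a sampling distribution. It is an illustration under stated assumptions, not the implementation of any evaluated model: the cell names, the helper functions, and the sensitivity of 1 (each individual contributes to exactly one count) are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def noisy_counts(counts, epsilon, sensitivity=1.0, seed=None):
    """Add Laplace noise with scale sensitivity / epsilon to every count."""
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    return {key: value + laplace_noise(scale, rng) for key, value in counts.items()}

def to_distribution(noisy):
    """Clip negative counts to zero and normalize to a probability distribution."""
    clipped = {key: max(value, 0.0) for key, value in noisy.items()}
    total = sum(clipped.values())
    if total == 0:
        # Fall back to a uniform distribution if all counts were clipped away.
        return {key: 1.0 / len(clipped) for key in clipped}
    return {key: value / total for key, value in clipped.items()}

# Hypothetical origin counts per grid cell (assumption for illustration).
origin_counts = {"cell_a": 120, "cell_b": 30, "cell_c": 2}
private = to_distribution(noisy_counts(origin_counts, epsilon=2.0, seed=0))
```

Sparse cells such as `cell_c` are affected most: the noise scale is independent of the count, so a finer grid with many low-count cells loses relatively more signal for the same privacy budget.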
Amongst the published synthesis algorithms, we have carefully chosen five models for our evaluation to ensure a diverse representation of modeling techniques. Our selection process took into consideration various factors, especially the clarity of reasoning behind their approaches, promising results demonstrated in the respective evaluations, and the availability of source code, either as open source or obtainable upon request. The selection process resulted in the following models: AdaTrace [9], PrivTrace [22], BiLSTM [4], DP-Loc [14], and TrajGAIL [6].
AdaTrace [9] is a frequently cited and benchmarked differentially private model. Coordinates are discretized using a grid of uniform cells. Synthetic trips are generated in three steps: First, OD pairs are sampled from a differentially private OD distribution. Second, the number of points per sequence is sampled from a differentially private distribution of the respective OD pair. Third, a sequence between a start and an endpoint is constructed by repeatedly sampling the next location until the sequence length is reached. The next-location sampling is based on a Markov model which holds DP transition probabilities for each location, i.e., grid cell, to each other location, given the previous location and the destination. The exact coordinates of the synthetic locations are sampled uniformly at random from within the grid cell.
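The three generation steps can be sketched as follows. This is a simplified illustration rather than AdaTrace's actual code: the lookup tables are toy data, no DP noise is applied, and forcing the final point into the destination cell is an assumption made here to keep the example short.

```python
import random

def sample_weighted(dist, rng):
    """Sample one key from a {key: probability} dictionary."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys], k=1)[0]

def cell_to_point(cell, cell_size, rng):
    """Draw a point uniformly at random from within grid cell (col, row)."""
    col, row = cell
    return ((col + rng.random()) * cell_size, (row + rng.random()) * cell_size)

def generate_trip(od_dist, length_dist, transitions, cell_size, rng):
    # Step 1: sample an origin-destination pair.
    origin, destination = sample_weighted(od_dist, rng)
    # Step 2: sample the number of points for this OD pair.
    n_points = sample_weighted(length_dist[(origin, destination)], rng)
    # Step 3: Markov walk conditioned on the current cell and the destination.
    cells = [origin]
    while len(cells) < n_points - 1:
        cells.append(sample_weighted(transitions[(cells[-1], destination)], rng))
    if n_points > 1:
        cells.append(destination)  # assumption: the trip ends in the destination cell
    # Final step: uniform coordinates within each visited cell.
    return [cell_to_point(cell, cell_size, rng) for cell in cells]

# Toy tables (hypothetical): one OD pair, fixed length 3, one possible transition.
od_dist = {((0, 0), (2, 0)): 1.0}
length_dist = {((0, 0), (2, 0)): {3: 1.0}}
transitions = {((0, 0), (2, 0)): {(1, 0): 1.0}}
trip = generate_trip(od_dist, length_dist, transitions, cell_size=500.0,
                     rng=random.Random(0))
```

Because every step moves between whole cells, consecutive points lie roughly one cell width apart, which is why the grid resolution bounds the spacing of the generated sequence.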
To ensure a more accurate spatial distribution while maintaining a high level of privacy, the authors make use of a 'density-aware grid'. When a grid cell contains a small number of records, it is retained at a coarse top-level resolution. When the number of records in a grid cell is large, on the other hand, the cell is split into a finer-grained resolution. Note that the Markov model relies on the coarse top-level grid, with the density-aware grid coming into play only during the final step of sampling coordinates. This has implications for the resulting synthetic data: despite the coarse top-level grid, the aggregate spatial distribution is well maintained because the density-aware grid preserves hotspots (see Appendix B of [9]). However, the distance between two consecutive points in the generated data is determined by the resolution of the coarse top-level grid. Compatibility with DP is an essential aspect of the model's architecture, and the implementation therefore does not foresee the possibility of training the model without DP.
The recently proposed model PrivTrace [22] aims to improve upon AdaTrace, which supposedly lacks sufficient transition information due to its first-order Markov chain model. Additionally, it addresses the limitations of DPT [10], where the high-order Markov chain model introduces excessive noise due to the incorporation of DP. PrivTrace follows a similar approach to AdaTrace but with some notable distinctions. In PrivTrace, synthetic sequences are generated using a first-order and a second-order Markov chain. Sequence sampling is stopped based on 'virtual ends' in the Markov model instead of a sampled sequence length and a drawn destination. Additionally, PrivTrace incorporates transition probabilities of more fine-granular grid cells into the Markov model, claiming to thereby provide better transition information. Thus, unlike AdaTrace, the distance between consecutive points is not determined by the top-level grid resolution. Again, this model always provides DP guarantees.
Lestyán et al. [14] propose DP-Loc, a differentially private model that initially performs a dimensionality reduction by only considering cells that have been visited more often than a certain threshold. Then, similar to AdaTrace, DP-Loc first generates OD pairs and then constructs a trip based on transition probabilities. For the OD generation, a variational autoencoder is used, while transition probabilities are captured using a feedforward neural network (FNN). The authors argue that the basic FNN is adequate, as the information of the current location alone is sufficient to predict the next location, unlike, for example, models that specifically aim to model sequences. To guarantee DP, Gaussian noise is added to the distribution of visit counts per cell [8], and both neural networks are trained using Differentially Private Stochastic Gradient Descent (DP-SGD) [1], which adds noise to clipped gradients during training. Notably, unlike the two other models, DP-Loc includes the generation of start times of trips.
Blanco-Justicia et al. [4] propose the utilization of a BiLSTM, a bidirectional long short-term memory network, which is a recurrent architecture that has shown superior performance in modeling sequences such as time series or natural language. Considering a visited location as a word and a trip as a sentence, the principles of autoregressive text generation using RNNs can be adopted to generate trips. For the task of trip generation, a start location is selected at random. Using the BiLSTM, the location is extended in an autoregressive manner to generate subsequent locations until the sequence length sampled from the respective distribution is reached. The authors propose a privacy mechanism that samples one of the top three next-location predictions uniformly at random from the probability distribution of the next locations. This approach comes with two downsides. First, privacy is not formally guaranteed. Second, utility is not well maintained, as the results in the original paper indicate major jumps between consecutive points. We thus follow the authors' advice for future work and apply a DP mechanism to the model using DP-SGD, more specifically, the TensorFlow implementation of a DP Adam optimizer.
TrajGAIL [6] is based on inverse reinforcement learning, which represents an agent that moves around according to a set of actions and a learned policy. The original paper considers a chessboard-like road network, for which only the actions 'forward', 'left', and 'right' are included. Generally, the set of actions needs to be specified, and for each intersection, all options need to be defined. In Section 4.2, we elaborate on our adaptations to allow for more heterogeneous road networks. The model does not provide any privacy guarantees or mechanisms. However, we still include the model in our evaluation to elaborate on its suitability and the potential for further DP advancements.

UTILITY MEASUREMENT
Defining and evaluating the utility of synthetic mobility data is a complex task. Unlike some other domains such as medical data, where typical applications like diagnosis prediction can be framed as classification tasks, typical mobility data applications do not lend themselves easily to such formulations. As a result, utility is often assessed based on the degree of statistical similarity between the synthetic and raw data for different mobility characteristics. A mobility characteristic, like the spatial distribution, is operationalized by concrete measures, like the Jensen-Shannon divergence (JSD) of visits per grid cell. Accordingly, the evaluations of the considered synthesis models conducted so far are based on similarity measures concerned with all or some of the following characteristics: spatial distribution, OD pairs, frequent patterns of sequences, and trip lengths. However, the measured level of utility is highly dependent on the selected characteristics and their operationalization. For example, a high similarity of the spatial distribution does not necessarily ensure that the generated trips consist of reasonable OD combinations. Furthermore, achieving high similarity based on, e.g., the JSD given a coarse grid with large cells (e.g., 2 km x 2 km) provides only limited insight into the preservation of spatial distribution at a more detailed level.
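The dependence of the JSD on grid resolution can be illustrated with a small sketch. It assumes points given in projected metric coordinates and uses the base-2 logarithm, so that the divergence ranges from 0 (identical) to 1 (disjoint); both choices, as well as the function names, are assumptions for illustration and not necessarily those of the cited evaluations.

```python
import math
from collections import Counter

def grid_counts(points, cell_size):
    """Count visits per grid cell for points given as (x, y) in metres."""
    return Counter((int(x // cell_size), int(y // cell_size)) for x, y in points)

def jsd(counts_p, counts_q):
    """Jensen-Shannon divergence (base 2) between two visit-count histograms."""
    total_p = sum(counts_p.values())
    total_q = sum(counts_q.values())
    divergence = 0.0
    for cell in set(counts_p) | set(counts_q):
        p = counts_p.get(cell, 0) / total_p
        q = counts_q.get(cell, 0) / total_q
        m = 0.5 * (p + q)
        if p > 0:
            divergence += 0.5 * p * math.log2(p / m)
        if q > 0:
            divergence += 0.5 * q * math.log2(q / m)
    return divergence

# Two toy datasets that are about 1.4 km apart.
raw_points = [(10.0, 10.0)]
synthetic_points = [(1000.0, 1000.0)]

coarse = jsd(grid_counts(raw_points, 2000.0), grid_counts(synthetic_points, 2000.0))
fine = jsd(grid_counts(raw_points, 40.0), grid_counts(synthetic_points, 40.0))
```

On the 2 km grid both points fall into the same cell and the datasets appear identical, while on the 40 m grid they are maximally dissimilar.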
Moreover, assessing utility based solely on these measures may not provide a comprehensive understanding of how well the synthetic data performs in real-life scenarios, like the examples given in the introduction. To overcome this limitation, we propose adopting a practitioner's perspective for specifying high utility of synthetic trip data and identifying appropriate measurements. By doing so, we aim to capture the utility of synthetic data more effectively in real-world applications.
Mobility data often encompasses more than just geo-locations, including temporal information, mobility modes, and user demographics. These additional aspects contribute to a broader range of characteristics that practitioners may find essential [11]. However, none of the five evaluated models considers user-related information, and only one model generates timestamps. Consequently, only characteristics addressing solely geo-locations can be meaningfully evaluated, reducing the set of characteristics to the distribution of OD counts, trip lengths, and the spatial distribution of records.
Since the distribution of OD counts does not depend on the specifics of trip data, we decided to exclude the respective evaluations and focus on trip lengths and spatial distributions, as these better capture the inherent complexity of fine-granular sequences resembling street-level movements.
Map matching. Tasks based on trip data commonly operate at street-level granularity, such as assessing traffic volumes on roads or occupancy information of public transit lines. However, all models under evaluation are based on grids that do not consider road networks and thus produce implausible trips that, e.g., cross buildings or rivers. Thus, we introduce an additional processing step of map matching such that all synthetic trips are 'snapped' onto the closest road, and consecutive records are connected according to the laws of physics and road regulations. We refer to data as produced by the model as original and to its matched variation as matched. In Section 5.1, we evaluate the suitability of the considered synthetic datasets for map matching.
Routing as baseline. Furthermore, for the respective comparisons, we introduce routing as a baseline. Routing engines such as Google Maps, the Open Source Routing Machine (OSRM), or OpenTripPlanner use road and public transit network data to generate routes, typically optimized for the shortest or fastest paths. Thus, these engines offer an efficient and privacy-preserving approach to create realistic routes between a start and an endpoint. However, it is important to note that routing does not reflect real-life route choices comprehensively, as it cannot consider all factors relevant to individuals. For example, a cyclist might prefer a route through a park, even if it takes a bit longer, in order to avoid cobblestone streets or streets with too much car traffic. This highlights the value of actual user data for the work of practitioners [5,16,19]. We argue that synthetic data generation algorithms need to outperform routing engines to provide justifiable additional value. A routed variation of a dataset is created by routing the identical start and destination location of each included trip.
Trip lengths. Existing assessments of trip lengths generally indicate that generated synthetic data reaches a satisfactory level of similarity to the raw data. However, these evaluations only consider implausible unmatched trips, so the trip length says little about the distance actually traveled if one were to follow the route. We argue that utility concerning trip length can only be considered high when taking into account the trip length of the route matched to the road network. To identify synthetic trips that are implausible due to unreasonable loops and unnecessarily winding paths, we suggest conducting a secondary evaluation of trip length which is based on comparing the synthetic trips to the straight-line (SL) distance between their origin and destination.
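A minimal sketch of the two trip-length measures, the traveled length and its ratio to the SL distance, is given below. It assumes trips as lists of (latitude, longitude) pairs and uses the haversine formula with a mean Earth radius; the function names and the toy coordinates are illustrative assumptions.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation

def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1 = math.radians(p[0]), math.radians(p[1])
    lat2, lon2 = math.radians(q[0]), math.radians(q[1])
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def path_length_m(trip):
    """Length of a trip as the sum of distances between consecutive points."""
    return sum(haversine_m(a, b) for a, b in zip(trip, trip[1:]))

def sl_ratio(trip):
    """Ratio of traveled length to straight-line OD distance (inf for round trips)."""
    straight = haversine_m(trip[0], trip[-1])
    if straight == 0:
        return math.inf
    return path_length_m(trip) / straight

# Toy trips in Berlin-like coordinates (hypothetical values).
direct_trip = [(52.50, 13.30), (52.50, 13.40)]
detour_trip = [(52.50, 13.30), (52.52, 13.35), (52.50, 13.40)]
direct_ratio = sl_ratio(direct_trip)
detour_ratio = sl_ratio(detour_trip)
```

A ratio of 1 corresponds to a perfectly straight trip; values far above those of the raw data point to loops or winding paths.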
Spatial distribution. To capture the spatial distribution of trips on a street level, a sufficiently fine-grained grid is required for evaluations. As a rough guideline, we consider grid cells with dimensions of 1 km x 1 km to capture a neighborhood, while cells of 40 m x 40 m provide details on the level of roads. As a proxy for traffic volumes per road segment, we suggest evaluating the spatially aggregated trips based on a grid with a street-level resolution (of 40 m) by computing the Jensen-Shannon divergence (JSD) between the matched raw dataset and the variants of synthetic datasets, where each cell is considered as a 'segment'.
It is difficult to interpret the values of the JSD in terms of practical utility. We thus additionally propose the downstream task of identifying avoided and preferred roads. For each trip, the matched and routed versions are compared; see Fig. 1 for a visual example. Each road segment $s$ that is traversed by the matched but not the routed trip is considered 'preferred' and the counter $n^{\mathrm{pref}}_s$ is increased by 1. Conversely, each segment that is traversed by the routed but not the matched trip is considered 'avoided' and its counter $n^{\mathrm{avoid}}_s$ is increased by 1. We define the preference score of $s$ as

$$\mathrm{pref}(s) = \frac{n^{\mathrm{pref}}_s - n^{\mathrm{avoid}}_s}{n_s},$$

where $n_s$ equals the total number of times that $s$ has been passed by a matched or a routed trip. Preference scores can be assessed statistically, e.g., using the Pearson correlation coefficient between scores from raw and synthetic data. Secondly, the correct classification (preferred > 0, avoided < 0, neither = 0) can be analyzed in terms of standard classification metrics such as accuracy and F1-scores. To approach the downstream task from a practical perspective, we suggest a measurement of human inference, such as a survey. See Section 4.4 for our realization.
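The counting scheme can be sketched in a few lines. The sketch assumes that the preference score is the difference between the 'preferred' and 'avoided' counters, normalized by the total number of passes; this is one plausible reading of the definition that matches the stated sign convention, not a confirmed formula. Segment identifiers and function names are hypothetical.

```python
from collections import Counter

def preference_scores(trip_pairs):
    """Compute a score in [-1, 1] per road segment.

    trip_pairs: iterable of (matched_segments, routed_segments), one per trip.
    """
    n_pref, n_avoid, n_total = Counter(), Counter(), Counter()
    for matched, routed in trip_pairs:
        matched, routed = set(matched), set(routed)
        for segment in matched - routed:   # used in reality, not by the router
            n_pref[segment] += 1
        for segment in routed - matched:   # suggested by the router, not used
            n_avoid[segment] += 1
        for segment in matched | routed:   # passed by either version
            n_total[segment] += 1
    # Assumed normalization: (preferred - avoided) / total passes.
    return {s: (n_pref[s] - n_avoid[s]) / n_total[s] for s in n_total}

def classify(score):
    """Map a score to the three classes used for the classification metrics."""
    if score > 0:
        return "preferred"
    if score < 0:
        return "avoided"
    return "neither"

# Two toy trips over hypothetical segment ids.
pairs = [(["a", "b"], ["a", "c"]), (["a", "b"], ["a", "b"])]
scores = preference_scores(pairs)
labels = {segment: classify(score) for segment, score in scores.items()}
```

Computing `scores` once for the raw data and once for a synthetic dataset yields the two vectors that can then be compared via correlation or classification metrics.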
Finally, we propose to assess the spatial distribution with respect to traffic flow at intersections, which is another relevant application [15], for example for traffic signal management [23]. In this context, we refer to traffic flow as the entirety of movement patterns at an intersection. A movement pattern is described by the arrival road segment, the turning behavior, and the destination road segment. Similar patterns are grouped into movement clusters. As trips are already matched to the road network, movement patterns can easily be aggregated by clustering trips spatially based on a suitable distance metric. We refer to Section 4.5 for our implementation.
In summary, we propose to measure the utility of trip data by comparing synthetic data, after it has been matched with the road network, to its routed counterpart with respect to the following aspects: (1a) distribution of trip lengths, (1b) distribution of trip lengths in relation to the OD straight-line distance, (2) distribution of traffic volumes on road segments, (3) identification of avoided and preferred roads, and (4) traffic flow at large intersections.

EXPERIMENTAL SETUP

Raw dataset
For our evaluation, we use the openly available SimRa dataset [2,3,12]. SimRa [12] is a smartphone app for cyclists that records GPS traces, sampled every few seconds, to track near-miss incidents. We use the trips collected in the central area of Berlin, limited to records within longitudes of 13.267 and 13.482 and latitudes of 52.445 and 52.576, thus covering about 14 x 14 km². The dataset comprises 28,345 trips between 2019 and 2022. Near-miss incident information is discarded for our evaluation.

Configurations for synthesis models
For each evaluated synthesis algorithm, including both the differentially private (DP) and non-DP variants, when applicable, we generate a synthetic dataset containing 5,000 trips. To account for the non-deterministic nature of model-based data generation, we repeat this procedure five times and average the results. A privacy budget of ε = 2 is used for the DP condition, which is the highest privacy budget setting in the evaluations of AdaTrace and PrivTrace. As the implementations of AdaTrace and PrivTrace do not provide a non-DP option, we set a very large privacy budget of ε = 1,000,000 for the base condition 'without DP'. Datasets are generated on a machine equipped with an Nvidia A30 GPU, an Intel(R) Xeon(R) Gold 6346 CPU, and 251 GB RAM.
The choice of spatial granularity is pivotal to our utility formulation. Thus, we employ at the very least the finest granularity proposed by the respective authors. When computationally feasible, we opt for even finer resolutions. It is noteworthy that AdaTrace uses an adaptive grid, adjusting the spatial granularity based on the density within each grid cell. However, as explained in Section 2, the distance between consecutive points is determined by the top-level grid, which defaults to 6 x 6 cells in the respective source code. For the utilized SimRa dataset, this would result in a top-level cell size of 2.5 km x 2.5 km and an average sequence length of 3.6 points. Such distances between consecutive points would not meet the requirements of our utility measurement, as stated in Section 3. We thus adjust the top-level grid to 28 x 28 cells, which corresponds to a cell size of 500 m x 500 m. A finer-grained grid with a cell size of 250 m x 250 m exceeds the available memory.
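The relation between the study area, the number of top-level cells per side, and the resulting cell size can be checked with a short calculation. The sketch below uses an equirectangular approximation with roughly 111.32 km per degree of latitude, which is an assumption made here for illustration; the bounding box is the one stated for the raw dataset.

```python
import math

KM_PER_DEGREE_LAT = 111.32  # rough spherical approximation (assumption)

def bbox_extent_km(lon_min, lon_max, lat_min, lat_max):
    """Approximate width and height of a small bounding box in kilometres."""
    mid_lat = math.radians((lat_min + lat_max) / 2)
    width = (lon_max - lon_min) * KM_PER_DEGREE_LAT * math.cos(mid_lat)
    height = (lat_max - lat_min) * KM_PER_DEGREE_LAT
    return width, height

def cell_size_km(extent_km, cells_per_side):
    """Edge length of one cell when the extent is split into equal cells."""
    return extent_km / cells_per_side

width_km, height_km = bbox_extent_km(13.267, 13.482, 52.445, 52.576)
default_cell_km = cell_size_km(width_km, 6)    # default top-level grid
adjusted_cell_km = cell_size_km(width_km, 28)  # adjusted top-level grid
```

Under this approximation the area measures roughly 14.6 km per side, so 6 cells per side give cells of about 2.4 km and 28 cells per side give cells of about 520 m, in line with the rounded values of 2.5 km and 500 m.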
The spatial granularity of the top-level grid of PrivTrace is determined by a single parameter, which we set to 28, corresponding to a resolution of 500 m, for the sake of computational feasibility. Using a finer grid with a resolution of 250 m significantly increases the runtime from 15 minutes to 34.5 hours. We further adjust AdaTrace and PrivTrace to generate coordinates with a precision of four decimals, as both round to only two decimals, which would correspond to a granularity of ∼1 km.
The BiLSTM uses a neural network architecture similar to that of [4]. To speed up training and prevent overfitting, we implement an early stopping criterion that stops the training if the validation loss increases or stagnates for more than one epoch. Also, to enhance utility, we draw start locations from the distribution of start locations in the raw data instead of relying on random sampling. We use a grid resolution of 250 m, which is roughly equivalent to the authors' fine-granular choice. We also tested a finer resolution of 100 m but discarded it as computationally infeasible: it takes 8 hours for training and about 25 hours for trip generation, compared to 75 minutes for training and 8 hours for generation at a resolution of 250 m. Like the authors, we modify the final dense layer based on the 3,188 cells that have been visited at least once. We also conduct a preprocessing step in which all trips above a given trip length are discarded to save training resources, and set the maximum length to 50 steps. In our setup, early stopping is triggered after 16 epochs, with a training loss (accuracy) of 0.44 (86.67%) and a validation loss (accuracy) of 0.54 (85.68%).
To ensure DP, we apply the Laplace mechanism to the distribution of start locations and trip sequence lengths, respectively. As a fine-granular grid is expected to decrease utility in a DP setup due to a higher level of required noise, we initially test the DP variation with a coarser resolution of 500 m. The BiLSTM final dense layer is accordingly adjusted to 900 cells. The BiLSTM utilizes the DP Adam optimizer with an L2 clipping norm of 1, a noise multiplier of 1.4, and 64 micro-batches. The privacy budget is split such that ε = 1 is allocated for the BiLSTM and ε = 0.5 for each distribution. The batch size is reduced to 128 (due to memory limits), and 20 epochs are trained without early stopping. Training takes 46 hours, many times longer than for the non-DP version, yielding a training loss (accuracy) of 5.59 (10.56%) and a validation loss (accuracy) of 5.41 (12.82%). Due to results that are unusable for map matching, as we show in Section 5.1, we refrain from evaluations with a finer grid resolution.
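The role of the clipping norm and the noise multiplier can be illustrated with a plain-Python sketch of one DP-SGD gradient aggregation. It is a conceptual illustration only and not the TensorFlow implementation: every example is treated as its own micro-batch, the gradients are toy vectors, and the function names are hypothetical.

```python
import math
import random

def clip(gradient, clip_norm):
    """Scale a gradient vector so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    factor = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * factor for g in gradient]

def dp_average_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each gradient, sum, add Gaussian noise, and average."""
    clipped = [clip(g, clip_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[d] for g in clipped) for d in range(dim)]
    # Noise standard deviation is noise_multiplier * clip_norm per coordinate.
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [value / len(clipped) for value in noisy]

grads = [[3.0, 4.0], [0.3, 0.4]]  # toy per-example gradients
noiseless = dp_average_gradient(grads, clip_norm=1.0, noise_multiplier=0.0,
                                rng=random.Random(0))
noisy = dp_average_gradient(grads, clip_norm=1.0, noise_multiplier=1.4,
                            rng=random.Random(0))
```

With a clipping norm of 1 and a noise multiplier of 1.4, the noise added to the summed gradient has a standard deviation larger than any single clipped contribution, which helps explain why small batches and many output cells degrade accuracy so strongly.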
We run DP-Loc with a 500 m grid resolution. As the output comprises multiple datasets for different iterations of a Metropolis-Hastings algorithm, we follow the authors' advice for smaller datasets and use 100 iterations. Again, due to results unsuitable for map matching (see Section 5.1), we do not test further resolutions.
To employ TrajGAIL without translating the Berlin road network to the TrajGAIL network notation, we define the network based on a grid. For each cell, its eight neighboring cells are considered to be within reach according to one of eight respective actions. We project the dataset to a grid with a 500 m resolution. This results in a significantly larger network than the dataset evaluated by the authors of TrajGAIL, which only encompasses 9 intersections, and thus in substantially higher computation times. While one iteration for their dataset only takes 8 seconds, the processing time for the SimRa dataset extends to approximately 3.5 minutes. Consequently, running 20,000 iterations, similar to the authors' evaluation, would require over 48 days to complete. Due to this computational infeasibility, TrajGAIL is excluded from further evaluations.
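The runtime estimate follows directly from the stated per-iteration time, assuming that the iteration time stays constant over the whole run:

```latex
\[
20{,}000 \times 3.5\,\text{min} = 70{,}000\,\text{min}
\approx 1{,}167\,\text{h} \approx 48.6\,\text{days}
\]
```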

Matching and routing implementation
We use the Open Source Routing Machine (OSRM) [18] match service (API) to snap trips onto the OpenStreetMap road network. The parameter `radiuses` of the API is intended to set the GPS precision. We chose a setting of 10 meters for the raw and 20 meters for the synthetic datasets, which proved to be a reasonable setting within acceptable computation time. To set up OSRM, we used the OpenStreetMap data excerpt for Berlin from 17th November 2022, provided by Geofabrik, with the bicycle traffic mode. There are cases in which OSRM map matching does not produce entirely accurate results; see Appendix 6.1 for potential issues and the respective validation analyses. For the routing baseline, we use the OSRM Route Service based on a similar OSRM setup. See Figure 2 for an example of the created trip variations.
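For orientation, the following sketch builds request URLs for the OSRM match and route services without sending them. It assumes a locally hosted OSRM instance at `http://localhost:5000`, a profile name of `bike`, and that the GPS-precision parameter mentioned above is the match service's `radiuses` option; the host, the profile name, and the helper functions are assumptions for illustration.

```python
from urllib.parse import urlencode

def format_coords(coords):
    """OSRM expects 'lon,lat' pairs separated by semicolons."""
    return ";".join(f"{lon:.6f},{lat:.6f}" for lon, lat in coords)

def build_match_url(coords, radius_m, host="http://localhost:5000", profile="bike"):
    """URL for the match service with one search radius per input point."""
    query = urlencode(
        {
            "radiuses": ";".join(str(radius_m) for _ in coords),
            "geometries": "geojson",
            "overview": "full",
        },
        safe=";",  # keep the semicolon separators unescaped
    )
    return f"{host}/match/v1/{profile}/{format_coords(coords)}?{query}"

def build_route_url(origin, destination, host="http://localhost:5000", profile="bike"):
    """URL for the route service between the origin and destination of a trip."""
    query = urlencode({"geometries": "geojson", "overview": "full"})
    return f"{host}/route/v1/{profile}/{format_coords([origin, destination])}?{query}"

# Hypothetical synthetic trip as (lon, lat) pairs, matched with a 20 m radius.
trip = [(13.3, 52.5), (13.301, 52.501)]
match_url = build_match_url(trip, radius_m=20)
route_url = build_route_url(trip[0], trip[-1])
```

The match request carries every point of the trip, whereas the route request only carries its first and last point, which mirrors how the matched and routed variations of a dataset are derived.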

Road preference survey
We evaluate the downstream task of road preference detection with a survey on 40 selected meaningful examples of roads. Participants are asked to decide for each road whether they consider it to be 'avoided', 'preferred', or 'not recognizable', based on the displayed preference score. To select roads, we initially filter segments that have clearly been either preferred or avoided by restricting to those with a preference score > 0.5 or < −0.5. Next, we limit the selection to the top 10% of visited and routed segments to focus on highly frequented areas. Finally, to obtain coherent clusters representing entire roads that are either avoided or preferred, adjacent remaining segments of the same class are merged. From the largest clusters, 20 avoided and 20 preferred roads are selected. The survey consists of map cutouts displaying preference scores alongside the selected roads (see Figure 3). To enhance clarity, only preference scores from the top 25% of frequented segments are shown, as these are more likely to be robust and representative. The survey is conducted online, in German, and for each condition, at least 10 people are recruited.
Each questionnaire contains only one condition to prevent learning effects.

Traffic flow evaluation
We evaluate traffic flow preservation based on ten major intersections. They are selected by initially filtering the top 15 most visited cells from the raw data based on a grid with 500 m resolution and manually setting ten meaningful bounding boxes within them. For each intersection and each routed and matched dataset, trips are cut according to the respective bounding box. The pairwise distance between these trips is computed with the Hausdorff distance (HD). Complete-linkage agglomerative hierarchical clustering is employed based on the resulting distance matrix. To determine movement clusters, the hierarchical tree is cut at a height of 5 meters (see footnote 3). To assess the preservation of movement patterns by synthesis algorithms, the clusters of each synthetic dataset are linked with the clusters of the raw data. To link clusters, we employ a simple greedy matching algorithm that iterates over all clusters in the raw data. For each raw data cluster, we find the synthetic data cluster with the smallest HD. If the distance is smaller than or equal to 5 m, they are considered a match; otherwise, the raw data cluster remains unmatched. To ensure computational feasibility, not all trips but randomly selected representatives from each cluster of both datasets are used for distance computation. Finally, we compare the counts for each movement pattern using the normalized discounted cumulative gain (nDCG), which is a common measure to assess ranking quality in information retrieval. Only the k highest-ranked scores are considered. The normalized version has a range from 0 to 1, where a score of 1 signifies perfect retrieval.
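The clustering and ranking steps can be sketched in plain Python as follows. The sketch makes several simplifying assumptions: trips are lists of points in projected metric coordinates, the Hausdorff distance is computed on the trip vertices only, the clustering is a naive implementation that merges clusters while the complete-linkage distance stays within the cut height, and the nDCG uses raw-data counts as relevance and synthetic counts to determine the ranking. The latter is one possible reading of the comparison, and all names are hypothetical.

```python
import math

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two point sequences."""
    def directed(u, v):
        return max(min(math.dist(p, q) for q in v) for p in u)
    return max(directed(a, b), directed(b, a))

def complete_linkage(trips, cut):
    """Merge clusters while the largest pairwise distance stays within `cut`."""
    n = len(trips)
    dist = {(i, j): hausdorff(trips[i], trips[j])
            for i in range(n) for j in range(i + 1, n)}
    clusters = [[i] for i in range(n)]

    def linkage(c1, c2):
        return max(dist[(min(i, j), max(i, j))] for i in c1 for j in c2)

    while len(clusters) > 1:
        d, x, y = min((linkage(clusters[x], clusters[y]), x, y)
                      for x in range(len(clusters))
                      for y in range(x + 1, len(clusters)))
        if d > cut:
            break
        merged = clusters.pop(y)  # y > x, so index x stays valid
        clusters[x].extend(merged)
    return clusters

def ndcg_at_k(raw_counts, synthetic_counts, k):
    """nDCG with raw counts as relevance, ranked by the linked synthetic counts."""
    order = sorted(range(len(raw_counts)),
                   key=lambda i: synthetic_counts[i], reverse=True)
    dcg = sum(raw_counts[i] / math.log2(rank + 2)
              for rank, i in enumerate(order[:k]))
    ideal = sorted(raw_counts, reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg else 0.0

# Three toy trips through an intersection: two parallel, one turning off.
trips = [[(0, 0), (10, 0)], [(0, 1), (10, 1)], [(0, 0), (0, 10)]]
clusters = complete_linkage(trips, cut=5.0)
same_ranking = ndcg_at_k([10, 5, 1], [8, 6, 0], k=3)
reversed_ranking = ndcg_at_k([10, 5, 1], [0, 6, 8], k=3)
```

A synthetic dataset that ranks the movement patterns in the same order as the raw data reaches a score of 1, while a reversed ranking is penalized most for misplacing the dominant pattern.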

Matching ability
The longer the distance between consecutive points, the less suitable is the application of map matching; in the extreme case where a trip would only include origin and destination points, map matching would be equal to routing. Thus, for synthetic data to be usable for map matching, we require the distance between consecutive points for the majority of trips to be within the magnitude of the spatial resolution used for data generation. Table 1 shows the distribution of distances between two consecutive points for each synthetic dataset (excluding repetitions of identical locations). While AdaTrace (DP), PrivTrace (DP), and BiLSTM yield distances in the required order of magnitude, DP-Loc proves unsuitable for map matching with a median distance of 1.4 km (1.5 km for the DP version). An examination of the trips generated by DP-Loc revealed that, after only a few initial points, a large jump to the destination followed. As our evaluation is based on road network matching, we discard DP-Loc from further analyses, as well as the DP version of BiLSTM, which only consists of random jumps with a median distance of 5 km. This yields the following final set of synthetic datasets for the evaluation: AdaTrace, AdaTrace DP, PrivTrace, PrivTrace DP, and BiLSTM.

Footnote 3 (cutoff value in the traffic flow evaluation): In experiments using various cutoff values, the relatively strict value of 5 m proved to be the most suitable, as it effectively prevents false positives while minimizing false negatives (since trips are matched to the same road network, resulting in minimal variance).
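The suitability check on consecutive-point distances can be sketched as follows, assuming trips as lists of points in projected metric coordinates. Repeated identical locations are skipped before distances are computed; the helper names and the toy trips are illustrative.

```python
import math
import statistics

def consecutive_distances(trip):
    """Distances between consecutive points, skipping repeated identical locations."""
    if not trip:
        return []
    deduplicated = [trip[0]] + [p for prev, p in zip(trip, trip[1:]) if p != prev]
    return [math.dist(a, b) for a, b in zip(deduplicated, deduplicated[1:])]

def median_step(trips):
    """Median distance between consecutive points over a whole dataset."""
    distances = [d for trip in trips for d in consecutive_distances(trip)]
    return statistics.median(distances)

# Toy dataset: one trip with regular 500 m steps, one consisting of a single jump.
regular_trip = [(0, 0), (0, 0), (300, 400), (300, 400), (600, 800)]
jump_trip = [(0, 0), (3000, 4000)]
steps = consecutive_distances(regular_trip)
dataset_median = median_step([regular_trip, jump_trip])
```

Comparing the resulting median with the grid resolution used for generation (e.g., 500 m) indicates whether the points of a dataset are dense enough for map matching to differ meaningfully from routing.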
See Appendix 6.2 for visualizations of spatial distributions and example trips of each original dataset (without DP) and Appendix 6.3 for spatial distributions of matched and routed variations.
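As an illustration of the matching-ability check, the following sketch computes the median distance between consecutive points while skipping repeated identical locations. It assumes trips are sequences of (latitude, longitude) tuples in degrees and uses the haversine formula; the function names are hypothetical and not taken from the evaluated implementations.

```python
import math
from statistics import median

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters


def haversine_m(p, q):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def consecutive_distances(trip):
    """Distances between consecutive points, excluding repeated identical locations."""
    return [haversine_m(p, q) for p, q in zip(trip, trip[1:]) if p != q]


def median_consecutive_distance(trips):
    """Median over all consecutive-point distances of all trips."""
    return median(d for trip in trips for d in consecutive_distances(trip))
```

A dataset would pass this check if the resulting median is in the order of magnitude of the grid resolution used for data generation.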

Trip lengths
As shown in Table 2, all synthetic median SL distances tend to be shorter than the raw data's SL distance of 4.57. AdaTrace is most similar and within a reasonable distance, followed by BiLSTM, while the other three synthetic datasets are too far off. More precisely, PrivTrace is much shorter with a median distance of 1 and even only 470 for its DP version, while the SL distance of AdaTrace DP is a lot longer than that of the raw data. In contrast to the described tendency for SL distances, the original trips of AdaTrace and BiLSTM tend to be longer than the original raw trips. Interestingly, despite AdaTrace DP having the largest deviation in terms of absolute value, its SL ratio is closest to that of the raw data. While PrivTrace (DP) remains significantly below the respective value of the original data, the increase compared to the straight line is the highest with a median of 216% (184%) of the SL distance. These observations suggest that synthetic data is more circuitous than real-life trips. This effect is even more evident in the matched datasets, as the SL ratio increases even more across all models.
As expected, the routing algorithm reliably produces trips with a median ratio to the SL distance of 120%-140% for all models. In comparison to the matched synthetic trips, the routed lengths are thus much closer to those of the raw trips and, for AdaTrace (DP) and BiLSTM, also with regard to the absolute value. Thus, none of the models succeeds at producing plausible trip lengths.
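The two quantities compared in this section, trip length and its ratio to the straight-line (SL) distance, can be sketched as follows. The sketch assumes trips are given as projected coordinates in meters; the function names are illustrative only.

```python
import numpy as np


def trip_length(points) -> float:
    """Sum of Euclidean distances between consecutive points."""
    pts = np.asarray(points, dtype=float)
    return float(np.linalg.norm(np.diff(pts, axis=0), axis=1).sum())


def sl_distance(points) -> float:
    """Straight-line distance between origin and destination."""
    pts = np.asarray(points, dtype=float)
    return float(np.linalg.norm(pts[-1] - pts[0]))


def sl_ratio(points) -> float:
    """Trip length relative to the SL distance (1.0 means a perfectly straight trip)."""
    sl = sl_distance(points)
    return trip_length(points) / sl if sl > 0 else float("nan")
```

For example, an L-shaped trip of 300 m followed by 400 m has an SL distance of 500 m and thus an SL ratio of 1.4, i.e., 140% of the straight line.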

Traffic volume
Table 3 shows the JSD between matched raw data and all dataset variations, based on aggregations on a grid with a resolution of 40 (footnote 5). The JSD values of the synthetic data SL versions are all in the same order of magnitude as the JSD of the raw SL version, providing a baseline upper bound of the JSD for each synthetic dataset. (Recall that the higher the value, the lower the similarity of the respective datasets.) The spatial distributions of the AdaTrace and PrivTrace original datasets, with and without DP, improve compared to the straight-line distribution, though only marginally. The JSD for the original BiLSTM dataset even increases.
All JSD values for the matched versions show the expected decrease, supporting our claim that map matching is needed for usable results on a street level. For AdaTrace and PrivTrace, the results outperform their routed counterparts, at least slightly. On the contrary, the routed version of BiLSTM outperforms its matched one. The better performance of AdaTrace and PrivTrace is likely due to their adaptive grid, which captures 'hotspots' on a more fine-granular level than the top-level grid. The low performance of the BiLSTM model is presumably rooted in the static grid utilized for training and data generation, and in the difference between the grid resolution chosen in training (250) and in evaluation (40), the latter reflecting street-level relations. Experiments show that a JSD based on a coarser grid using a resolution of 250 yields a superior value also for the matched BiLSTM data in comparison to the routed version. Thus, the BiLSTM might only obtain superior results for its matched data on a street level if a more fine-granular resolution is used for data generation. It is thus interesting to note that map matching is not sufficient to compensate for the coarse resolution used for data generation.
AdaTrace shows the best overall results. Its matched version yields the only JSD that is competitive with the routed raw version. Also, the JSD values of AdaTrace DP are only slightly worse than those of AdaTrace without DP, unlike PrivTrace, whose DP version shows a markedly decreased performance.
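To make the traffic volume comparison concrete, the following sketch aggregates points on a regular grid and computes the Jensen-Shannon divergence (JSD) between two resulting distributions. It assumes projected coordinates in meters and a base-2 logarithm, so that values range from 0 (identical) to 1 (disjoint); the logarithm base and the function names are assumptions, not details given in the text.

```python
import numpy as np


def grid_counts(points, bounds, cell_size):
    """Count points per cell of a regular grid covering bounds = (min_x, min_y, max_x, max_y)."""
    min_x, min_y, max_x, max_y = bounds
    nx = max(1, int(np.ceil((max_x - min_x) / cell_size)))
    ny = max(1, int(np.ceil((max_y - min_y) / cell_size)))
    pts = np.asarray(points, dtype=float)
    ix = np.clip(((pts[:, 0] - min_x) // cell_size).astype(int), 0, nx - 1)
    iy = np.clip(((pts[:, 1] - min_y) // cell_size).astype(int), 0, ny - 1)
    counts = np.zeros((nx, ny))
    np.add.at(counts, (ix, iy), 1)  # unbuffered add handles repeated cells
    return counts


def jsd(p_counts, q_counts) -> float:
    """Jensen-Shannon divergence (base 2) between two count grids of equal shape."""
    p = np.asarray(p_counts, dtype=float).ravel()
    q = np.asarray(q_counts, dtype=float).ravel()
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Because the result depends on the cell size, a comparison at a resolution of 40 and one at 250 can rank the same datasets differently, as observed for the BiLSTM above.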

Road preference
5.4.1 Statistical similarity.
We assess the preservation of road preference in synthetic data by comparing their preference scores with the ground truth obtained from the raw data. We test variations of two hyperparameters for the classification task. (1) Rarely visited segments are discarded based on a frequency threshold. Different thresholds are evaluated: the top-100%, top-75%, and top-25% of frequented segments are considered. Note that a threshold of top-100% frequented segments means that all segments with at least 1 record are included. (2) To account for cases where values are close but do not fall into the same class (e.g., the preference score for a segment is 0.1 according to the raw and -0.1 according to the synthetic dataset), we set a tolerance level such that two values are considered correctly classified if they do not differ by more than the tolerance level. We consider values of 0 and 0.3.
As shown in Table 4, the BiLSTM clearly outperforms AdaTrace and PrivTrace on all settings with respect to the correlation coefficient, the accuracy, and the F1-score of the class 'avoided', while it is outperformed in terms of the F1-score of the 'preferred' class in three out of four hyperparameter settings, though with less pronounced differences. As expected, setting a tolerance level increases the scores, as classes are assigned more generously. Further, almost all scores intuitively increase with an increasing frequency threshold, as highly frequented cells are expected to be captured with higher accuracy by the models.
The class 'preferred' is detected best by all models. This is especially true for AdaTrace (DP) and PrivTrace (DP) and can likely be explained by the selection of coordinates in their generation algorithm: Recall that data generation is based on discrete grid cells. While the BiLSTM always uses the center of the grid cells to create coordinates, AdaTrace and PrivTrace sample a random point from within the cell. This produces more variation and thus more visited cells (see Appendix 6.3 for maps of respective spatial distributions). Therefore, there is a bias towards 'preferred' classification: AdaTrace (PrivTrace) classifies 79% (79%) of considered cells as preferred and only 17% (13%) as avoided, while only 67% of cells are preferred in the raw data and 36% avoided. Thus, the high 'preferred' F1-score is especially driven by a high recall. In summary, the BiLSTM can be considered the best-performing model in terms of road preference and avoidance.
Considering DP models for both AdaTrace and PrivTrace, there is only a slight decrease in terms of accuracy, while the correlation coefficient decreases to a larger extent. Interestingly, both PrivTrace DP and AdaTrace DP detect avoided roads slightly better than their non-DP counterparts. Further investigation showed that this is a noise-induced artifact causing more variation in OD pairs and thus a higher share of routed cells (24% for both AdaTrace and PrivTrace), which leads to an increased recall of 'avoided' and a decrease for 'preferred', while the precision either decreases or remains the same for all classes.
The outcomes do not indicate a perfect classification but suggest that models have the potential to identify preferred and avoided roads to a certain extent, particularly when focusing solely on heavily trafficked sections.
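The effect of the tolerance level on the classification can be illustrated with a small sketch. It assumes that classes are derived from the sign of the preference score (positive: preferred, negative: avoided), which is consistent with the example of 0.1 versus -0.1 given above but is not stated explicitly here; the function names are hypothetical.

```python
import numpy as np


def classify(scores) -> np.ndarray:
    """Assumed class assignment: 1 = preferred, -1 = avoided, 0 = neutral."""
    return np.sign(np.asarray(scores, dtype=float)).astype(int)


def tolerant_accuracy(raw_scores, syn_scores, tolerance: float = 0.0) -> float:
    """Share of segments that fall into the same class or differ by at most the tolerance."""
    raw = np.asarray(raw_scores, dtype=float)
    syn = np.asarray(syn_scores, dtype=float)
    same_class = classify(raw) == classify(syn)
    within_tolerance = np.abs(raw - syn) <= tolerance
    return float(np.mean(same_class | within_tolerance))
```

With raw scores `[0.1, 0.8, -0.5, 0.6]` and synthetic scores `[-0.1, 0.5, -0.2, -0.4]`, the first segment only counts as correct once a tolerance of 0.3 is applied, while the last segment remains misclassified under both settings.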

Survey.
We assess the homogeneity of answers per condition with Krippendorff's alpha α (see Table 5), which is high for raw data and AdaTrace, while only medium for all other datasets (α ≥ 0.66). However, these discrepancies seldom arise because participants directly contradict each other's assessments. Instead, they typically arise from certain participants adopting a more conservative stance, opting for 'not recognizable' rather than making a definitive choice between 'avoided' or 'preferred'. In particular, the results from the majority vote are robust when compared with sums of (in)correct answers. Table 5 presents the percentage of accurately classified, incorrectly classified, and unrecognizable roads among the 40 selected roads, based on the majority vote of participants. An accuracy of 100% in raw routes supports our defined ground truth. Even for all synthetic datasets, misclassification of roads was minimal, with most discrepancies attributed to being classified as 'not recognizable'. This trend was particularly noticeable in PrivTrace (DP). Generally, preferred roads are more often classified correctly than avoided ones, except for BiLSTM. AdaTrace demonstrates the highest performance, closely followed by BiLSTM. Even in the DP version of AdaTrace, 75% of the roads are still correctly classified. However, this also suggests that about every fourth road is misclassified or not recognizable. Considering that we only included the top 10% most frequented cells for the selection of roads, these results presumably do not suffice in practice.
Overall, the assessment of road preference indicates that especially AdaTrace and BiLSTM demonstrate a certain level of success in preserving spatial distribution information. We also find a set of inconclusive results with regard to different analyses concerned with spatial distribution: With regard to PrivTrace, even though its JSD slightly outperforms the BiLSTM's, it performs much worse in road preference classification. This is mainly driven by a poor detection of avoided streets and is likely due to implausibly short-distanced OD pairs, as seen in Section 5.2. Also, even though the BiLSTM obtains a slightly better F1-score for preferred segments than AdaTrace (DP) for the comparable hyperparameter setting of a top-25% frequency threshold, the majority of individuals could identify 15 percentage points (5 percentage points) more preferred roads based on AdaTrace (DP) data than on BiLSTM data. Once again, it is evident that the adaptive grid performs better than the static grid used by BiLSTM. The spatial distribution of matched trips (refer to Appendix 6.3) also reflects the visual characteristics of the grid. Although a similar distribution can be observed as in the raw data, there are numerous turns and edges. Employing a finer grid to project the training data could enhance the results.

Traffic flow at intersections
To assess the practical significance of the results, we analyzed the distribution of nDCG values, visualized in Figure 5. The routed version of BiLSTM outperforms the matching on almost all intersections. Even for the top-performing AdaTrace, matching outperforms routing only in 6 out of 10 instances. Additionally, in half of the cases, the nDCG scores for 'matched' results are equal to or below 0.5. These findings highlight that while certain intersections may exhibit accurate traffic flow representation, there remains a 50% likelihood of erroneous outcomes. Consequently, the practical utility of the results becomes limited.
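The nDCG values discussed here can be reproduced in spirit with the following sketch. It is one plausible formulation, not necessarily the exact one used in the evaluation: movement patterns are ranked by their counts in the synthetic data, the raw counts serve as relevance values, and the result is normalized by the ideal ranking. The function names and the use of raw counts as relevance are assumptions.

```python
import numpy as np


def dcg(relevances, k: int) -> float:
    """Discounted cumulative gain of the first k relevance values."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))  # log2(rank + 1)
    return float(np.sum(rel / discounts))


def ndcg_at_k(raw_counts, syn_counts, k: int) -> float:
    """nDCG of the ranking induced by synthetic counts, judged by raw counts.

    Both arrays are aligned per movement pattern; unmatched raw patterns
    carry a synthetic count of 0.
    """
    raw = np.asarray(raw_counts, dtype=float)
    syn = np.asarray(syn_counts, dtype=float)
    order = np.argsort(-syn, kind="stable")  # patterns ranked by synthetic count
    ideal_dcg = dcg(np.sort(raw)[::-1], k)
    return dcg(raw[order], k) / ideal_dcg if ideal_dcg > 0 else 0.0
```

Under this formulation, a dataset that ranks the most frequent raw movements first reaches a score of 1, while a reversed ranking is penalized most strongly at the top positions.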

DISCUSSION AND CONCLUSION
The potential of synthetic data lies in its (promised) ability to provide full flexibility in analyses and modeling endeavors while ensuring a high level of privacy. In our research, we delved into the practical implications of 'full flexibility' for trip data consisting of fine-granular routes as typically recorded using GPS devices.
To address the question of what constitutes high utility for trip data and how it can be measured (RQ1), we assert that the metrics employed should prioritize the detailed, sequential nature of trip data. Our proposal suggests evaluating the lengths of trips, traffic volumes, preferences for specific road segments, and the flow of traffic at intersections (in addition to standard high-level metrics already used in the literature). These measurements (along with the utilization of synthetic trip data at the street level) necessitate an additional computationally intensive step known as 'map matching'. We further contend that synthetic data must surpass those generated by standard routing algorithms in order to claim a significant advantage for subsequent analyses.
It is worth noting that the models examined in this study, similar to the majority of models used for generating synthetic mobility data, largely exclude attributes that are crucial for real-life applications, such as timestamps, traffic mode, and user-level information. This particularly includes the inability to group trips per user. This limitation restricts the potential applications of synthetic data to spatial information alone, which prompted the consideration of the aforementioned metrics.
We evaluated the utility of five state-of-the-art models, AdaTrace, PrivTrace, DP-Loc, a BiLSTM-based model, and TrajGAIL, using the designated utility metrics on a dataset comprising approximately 30,000 bicycle trips in Berlin. Regarding RQ2, we observed that TrajGAIL failed to generate data within a reasonable computation time for city-scale scenarios, while DP-Loc frequently generated jumps that were unsuitable for map matching. The remaining three models maintained a certain degree of spatial distribution, as evidenced by the analysis of road preferences. AdaTrace showed the best overall results. It allowed survey participants to identify preferred roads with 90% accuracy and an F1-score of ≥ 0.7. Additionally, both AdaTrace and PrivTrace slightly outperformed the routing baseline in analyzing traffic volumes. The adaptive grid employed likely succeeds in maintaining 'hotspots' of the spatial distribution. The BiLSTM-based model might potentially yield superior results if trained on a grid with a finer resolution and a sufficiently large dataset. However, the resulting increase in computational cost raises concerns regarding its practical feasibility. Overall, the BiLSTM results remain inconclusive since, contrary to the findings regarding traffic volumes on road segments, it performed best in terms of statistical similarity of road preferences and achieved results similar to AdaTrace in human classification of road preferences. This underscores the need to consider multiple metrics as a reliable measure of the success of arbitrary downstream tasks related to the respective distribution.
Additionally, we found unsatisfactory utility when evaluating the ratio of trip length to straight-line distance, with clear superiority of the routing baseline for all models. With respect to traffic flow at intersections, only AdaTrace managed to surpass the routing baseline substantially among the evaluated models, but even then, the provided level of utility remained questionable.
With regard to RQ3, only AdaTrace and PrivTrace generated useful differentially private data. As the application of a BiLSTM without DP raises major privacy issues [17], further improvements are needed for it to serve as a sensible privacy technique. Notably, despite increasing AdaTrace's spatial resolution from the default setting of 6×6 to a more detailed 28×28 top-level grid, the DP version still provided superior results compared to PrivTrace without DP. It also outperformed the BiLSTM model in two analyses and was only slightly inferior in two other evaluations. However, the practical utility remains doubtful. In addition, it should be noted that all evaluated models only offer item-level DP as user IDs are discarded. This raises concerns about the provided privacy level, particularly if only a few power users contribute a significant portion of the data.
We further acknowledge that there may be more suitable map matching algorithms for synthetic data than the utilized OSRM implementation. However, we suggest advancing synthesis algorithms so that road networks are inherently considered, rather than optimizing post-processing techniques.
The SimRa dataset employed in this study is of significant importance from an urban planning point of view, given its representation of a substantial number of bicycle trips and riders' perceived safety. While the evaluation on other datasets that differ regarding size, city structure, or traffic modes might vary to a certain extent, we expect the overarching trends to remain consistent. We aim to broaden our analyses in subsequent research to further explore the models' suitability, especially for taxi trips and larger datasets.
In summary, our results raise the question of whether models claiming high flexibility truly provide sufficient benefits or, on the contrary, may do more harm by failing to deliver reliable results. If such models merely manage to maintain a moderate level of spatial granularity, there might be more accurate and privacy-preserving methods available through the provision of aggregate data. It is likely that generating synthetic data offering both high flexibility and strong privacy guarantees is not possible [20]. Instead of striving for full flexibility, it is advisable to clearly indicate the specific applications for which the respective model is well-suited and those for which it is not. For instance, synthetic trip data could prove valuable for development purposes or to gain a preliminary understanding of raw data before utilizing it in a controlled and secure environment for actual analyses. In such cases, the focus may be on maintaining the sampling rate and accuracy of GPS data rather than precisely mimicking the actual spatial distributions. For such applications, it may be more important to retain all attributes instead of solely focusing on spatial information.

APPENDIX
6.1 Map matching validity
There are cases in which the matching API is not suitable for capturing actually traveled bicycle routes. Figure 6 provides examples of potential issues. Also, it is possible for the API to entirely fail in producing a matched trace. Robust map matching is crucial for ensuring the validity of our results. Thus, we assess the frequency of complete matching failures and of matched trips shorter than 90% of their original counterpart, the latter indicating termination in the midst of the trace. Table 7 suggests that for all datasets, matching entirely fails only for at most 4% of trips and partly for 18%, which is acceptable for the purposes of our evaluation. To further assess the validity of the chosen matching algorithm, we evaluate the map matching using the Hausdorff distance. In an accurate map matching scenario, the Hausdorff distance should be small, indicating that no points in the matched trip are far away from their corresponding point in the original trip. As expected for successful map matching, all synthetic datasets show a Hausdorff distance in the order of magnitude of the selected grid resolution, and the raw data has a median Hausdorff distance of only 33.
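The two failure rates reported above can be computed as in the following sketch. It assumes that trip lengths are available per trip and that a complete matching failure is encoded as `None`; this encoding and the function name are illustrative assumptions.

```python
def matching_validity(original_lengths, matched_lengths, min_ratio=0.9):
    """Return (share of complete failures, share of partially matched trips).

    matched_lengths holds None where the matching produced no trace at all.
    A trip counts as partially matched if its matched length is below
    min_ratio times the original length.
    """
    n = len(original_lengths)
    failed = sum(m is None for m in matched_lengths)
    partial = sum(
        m is not None and m < min_ratio * o
        for o, m in zip(original_lengths, matched_lengths)
    )
    return failed / n, partial / n
```

For instance, with original lengths `[1000, 2000, 1500, 800]` and matched lengths `[990, None, 1200, 800]`, one of four trips fails entirely and one is matched only partly.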

Figure 1: Schematic illustration: segments traversed by the matched trip (green line) but not by its routed counterpart (purple line) are considered preferred (green cells); conversely, segments traversed only by the routed trip are considered avoided (purple cells).

Figure 2: Example of a raw (left) and an AdaTrace (right) trip in four variations.

Figure 3: Examples of survey questions. The preference score is overlaid on a map cutout alongside selected roads. Answers are given via radio buttons for each road.

Figure 4 shows an example of an intersection displaying the top five traffic movements for each dataset in its matched and routed version. In general, for the synthetic data to offer added value, the nDCG of the matched version should surpass that of the routed version. In Table 6, the averaged results across all intersections are displayed. It can be observed that AdaTrace performs better than

Figure 4: Example intersection showing the top five traffic movements for each matched and routed dataset without DP. Line thickness represents the number of trips.

Figure 5: Traffic flow retrieval according to nDCG at a rank cutoff of 3: each point corresponds to one intersection for selected models. Points above (below) the diagonal indicate a superior routed (matched) value.

Figure 6: Examples of map matching inaccuracies. Left: A cyclist used a footpath that was matched to a legally accessible path. Center: Start or end of the trace is off the road network (in this example, at a train station). Right: Matching terminates when there is no nearby path available (here, a cyclist rides on an uncharted path in a park).

Figure 7 visualizes the original datasets without DP with respect to their spatial distribution as well as example trips.

Figure 7: Spatial distribution (left) of original datasets without DP and respective example trips (right).

Figure 8 shows spatial distributions for each dataset without DP in matched and routed versions.

Figure 8: Spatial distribution of matched (left) and routed (right) datasets without DP.

Table 1: Distribution of distances between two consecutive points (excluding repetition of identical locations) in meters.

Table 2: Median of trip lengths for each dataset and their variations. In brackets: median ratio of respective trip length to its straight-line (SL) distance counterpart.

Table 3: JSD values for traffic volumes between each dataset variant and the baseline of matched raw data. Bold font indicates the best variation.

Table 4: Preference score evaluation for different frequency thresholds (as top-% frequented segments) and tolerance levels based on the Pearson correlation coefficient, accuracy, and F1-score, precision, and recall for the classes 'avoided' and 'preferred'. For each hyperparameter setting, i.e., each corresponding row per block, the best value is depicted in bold font.

Table 5: Results of the road preference survey for 40 selected roads, presented as percentages of accurately classified roads in total and separated by class (preferred, avoided), the percentages of unrecognizable and incorrectly classified roads, and group homogeneity (Krippendorff's α).

Table 6: Averaged nDCG for traffic flow retrieval at 10 intersections, computed for rank cutoffs of 3 and 5. An nDCG of 1 signifies perfect retrieval. Bold font represents the superior value between the matched and routed versions for each comparison.