Generative Models for Synthetic Urban Mobility Data: A Systematic Literature Review

Although highly valuable for a variety of applications, urban mobility data are rarely made openly available, as they contain sensitive personal information. Synthetic data aims to solve this issue by generating artificial data that resembles an original dataset in structural and statistical characteristics, but omits sensitive information. For mobility data, a large number of corresponding models have been proposed in the past decade. This systematic review provides a structured comparative overview of the current state of this heterogeneous, active field of research. A special focus is put on the applicability of the reviewed models in practice.


INTRODUCTION
Urban human mobility data is crucial for various applications, including urban planning [126], traffic management [79], and smart city applications [26]. The COVID-19 pandemic further highlighted its usefulness in pandemic analysis [3,41] and simulations [87]. However, there are limited openly available datasets, mainly due to privacy concerns. For instance, Culnane et al. [27] showed recently that as few as three time-stamped locations per person were sufficient to identify most of the individuals in a dataset consisting of public transit smart card records in Melbourne provided for a hackathon.
Another example is the publication of the New York City (NYC) taxi dataset, in which celebrities and their travel routes could quickly be identified [104].
While aggregated data can be used for some applications, innovation is limited without access to raw data. For instance, machine learning algorithms typically require granular data. These are currently employed to develop, e.g., next-location prediction models used to provide seamless mobility offers [18] or traffic mode recognition that enables more precise urban mobility analyses from Global Positioning System (GPS) data [96].
So far, classical anonymization techniques for location data such as obfuscation or cloaking have not been able to sufficiently balance privacy and utility or to scale to large datasets [40]. The generation of synthetic data, i.e., artificial data that resembles an original dataset in structural and statistical characteristics, has the potential to overcome this issue. It is considered especially useful when data release is required not only for the provision of open data, but also for internal sharing, software testing, development of machine learning models, or the deployment of privacy-preserving machine learning tools [57].

Authors' addresses: Alexandra Kapp, alexandra.kapp@htw-berlin.de, Hochschule für Technik und Wirtschaft Berlin, University of Applied Sciences, Berlin, Germany; Julia Hansmeyer, Hochschule für Technik und Wirtschaft Berlin, University of Applied Sciences, Berlin, Germany; Helena Mihaljević, Hochschule für Technik und Wirtschaft Berlin, University of Applied Sciences, Berlin, Germany.

Manuscript submitted to ACM. arXiv:2407.09198v1 [cs.CR] 12 Jul 2024
Synthetic data has been used in other fields, such as health and finance, for diagnostic classification [8,21] and fraud detection [6]. However, unlike tabular data, generating synthetic urban mobility data poses specific challenges. The sparsity and high dimensionality that result from the combination of time series with rich spatial information make it difficult to preserve complex semantic dependencies while ensuring privacy. Time-series health and finance data are also highly complex and thus pose their own set of challenges [6,29], though these are of a different nature, and thus the respective approaches cannot be transferred to mobility data.
In the past few years, a plethora of research articles addressing synthetic urban human mobility data generation has been published with currently more than 50 relevant approaches. It is thus hardly surprising that the rapid increase in knowledge production in a multidisciplinary field at the intersection of privacy, urban mobility, and data science has led to a state of research that is difficult to survey, especially regarding the applicability of the respective methods: the success of the heterogeneous technical approaches is defined and measured differently; while some approaches focus on providing formal privacy guarantees, others do not include any privacy considerations; the synthesis output varies, from models directed at generating fine-granular taxi trips to those striving to produce representative motions within an entire city; some algorithms utilize solely information from the given mobility dataset, while others incorporate knowledge about human mobility behavior or additional data sources such as the road network or census data.
We systematically review research addressing the generation of synthetic mobility data to provide a structured overview and comparison of existing approaches, foster further improvement of algorithms, and evaluate the applicability of the frameworks. The latter is particularly beneficial for practitioners who strive to generate synthetic mobility data in practice. Our survey allows for an informed assessment of the usefulness of methods for specific scenarios, as synthesis methods are often constructed with certain assumptions and heavily rely on the type of dataset and (implicitly) targeted applications.
Practitioners seeking to anonymize data typically search for a synthesis method that fits their specific application scenario and dataset. The following questions, which serve as the main structure of this review, are fundamental to guide such a decision-making process: (Q1) Is the synthesis method suitable for the given data? (Q2) How does the method work? (Q3) Does the method provide a required or satisfactory level of (a) privacy and (b) utility for the intended use case? We thus extract relevant information on datasets the developed methods are supposed to be applied to (Q1), the underlying algorithmic approaches (Q2), privacy considerations (Q3a), and utility evaluation measures (Q3b). In addition, we have organized the published approaches into groups based on the type of mobility they aim to address. For example, a 20-minute taxi trip is considered a different type of mobility compared to the movement patterns between meaningful locations of a person over a longer period of time. Grouping the approaches in this way allows for easier assessment of their suitability for a given application scenario. After presenting the main aspects of each method, we compare, as far as possible, the approaches within each of the categories.
The main contributions of this article are the following: (1) We systematically collect and review existing models for the generation of synthetic urban human mobility data, providing an overview of the current state of research. (2) We group, classify and compare different approaches according to used datasets, algorithms, privacy considerations, and evaluation measures. (3) We categorize each coded publication based on the type of mobility that it targets. We then briefly describe the reviewed models and compare the models within each group, thus disentangling this heterogeneous field in terms of practical focus.
Synthetic data refers to artificial data generated from a real dataset using a generative model that is fit to reproduce certain structural and statistical characteristics of the original data. Moreover, assuming that the models resemble certain statistical properties of the original dataset, they can be applied to generate an arbitrary amount of new trajectories.
Current literature provides various approaches to developing a generative model, from Markov models to deep learning algorithms. Our literature review encompasses all approaches in the above broad sense of a generative model that can be applied to generate a mobility dataset (cf. [57,71]). At the same time, we restrict ourselves to methods that create synthetic data to resemble an existing 'static' dataset, in contrast to solutions for location data protection in an online location-based service application such as a restaurant recommendation smartphone app (cf. [37,91,105]).
One of the main purposes for creating synthetic data is to preserve the privacy of individuals represented in the original data and thus enable wider access to relevant information and patterns without compromising individuals' privacy. It is considered one of the main privacy-enhancing approaches for the near future [89]. At the same time, statistical distributions of synthetic data are expected to approximate those of the original data well, so that analyses performed on both should yield similar results. The process, called synthesis, consists of building a model that captures structural and statistical properties such as multivariate relationships and interactions. Once a model is built, it can be used to sample or generate artificial data. The degree to which the synthetic data can accurately represent the real data is generally denoted as utility of the respective model (for the given dataset). The synthesis is an important factor for the levels of both utility and privacy that can be achieved [33,36].
Several existing surveys address research on privacy of mobility data. While some include methods for the generation of synthetic mobility data, none of them provide a comprehensive overview of the current state of the art in this area, which has grown remarkably in recent years. Errounda et al. [38] review approaches for securing differential privacy in the context of location data, with only a small portion of works dealing with the publication of private trajectory data. Fiore et al. [40], on the other hand, focus on privacy-preserving publishing of trajectories and elaborate on research about attacks against anonymized trajectories as well as solutions proposed to protect databases from such attacks. The authors list a few models for synthetic data generation (that do not utilize machine learning). Jiang et al. [55] examine privacy-preserving techniques in location-based services, thus focusing on real-time data and location-preserving privacy mechanisms based on obfuscation, cryptography, and cooperation and caching.
with synthesis methods based on generative adversarial networks (GAN), convolutional neural networks (CNN) and recurrent neural networks (RNN). Benarous et al. [9] systematically address the trade-off between privacy and accuracy when generating synthetic data for long location sequences by evaluating basic implementations of methods based on long short-term memory networks (LSTMs), Markov chains, and variable-order Markov models. Shin et al. [97] review and summarize models for mobility data synthesis based on generative adversarial networks.

RESEARCH METHODOLOGY
We systematically collect and review all published research literature addressing the generation of synthetic urban human mobility data. Our survey is based on the Digital Science Dimensions platform [95], since it shows the most exhaustive journal coverage among typically utilized scholarly databases [72,99]. Our process steps are described below and summarized in Figure 1. The process from the search query compilation to reference checking was completed in [...] OR trajectory OR movement OR trip OR 'sequential data') AND synth* AND generat*. This query was applied to the title and abstract fields and we limited the time range to the years 2012 to 2023. In addition, we limited the search to articles in English language, published in the research fields Information and Computing Sciences or Urban and Regional Planning. The respective structured search resulted in 2,035 articles, with 1,775 remaining after deduplication. The screening of titles and abstracts of these articles was performed by the first two authors. Every record without an explicit reference to the generation of synthetic urban human mobility data was excluded. A parallel independent assessment of the screening procedure with a random test set comprising 140 articles (∼10% of all articles) was performed by both researchers, yielding a Cohen's Kappa value of 0.925 and thus a very high inter-rater reliability. Any remaining discrepancies on the exclusion of papers were discussed and resolved. This step resulted in 115 articles included for an evaluation of their full texts.
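The inter-rater reliability reported above can be illustrated with a minimal sketch of Cohen's Kappa for two raters. The function name `cohens_kappa` and the ten toy screening decisions are assumptions made for illustration; they are not the review's data or tooling.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: share of items with identical labels.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_chance) / (1.0 - p_chance)

# Hypothetical screening decisions for ten records (illustration only).
rater_1 = ["exclude"] * 7 + ["include"] * 3
rater_2 = ["exclude"] * 7 + ["include"] * 2 + ["exclude"]
kappa = cohens_kappa(rater_1, rater_2)
```

In this toy case the raters agree on nine of ten records, yet kappa is noticeably lower than 0.9 because a large share of that agreement would be expected by chance when most records are excluded.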
The full texts of the remaining articles were skimmed through to further assess the quality and eligibility of the presented studies. The task was split between all three authors and performed independently. A total of 34 articles remained based on the following inclusion criteria: (1) The publication must propose a generative synthesis model as defined in Section 2. (2) The model must use and produce urban human mobility data (see Section 2). This excludes papers that solely or primarily address other types of movements such as robotic movements, indoor mobility, or animal traces. (3) Input and output data need to have the format of single trajectories, i.e., consist of sequences of spatial or spatio-temporal points (see Section 2). (In particular, we excluded papers that generate synthetic mobility networks, e.g., [73].) (4) The publication needs to propose a novel model. We excluded papers that solely apply or compare existing models, or vision papers that have not explicitly formulated their method (e.g., [70]).

STRUCTURED OVERVIEW
As outlined in the introduction, one of our main objectives is to facilitate the decision-making process for practitioners when selecting synthetic data models.This requires addressing inquiries regarding the suitability of one's data, understanding how a method works, and determining whether the method ensures a satisfactory level of both (a) privacy and (b) utility.
From each coded publication, we thus extracted information on the following topics: (1) utilized datasets; (2) algorithmic approach of the generative model; (3) privacy guarantees and evaluations; and (4) measures used to evaluate the utility.

Datasets
The datasets used in the reviewed literature constitute 'urban mobility data' as they reveal urban movements of humans (cf. Section 2). However, they differ substantially, with inherently different characteristics regarding, e.g., the typical level of (geo-temporal) granularity, additional attributes such as demographics or traffic mode, the ways they are recorded, or the typically associated use cases. Nevertheless, rather few papers specify formal requirements for the input data or explicitly state for what kind of data their method is suitable. This information can often only be inferred from the datasets chosen for evaluation or from the method description. At the same time, the properties of the datasets are paramount from an application perspective, because, for most practitioners in search of a suitable method, the dataset to be synthesized is already fixed.
Table 1 provides an overview of all mobility datasets used for evaluation in the reviewed literature including information on the dataset size, time range, and information on its availability.The indication of the dataset size is either given as the number of users, number of trajectories, or number of records; some provide a combination, some none of these.We categorize the datasets into GPS traces, social media data, mobile phone data, smart card data, surveys and simulation datasets.For detailed information on the general peculiarities of different data sources, we refer to Luca et al. [71].
The heterogeneity of the size and range of datasets is striking: The user size ranges from 100 to almost 7 million and the time range goes from one day to 4.5 years.Even though the amount of data is a relevant factor for the performance of a model, almost none of the publications indicates how the dataset size relates to the utility of their model.Overall, there is little discussion of the types of datasets and properties such as the number of trips per user or temporal resolution for which the proposed procedure is (not) well suited.

GPS traces.
The GPS [1] makes use of satellites to determine the geo-spatial positioning of electronic receivers, nowadays built into a wide range of devices such as smartphones, cars, or shared mobility vehicles. The majority of datasets used in the reviewed publications are GPS-related, originating from trips of shared bicycles, taxis, private vehicles, or smartphone apps. A wide range of taxi datasets was used, though it should be noted that they vary greatly: their size ranges from 60,000 [25] to 5 million trajectories [119,120]. The most commonly used datasets collected with mobile phones are GeoLife (Beijing) [124] and the Lausanne Data Collection Campaign (LDCC) [59].
GeoLife can be freely downloaded, thus its high accessibility explains its popularity. It only includes 182 users who were tracked over more than five years, but contributed trajectories rather sparsely, thus yielding a total of 'only' 17,621 trajectories. LDCC, on the other hand, is based on the continuous tracking of 168 individuals over a year using mobile [...]
These records contain information on time and location (the position of the cell tower the user was connected to) during a phone call or any other billable telecommunication transaction such as sending or receiving text messages.
None of the CDR datasets used in the reviewed articles is publicly available; instead, they have only been shared with the respective researchers directly by the providers.The lack of data management plans and information on the data source inhibits the reproducibility of the respective studies [71].
Social media data.Location-based social networks (LBSN) reflect real-life social networks through websites or mobile phone applications where users can share ideas, photos, activities, events, and interests.Such LBSN provide either a precise geo-tag or the position of a predefined location, like a city, an area, or a restaurant.Based on a sequence of published posts by a user, the spatial and temporal information can be used to construct a user's trajectory [71,123].
There are three LBSN datasets from the platforms Foursquare, Gowalla, and Brightkite. The latter two were location-based friendship networks, closed in the meantime, where users shared their location by checking in. The corresponding datasets Gowalla [24] and Brightkite [24], consisting of ∼6 million and ∼4 million user check-ins, respectively, were made publicly available through third parties who collected the data via publicly available application programming interfaces (API).
Foursquare is a recommendation app for restaurants, stores, and other points of interest, based on users' current location and historic preferences. Each check-in is associated with a timestamp, GPS coordinates, and a semantic meaning which is represented by fine-grained venue categories. As Foursquare did not provide a public API, check-ins that were also shared on Twitter were retrieved, yielding three Foursquare datasets: (1) the NYC Restaurant Rich dataset [113-115], (2) the NYC and Tokyo check-in dataset [116] which includes check-ins to various types of locations in NYC and Tokyo, and (3) the Foursquare Weekly dataset [74] which is a subset of the latter and comprises only 193 users and ∼3,000 trajectories in NYC.
Smart card (SC) data.Datasets on public transportation are typically collected by operators and local government bodies for a range of commercial and planning purposes.The locations are usually limited to those of bus and metro stops and the records commonly consist of check-ins and -outs using smart cards at the first and last stop of a public transport trip [86].The STM Montreal dataset [31]

Algorithms
The 'traditional' approaches to generating synthetic mobility data are based on extracting and combining particular mobility aspects from data to generate time- and location-based components of trajectories. This is often achieved by approximating distributions of trip lengths, speed, start locations, start times, or start-end-locations [45,66,69,75,93,102] and drawing samples from those. For instance, the beginning element of a trajectory can be generated by drawing independent random samples from the distribution of start locations and start times, respectively. In order to preserve a certain level of mobility behavior in the synthetic data, transition probabilities (and other information) are often estimated with, e.g., Markov chains, higher-order Markov models (computing $n$-grams and prefix trees) or hidden Markov models [19,32,34,42,44,48,106,117]. Typically, all such 'model-based' approaches are rather complex: they make assumptions on types and semantics of human mobility [11,58] and extract and combine different kinds of information from the data, usually relying on methods such as sampling, clustering, and addition of noise in order to achieve a certain level of privacy [11,28,48,67,100,102,106].
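As a minimal sketch of this family of approaches, the following code fits a start-location distribution and first-order transition counts on toy trajectories and samples new sequences from them. All names (`fit_markov_model`, `generate_trajectory`) and the toy cell identifiers are assumptions for illustration; the sketch omits time, higher-order context, and any privacy mechanism, and does not reproduce a specific reviewed model.

```python
import random
from collections import Counter, defaultdict

def fit_markov_model(trajectories):
    """Count start locations and first-order transitions between locations."""
    starts = Counter()
    transitions = defaultdict(Counter)
    for traj in trajectories:
        if not traj:
            continue
        starts[traj[0]] += 1
        for current, nxt in zip(traj, traj[1:]):
            transitions[current][nxt] += 1
    return starts, transitions

def sample_from_counts(counts, rng):
    """Draw one key with probability proportional to its count."""
    keys = list(counts)
    return rng.choices(keys, weights=[counts[k] for k in keys], k=1)[0]

def generate_trajectory(starts, transitions, max_length, rng):
    """Sample a start location, then follow transitions until a dead end."""
    traj = [sample_from_counts(starts, rng)]
    while len(traj) < max_length and transitions.get(traj[-1]):
        traj.append(sample_from_counts(transitions[traj[-1]], rng))
    return traj

# Toy trajectories over hypothetical grid cells A-D (illustration only).
toy = [["A", "B", "C"], ["A", "B", "D"], ["B", "C"]]
rng = random.Random(0)
starts, transitions = fit_markov_model(toy)
synthetic = [generate_trajectory(starts, transitions, 4, rng) for _ in range(5)]
```

Every generated transition has been observed in the input, which illustrates why such models preserve local movement behavior but can also reproduce rare, potentially identifying movements unless noise or other safeguards are added.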
Viewed over time, the first deep learning architectures appear as modeling approaches from 2017 onwards, increasingly replacing the hitherto predominant statistical and probability-theoretical models. The utilized deep learning architectures are mainly inspired by existing approaches to generate sequential data from human language such as sentences or documents. RNN architectures like LSTMs [52], well known for training language models and thus able to predict probable words or characters given a sub-sequence, have been adopted by considering locations analogous to words and, correspondingly, trajectories to sentences [10,13,22,61,118]. The generation of new sequences then starts by providing context consisting of one or more locations to predict the next one in the sequence, and then iteratively expanding the sequence by adding newly generated synthetic locations to the context for the next location generation.
The model generates a sequence until, for example, an 'end of trajectory' character has been produced or a pre-defined length has been achieved.
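The iterative generation procedure can be sketched as the loop below. The callable `next_location_probs` stands in for a trained RNN or LSTM; here it is replaced by `toy_model`, a hand-written lookup table that only looks at the last location. The `END` token, the function names, and the probabilities are all assumptions for illustration.

```python
import random

END = "<END>"  # hypothetical end-of-trajectory token

def generate_sequence(next_location_probs, seed, max_length, rng):
    """Extend `seed` one location at a time until END or max_length."""
    sequence = list(seed)
    while len(sequence) < max_length:
        probs = next_location_probs(sequence)  # dict: location -> probability
        locations = list(probs)
        weights = [probs[loc] for loc in locations]
        nxt = rng.choices(locations, weights=weights, k=1)[0]
        if nxt == END:
            break
        sequence.append(nxt)  # the new location becomes part of the context
    return sequence

def toy_model(context):
    """Stand-in for a trained network: depends only on the last location."""
    table = {
        "home": {"work": 0.7, "shop": 0.3},
        "work": {"shop": 0.4, "home": 0.4, END: 0.2},
        "shop": {"home": 0.6, END: 0.4},
    }
    return table[context[-1]]

trajectory = generate_sequence(toy_model, ["home"], 6, random.Random(42))
```

A real recurrent model would condition on the whole context rather than the last element, but the surrounding loop, with its two stopping conditions, would look the same.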
GANs are probably an even more obvious architecture for synthesizing trajectories. A GAN consists of two networks, a generator $G$ and a discriminator $D$. The generator attempts to learn the distribution of the data, while the discriminator learns to decide whether a sample record comes from the training set or was produced by $G$. During the training, $G$ should thus learn to maximize the probability that $D$ makes a mistake. Synthetic data is then produced by the trained generator that has learned an approximation of the trajectory distribution [43].
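For reference, the adversarial training described here is commonly written as a two-player minimax problem. The formulation below assumes the original GAN objective, with $p_{\text{data}}$ the distribution of real trajectories and $p_z$ a noise prior (both symbols are introduced here for illustration); individual reviewed models may use other variants of this loss.

```latex
\min_{G} \max_{D} \; V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

The first term rewards the discriminator for recognizing real samples, the second for rejecting generated ones; the generator only influences the second term and tries to make it small.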
Other generative approaches such as (variational) autoencoders are also found in the coded literature [23,53,64,94,121,125]. These types of neural networks learn to encode a set of data using a compressed representation (as latent vectors or parameters of a pre-specified distribution in the latent space) from which the original data can be reconstructed without losing too much important information (and thus reducing noise when learning the lower-dimensional representation).
Since most approaches are inspired by existing architectures developed for problems on text or image data, they require certain preprocessing steps to match data format requirements. The main differences occur in (1) the handling of spatial points and (2) the treatment of time information.
This removes an important aspect of mobility data as temporal patterns can be highly relevant, for example, the information at what time a location is visited most or if street segments are jammed at certain times.Approaches that include a notion of time usually use one of three options: (1) For each user a location is provided for each fixed time interval, like every hour [10,11,39,81,84,103,122] (this is only possible if there is continuous data of each user that allows such an assignment); (2) a start time is assigned to each trajectory and, if timestamps are created for consecutive points, a constant sampling rate is used [17,32,64]; (3) a timestamp is modeled for each point within a trajectory, mostly reduced to the hour of day [4, 5, 32, 67, 75, 80, 92-94, 100, 112].Some models also consider the stay duration at a location as temporal information [5,42,109].
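Option (1), one location per fixed time interval, can be sketched as follows. The function `hourly_locations`, the choice of the most frequent location per slot, and the toy records are assumptions for illustration; reviewed models may fill slots differently, for example by interpolation.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta

def hourly_locations(records, start, hours):
    """Return one location per hour slot: the most frequent one, or None."""
    per_slot = defaultdict(Counter)
    for timestamp, location in records:
        slot = (timestamp - start) // timedelta(hours=1)
        if 0 <= slot < hours:
            per_slot[slot][location] += 1
    return [
        per_slot[s].most_common(1)[0][0] if s in per_slot else None
        for s in range(hours)
    ]

# Toy records for one hypothetical user (illustration only).
start = datetime(2024, 1, 1, 8)
records = [
    (datetime(2024, 1, 1, 8, 5), "home"),
    (datetime(2024, 1, 1, 8, 50), "home"),
    (datetime(2024, 1, 1, 9, 10), "bus"),
    (datetime(2024, 1, 1, 11, 30), "work"),
]
slots = hourly_locations(records, start, 4)
```

The `None` entry for the slot without any record shows why this representation requires near-continuous tracking: sparse data leaves many intervals without an assignable location.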

Privacy
The release of fine-grained mobility data presents significant privacy concerns, as evidenced by numerous scientific and journalistic investigations (cf., e.g., [27,30,104]).It thus appears not too surprising that, of the 51 encoded publications, 43 explicitly mention privacy as a motivation for synthesizing mobility data.In fact, in the majority of the coded literature, privacy serves as the primary or sole motivation.Additionally, it is argued that synthetic data can be used to augment small existing real datasets, e.g., to improve training and evaluation of machine learning models [25,53].
Those not mentioning privacy stem mainly from the field of urban planning, which initially employed mobility models based on predefined rules rather than real-world observations to simulate human movement [50]. Over time, these models have become more sophisticated and are now incorporating real-life data to generate more realistic mobility patterns that can be used for simulations and what-if analyses [60]. For these models [7,16,28,58,80,81,84,109], the main concern is how well the generated data represents a greater population. However, overall, privacy is still the main motivation for the generation of synthetic mobility data and around half of the encoded papers include a formal privacy guarantee or a privacy evaluation, which we summarize in this section.
A formal privacy guarantee provides some level of certainty that the synthetic dataset differs from the original data to a given degree and does not reproduce the original data too precisely.Privacy evaluations can be used alternatively or in addition to privacy guarantees to examine to what extent models preserve privacy under certain types of attacks.
Privacy guarantees. Differential privacy (DP) [35] is almost exclusively used among the coded publications if a guarantee is provided. Broadly speaking, differential privacy guarantees that the output of an algorithm remains nearly unchanged if the data of one individual is removed or added to the dataset. In this way, differential privacy limits the impact of a single individual on the algorithm's outcome, preventing the reconstruction of an individual's data.
Differential privacy comes with a parameter, usually denoted by $\varepsilon$, which defines the 'privacy budget' and captures the extent of potential privacy leakage. A typical method of ensuring differential privacy is adding noise from a suitable noise distribution to the data. Roughly speaking, the smaller the privacy budget, the stronger the privacy protection but the more the utility suffers, since a smaller budget usually requires a higher amount of noise. Formally, an algorithm $A$ provides $\varepsilon$-differential privacy if for any two datasets $D_1$ and $D_2$ differing in at most the data of one individual, and for any set $S$ of possible outputs of $A$, $\Pr[A(D_1) \in S] \le e^{\varepsilon} \Pr[A(D_2) \in S]$.
Differential privacy was originally designed to provide privacy of aggregations, e.g., query counts or histograms, by adding noise to the counts in a differentially private manner. To make use of differential privacy for synthetic data generation, most of the models obfuscate the learned (latent) distributions by applying differentially private noise; synthetic trajectories are then generated by sampling from obfuscated distributions [19, 20, 22, 32, 42, 44-46, 48, 67, 69, 75, 93, 94, 102, 107, 122]. Recently, models with local differential privacy have been proposed [34,117], where data is perturbed locally, typically on users' devices, and the untrusted collector and aggregator can only access the noisy records.
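The idea of obfuscating a learned distribution and then sampling from it can be sketched with the Laplace mechanism on a histogram of start locations. The sketch assumes that every individual contributes exactly one start location, so that adding or removing one person changes a single count by at most 1; the function names and the toy counts are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def dp_start_distribution(start_counts, epsilon, rng):
    """Noisy, normalized start-location histogram (sensitivity assumed 1)."""
    noisy = [c + laplace_noise(1.0 / epsilon, rng) for c in start_counts]
    # Clipping and normalizing are post-processing steps on the noisy counts.
    noisy = [max(0.0, v) for v in noisy]
    total = sum(noisy)
    if total == 0.0:
        return [1.0 / len(noisy)] * len(noisy)
    return [v / total for v in noisy]

# Hypothetical counts of trips starting in four grid cells.
rng = random.Random(0)
probs = dp_start_distribution([120, 45, 30, 5], epsilon=1.0, rng=rng)
synthetic_starts = rng.choices(range(len(probs)), weights=probs, k=10)
```

A smaller `epsilon` increases the noise scale `1 / epsilon`, which distorts small counts first; this is one concrete way the privacy budget trades off against utility.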
The authors of [64] and [118] both propose neural networks that make use of differentially private stochastic gradient descent [2]. In every model update, the gradients of all model parameters are clipped and Gaussian noise is added to the clipped gradients. Alatrista et al. [4] use the architecture of DP-GAN [51], which is a novel approach to introduce DP into GAN architectures by extending the classical GAN setting to a three-player game with an additional classifier that aims to label data with respect to differential privacy constraints. The authors of [13] also consider privacy in their LSTM architecture, albeit without any privacy guarantees or privacy evaluations. They choose one of the top-k predictions uniformly at random for next-point generation to mitigate the learning of a 1-to-1 representation of the raw data.
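A single update of differentially private stochastic gradient descent can be sketched as follows, assuming the commonly used variant in which each example's gradient is clipped to an L2 norm bound before Gaussian noise is added to the sum. The function `dp_sgd_step` and its parameters (`clip_norm`, `noise_multiplier`) are illustrative names; the sketch does not track the accumulated privacy budget, which a real implementation would need to do.

```python
import math
import random

def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One noisy update: clip each example's gradient, sum, add noise, average."""
    summed = [0.0] * len(params)
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale the gradient down only if its L2 norm exceeds clip_norm.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            summed[i] += g * scale
    sigma = noise_multiplier * clip_norm  # noise std tied to the clipping bound
    batch_size = len(per_example_grads)
    noisy_mean = [(s + rng.gauss(0.0, sigma)) / batch_size for s in summed]
    return [p - lr * g for p, g in zip(params, noisy_mean)]

# Toy gradients for a two-parameter model; noise disabled to show the clipping.
updated = dp_sgd_step(
    params=[0.0, 0.0],
    per_example_grads=[[3.0, 4.0], [0.3, 0.4]],
    lr=0.1,
    clip_norm=1.0,
    noise_multiplier=0.0,
    rng=random.Random(0),
)
```

With the noise switched off, the first gradient (norm 5) is scaled to norm 1 while the second (norm 0.5) is left unchanged, so no single example can dominate the update.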
The original definition of differential privacy is based on the assumption that every user contributes only one record to a dataset as, for example, in a customer database. This does not necessarily hold for mobility data, as a user can usually contribute an arbitrary number of trajectories, especially if waypoint trajectories are considered. This is a noteworthy issue, as differential privacy mechanisms based on the above assumption would only protect the privacy of a single trajectory (item-level privacy), but not the privacy of a user (user-level privacy) as formulated in the definition given above. Most coded publications that utilize waypoint trajectories do not consider this issue (e.g., [42]). Lestyán et al. [64] at least point out that their approach only provides item-level differential privacy, while Yu [118] provides a solution to guarantee user-level privacy. During the training phase of the neural network, they adjust differentially private stochastic gradient descent such that training does not operate on single records, i.e., trajectories; instead, for each sampled user the gradient of the model parameters is obtained for a batch of the user's data. The second publication that carefully considers user-level privacy is by Mir et al. [75]. They provide differential privacy by applying controlled noise to a set of defined probability distributions, such as the distribution of home locations, and probabilities of a record at each location per hour. For each probability distribution, they evaluate the potential maximum user contribution and apply noise accordingly to satisfy user-level privacy. All other publications only provide user-level privacy under the assumption that each user contributes a single trajectory, which can potentially hold depending on the dataset, likely more so for staypoint trajectories. When models are applied in practice, the structure of the utilized dataset should always be examined in that respect.
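The difference between item-level and user-level protection can be made concrete with a sketch that bounds each user's contribution before adding noise. It assumes a count query over start locations and the Laplace mechanism; capping users at `max_per_user` trajectories and scaling the noise by that cap is one common way to obtain a user-level guarantee, not necessarily the procedure of any reviewed paper. All names and toy data are hypothetical.

```python
import random
from collections import Counter

def bounded_start_counts(trajectories_by_user, max_per_user, rng):
    """Count start locations using at most `max_per_user` trajectories per user."""
    counts = Counter()
    for trajectories in trajectories_by_user.values():
        kept = trajectories
        if len(trajectories) > max_per_user:
            kept = rng.sample(trajectories, max_per_user)
        counts.update(traj[0] for traj in kept if traj)
    return counts

def laplace_scale(max_per_user, epsilon):
    """User-level noise scale: one user shifts the counts by <= max_per_user."""
    return max_per_user / epsilon

def noisy_counts(counts, locations, scale, rng):
    """Add Laplace(0, scale) noise to every location, including empty ones."""
    def laplace():
        return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return {loc: counts[loc] + laplace() for loc in locations}

# Toy data: user u1 contributes three trajectories, user u2 only one.
data = {
    "u1": [["A", "B"], ["A", "C"], ["A", "D"]],
    "u2": [["B", "A"]],
}
rng = random.Random(0)
counts = bounded_start_counts(data, max_per_user=2, rng=rng)
scale = laplace_scale(max_per_user=2, epsilon=0.5)
released = noisy_counts(counts, ["A", "B", "C", "D"], scale, rng)
```

With an item-level analysis the scale would be `1 / epsilon` regardless of how many trajectories a user contributes; the user-level scale grows with the cap, which is the utility cost of the stronger guarantee.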
Bindschaedler and Shokri [11] provide plausible deniability [12], which ensures that there are $k$ alternative trajectories that could have produced a similar synthetic trajectory generated from a seed. This notion is based on the condition that each synthetic trajectory originates from a specific seed trajectory within the real data. As stated by the authors, this privacy guarantee protects against location inference attacks [98] and membership inclusion attacks (which learn whether a particular individual with certain semantic habits has been in the seed dataset).
Privacy evaluations. Privacy evaluations use attack scenarios such as the location inference attacks mentioned above [11,122] to estimate the success of a potential attack with and without privacy enhancement. The trajectory-user linking task identifies users from trajectories and links trajectories to them; thus, a decrease in the accuracy of a state-of-the-art trajectory-user linking algorithm can be interpreted as an increase in privacy [92]. Zhan et al. [121] evaluate re-identification inaccuracy and Zhao et al. [122] evaluate social-relationship-based de-anonymization attacks. Du et al. [34] investigate their approach with regard to a re-identification attack and an outlier attack.
Another approach compares the similarity between real and synthetic traces, assuming that a high similarity implies low privacy. Different methods are used for this comparison: (1) measuring the distance between any real and any synthetic trajectory [10], (2) quantifying how many trajectories in the real dataset exactly match trajectories in the synthetic dataset [100], and (3) measuring the mutual information [22,67]. The authors of [32] design a sensitive location disclosure attack, which indicates how similar the sensitive locations (points) of an original trajectory are to those of synthetic trajectories.
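Methods (1) and (2) can be sketched as below. The helper `endpoint_distance` is a deliberately simple toy distance chosen for illustration; the reviewed papers use their own trajectory distances, and all function names and coordinates are assumptions.

```python
import math

def exact_match_rate(real, synthetic):
    """Share of synthetic trajectories that also occur verbatim in the real data."""
    real_set = {tuple(t) for t in real}
    return sum(tuple(t) in real_set for t in synthetic) / len(synthetic)

def nearest_real_distances(real, synthetic, distance):
    """For each synthetic trajectory, the distance to its closest real one."""
    return [min(distance(s, r) for r in real) for s in synthetic]

def endpoint_distance(a, b):
    """Toy distance: Euclidean start-point distance plus end-point distance."""
    return math.dist(a[0], b[0]) + math.dist(a[-1], b[-1])

# Toy trajectories as lists of (x, y) points (illustration only).
real = [[(0, 0), (1, 1)], [(5, 5), (6, 6)]]
synthetic = [[(0, 0), (1, 1)], [(0, 1), (1, 1)]]
match_rate = exact_match_rate(real, synthetic)
closest = nearest_real_distances(real, synthetic, endpoint_distance)
```

A high match rate or many near-zero nearest distances would suggest that the generator memorized real trajectories; neither measure constitutes a formal privacy guarantee.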
Kulkarni et al. [62] evaluate privacy with a location-sequence attack which shows to what level of accuracy trajectories can be reconstructed, and a membership inference attack which evaluates the adversary's accuracy of inferring whether a target individual contributed to a dataset.
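As an illustration of the simplest similarity-based privacy check listed above, the exact-match count, the following sketch computes the share of real trajectories that reappear unchanged in a synthetic dataset. It assumes trajectories are given as sequences of discrete location identifiers; the function name and the toy data are hypothetical and not taken from [100].

```python
def exact_match_share(real, synthetic):
    """Share of real trajectories that occur verbatim in the synthetic data."""
    if not real:
        return 0.0
    synthetic_set = {tuple(t) for t in synthetic}
    matches = sum(tuple(t) in synthetic_set for t in real)
    return matches / len(real)


real = [["a", "b", "c"], ["a", "c"], ["d"]]
synthetic = [["a", "b", "c"], ["b", "c"], ["d", "d"]]
share = exact_match_share(real, synthetic)  # one of three real trajectories is reproduced
```

A high share suggests that the generator memorizes original trajectories, while a share of zero does not by itself prove that the synthetic data is private.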

Utility evaluation measures
There are mainly two different approaches to evaluate the utility of a synthetic data generation algorithm: (1) downstream tasks and (2) (dis)similarity of raw and synthetic data. The utility is considered high if the synthetic dataset is similar to the original dataset or performs similarly well on the downstream task.
Downstream tasks are a common method for utility evaluation of synthetic data in other areas, such as fraud detection [6] in the financial sector or diagnostics [8,21] for health data. Mobility data, however, is rarely used for similar prediction tasks, making it less straightforward to evaluate on downstream tasks. Nevertheless, a few of the coded papers include such evaluations based on the following tasks: road map updating [17], a task that aims to discover uncharted road segments in digital maps; next-location prediction [56,61,109,121]; a COVID-19 spreading simulation [39,110,112]; and traffic control simulation [56], evaluating how a one-day construction site would change the mobility in that area.
The similarity of raw and synthetic mobility data is more commonly evaluated. While there are standard (dis)similarity measures that have been proven to perform close to a human evaluation of synthetic data such as synthetic images, the evaluation of synthetic mobility data is rather difficult and heterogeneous in the field [9]. To structure the various approaches, we employ the categorization of [9], who consider the following types of evaluation measures for utility: (1) per-instance similarity, (2) visual comparison, and (3) statistical similarity (see Figure 2).
Per-instance similarity measures require the possibility to link a synthetic trajectory to its original counterpart to compare the two trajectories directly. Different authors make use of the Hausdorff distance [22,56,67,92], the average distance between point pairs [53,81,109], or the BLEU (bilingual evaluation understudy) score and its successor, both taken from natural language machine translation [25]. The Jaccard index is used to compute the similarity of activity spaces (i.e., areas individuals move within in the course of the day) between two trajectories [81,92], and dynamic time warping is used to evaluate the reconstruction accuracy between two trajectories [94]. However, most synthesis algorithms are not designed such that a synthetic trajectory has a unique raw trajectory counterpart; thus, per-instance similarity measures cannot be applied. Moreover, for privacy reasons, it is usually not a desired goal to closely reproduce original trajectories 4. From an application perspective, the utility should be considered high if distributions of relevant mobility characteristics of the dataset remain intact.
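Two of the per-instance measures named above can be sketched in a few lines. The example assumes planar coordinates (so that Euclidean distance is meaningful) and represents an activity space as the set of visited tiles; both are simplifying assumptions made for illustration, and the function names are hypothetical.

```python
import math


def hausdorff(traj_a, traj_b):
    """Symmetric Hausdorff distance between two point sequences."""
    def directed(src, dst):
        # Largest distance from a point in `src` to its nearest point in `dst`.
        return max(min(math.dist(p, q) for q in dst) for p in src)
    return max(directed(traj_a, traj_b), directed(traj_b, traj_a))


def jaccard(tiles_a, tiles_b):
    """Jaccard index of two activity spaces given as sets of tile IDs."""
    union = tiles_a | tiles_b
    if not union:
        return 1.0
    return len(tiles_a & tiles_b) / len(union)


raw = [(0.0, 0.0), (1.0, 0.0)]
synthetic = [(0.0, 0.0), (1.0, 0.0), (1.0, 3.0)]
distance = hausdorff(raw, synthetic)          # driven by the extra point (1, 3)
overlap = jaccard({1, 2, 3}, {2, 3, 4})       # two shared tiles out of four
```

Both functions need a known pairing of a raw and a synthetic trajectory, which is exactly the requirement that most synthesis algorithms do not meet.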
Visual comparisons of graphs provide a straightforward approach to comparing such distributions, as done in the coded literature by looking at histograms displaying the number of visits [62,108,122], length or speed of trajectories [23,66], or a selection of trajectories that indicates how well they match the road network [61,108]. While such graphs and maps provide a great intuition about distributions of different mobility characteristics and deviations within the synthetic dataset, they are not suited to compare multiple synthesis approaches precisely and objectively.
Statistical (dis)similarity measures aim to provide single numbers that capture the similarity or dissimilarity of a certain characteristic between two datasets. For brevity, we include both under the term similarity measure. The mobility characteristics that were evaluated in the coded literature are presented in Section 4.4.1, and the statistical similarity measures used to compare them are discussed in Section 4.4.2. Each of these mobility characteristics can be operationalized by a specific feature that represents the mobility characteristic. See Figure 3 for an explanation of the taxonomy used here. Trip lengths are mostly operationalized as the distance of each trip, which can refer either to the straight-line distance from the origin to the destination or to the summed length of the distances between consecutive waypoints.
Alternatively, the sequence length (i.e., number of points) or the trajectory diameter are considered, where the diameter is defined as the distance between the two most distant (not necessarily consecutive) points within a trajectory. Blanco-Justicia et al. [13] further consider the average and maximum distance between consecutive points, as well as the number of distinct locations per trajectory, though they do not compute similarity measures for these statistics.
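The four operationalizations of trip length described above differ noticeably even for a short trajectory. The following sketch computes them side by side, assuming planar coordinates and Euclidean distance for simplicity (geographic data would need a geodesic distance instead); the function names are illustrative.

```python
import math
from itertools import combinations


def od_distance(traj):
    """Straight-line distance from origin to destination."""
    return math.dist(traj[0], traj[-1])


def path_length(traj):
    """Summed distance between consecutive waypoints."""
    return sum(math.dist(p, q) for p, q in zip(traj, traj[1:]))


def diameter(traj):
    """Distance between the two most distant (not necessarily consecutive) points."""
    return max((math.dist(p, q) for p, q in combinations(traj, 2)), default=0.0)


trip = [(0.0, 0.0), (3.0, 4.0), (3.0, 0.0)]
features = {
    "od_distance": od_distance(trip),
    "path_length": path_length(trip),
    "diameter": diameter(trip),
    "sequence_length": len(trip),
}
```

For this toy trip the straight-line distance is 3, the path length is 9, and the diameter is 5, which shows why evaluations based on different operationalizations are not directly comparable.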
As many approaches discard timestamps, the evaluation of temporal distributions is only rarely seen.
In addition to the presented similarity measures that can be applied to compare any two mobility datasets, some measures are only suited for specific approaches because additional information is assumed or needs to be inferred, such as point of interest (POI) categories [92], gender [93], subscription payment plans [93], closest matched road segment [17], proportion of ordinary and express ways [17], inferred home and work locations [10], proportion of traffic violations [102], clusters [28], and inferred friendship between users [122].

Statistical similarity.
A variety of measures is utilized in the coded works to capture statistical similarity between operationalizations of mobility characteristics in real and synthetic data.
The Kullback-Leibler divergence (KLD) [63], also called relative entropy, is a widely used statistic to measure how far a probability distribution $P$ deviates from a reference probability distribution $Q$ on the same probability space $X$. For discrete distributions $P$ and $Q$, it is formally defined as $D_{KL}(P \| Q) = \sum_{x \in X} P(x) \log \frac{P(x)}{Q(x)}$.
For example, the spatial distribution can be evaluated by comparing the share of records in the synthetic data ($P$) per tile ($x$) of the tessellation ($X$) to the share of records computed on the real data ($Q$). The larger the deviation of $P$ from $Q$, the larger the value of the resulting KLD, with a minimum value of 0 for identical distributions. Note that KLD is not symmetric, i.e., in general $D_{KL}(P \| Q) \neq D_{KL}(Q \| P)$, which is why KLD is best applicable in settings with a reference model $Q$ and a fitted model $P$. The lack of symmetry also implies that it is not a distance metric in the mathematical sense.
The Jensen-Shannon divergence (JSD) solves this issue and builds on the KLD to calculate a symmetrical score.
Additionally, JSD provides a smoothed and normalized version of KLD, with scores between 0 (identical) and 1 (maximally different) when using the base-2 logarithm, thus making it easier to interpret the resulting score within a fixed finite range.
It is worth noting that KLD is only defined if $Q(x) \neq 0$ for all $x$ in the support of $P$, while this constraint is not required for JSD. In practice, both KLD and JSD are computed for discrete approximations of continuous distributions, e.g., histograms approximating the number of trips over time based on daily or hourly counts. However, the choice of histogram bins has an impact in two respects. Say we want to compare the number of visits per tile. Depending on the granularity of the chosen tessellation, there might be tiles with 0 visits in the real dataset but more than 0 visits in the synthetic dataset; KLD would not be defined in such cases. Additionally, the resulting values for both KLD and JSD vary according to the choice of bins, e.g., reducing the granularity of the tessellation tends to yield smaller values of KLD and JSD.
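The properties discussed above (asymmetry of KLD, its undefined cases, and the bounded range of JSD) can be checked with a small sketch. It assumes the standard definition of JSD as the average of the two KLDs to the mixture $M = (P + Q)/2$, which is not spelled out in the text, and uses the base-2 logarithm. The visit counts are toy values and the function names are illustrative.

```python
import math


def normalize(counts):
    """Turn per-tile visit counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def kld(p, q):
    """D_KL(P || Q) in bits; raises if Q(x) = 0 for some x with P(x) > 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # terms with P(x) = 0 contribute nothing
        if qi == 0:
            raise ValueError("KLD undefined: Q(x) = 0 where P(x) > 0")
        total += pi * math.log2(pi / qi)
    return total


def jsd(p, q):
    """Jensen-Shannon divergence via the mixture M = (P + Q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kld(p, m) + 0.5 * kld(q, m)


real = normalize([0, 4, 6])        # Q: a tile without visits in the real data
synthetic = normalize([1, 4, 5])   # P: the same tile is visited in the synthetic data
score = jsd(synthetic, real)       # defined, although kld(synthetic, real) is not
```

Calling `kld(synthetic, real)` on these toy counts raises an error because the first tile has no real visits, which is the situation described above for fine tessellations.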
Neither KLD nor JSD accounts for the distance between instances in the probability space $X$. Consider the spatial distribution of a mobility dataset that differs from another dataset only in the values of two tiles. In one case these two tiles are direct neighbors; in a second case they are far apart. Intuitively, the two distributions that differ in neighboring tiles are more similar, but JSD and KLD yield the same value in both cases. In contrast, the earth mover's distance (EMD) between two empirical distributions allows taking the underlying geometry of the space into account.
The EMD is proportional to the minimum amount of work required to convert one distribution into the other [65].
The amount of work is determined by the defined distance between instances (e.g., tiles or histogram bins); thus, it allows for an intuitive interpretation. For example, if the EMD of two spatial distributions is computed based on the geographic straight-line distance between tile centers in meters, an EMD of 100 signifies that on average each record of the first distribution needs to be moved 100 meters to reproduce the second distribution. On the downside, there is no fixed range as for the JSD, which provides values between 0 and 1. Thus, the EMD always needs to be interpreted in the context of the dataset, and the EMD of different datasets cannot be compared directly.
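The sensitivity of the EMD to the underlying geometry can be shown with a one-dimensional simplification. The sketch assumes equally spaced bins along a line (for example, tiles along a corridor with a hypothetical spacing of 100 meters), in which case the EMD reduces to the summed absolute difference of the cumulative distributions. Spatial distributions on a two-dimensional tessellation require solving a transport problem instead, which is not shown here.

```python
def emd_1d(p, q, bin_width=1.0):
    """EMD between two histograms over equally spaced bins on a line."""
    work, carried = 0.0, 0.0
    for pi, qi in zip(p, q):
        carried += pi - qi              # mass that must be moved past this bin
        work += abs(carried) * bin_width
    return work


reference = [1.0, 0.0, 0.0, 0.0]
shifted_near = [0.0, 1.0, 0.0, 0.0]     # all mass moved to the neighboring bin
shifted_far = [0.0, 0.0, 0.0, 1.0]      # all mass moved three bins away

near = emd_1d(reference, shifted_near, bin_width=100.0)
far = emd_1d(reference, shifted_far, bin_width=100.0)
```

Here `near` is 100 and `far` is 300, i.e., the average distance in meters each record has to be moved, whereas KLD and JSD would not distinguish the two cases.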
The mean squared error (MSE) is a common error measure, computing the average squared difference between observed (i.e., real) and predicted (i.e., synthetic) values. Unlike KLD and JSD, it is not explicitly designed for probability distributions.
The cosine similarity is also used, but less commonly. The general idea is to consider two vectors as similar if their angle is small, i.e., both point in a similar direction, while two orthogonal vectors have zero similarity, and the corresponding distributions would be considered unrelated. It is frequently applied in text analysis to, e.g., measure document similarity, having the advantage of working well even with very sparse data such as document representations based on word counts. The tendency of mobility data to produce sparse distributions as well makes cosine similarity a good candidate for this context, too [47].
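Both measures operate on plain vectors, for instance per-tile visit counts in a fixed tile order. A minimal sketch under that assumption follows; the vectors are toy values and the function names are illustrative.

```python
import math


def mse(real, synthetic):
    """Average squared difference between paired values."""
    return sum((r - s) ** 2 for r, s in zip(real, synthetic)) / len(real)


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


real_counts = [10, 0, 0, 5]
synthetic_counts = [20, 0, 0, 10]   # same shape, twice the volume
error = mse(real_counts, synthetic_counts)
similarity = cosine_similarity(real_counts, synthetic_counts)
```

The example shows a practical difference: cosine similarity ignores the overall volume (it is close to 1 here), while the MSE penalizes it.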
Kendall's $\tau$ coefficient, also known as the Kendall rank correlation coefficient, is a measure of the strength and direction of association between two variables measured on an ordinal scale. It returns a value between -1 and 1, where 1 indicates identical rankings, 0 no association, and -1 completely reversed rankings; the strength of association is determined by the pattern of concordance (ordered in the same way) and discordance (ordered differently) between all pairs. In the case of mobility data, it can, for example, be used to measure discrepancies in locations' popularity ranking [45], where the popularity of a location $l$ is defined as the number of times $l$ is visited by trajectories in the considered dataset.
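The popularity-ranking comparison can be sketched as follows. The example assumes the simplest variant of the coefficient (often called tau-a), in which tied pairs count as neither concordant nor discordant; the coded publications may use a tie-corrected variant. Function names and toy trajectories are illustrative.

```python
from collections import Counter
from itertools import combinations


def popularity(trajectories, locations):
    """Number of visits per location, in the order given by `locations`."""
    visits = Counter(loc for traj in trajectories for loc in traj)
    return [visits.get(loc, 0) for loc in locations]


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / pairs


locations = ["a", "b", "c"]
real = [["a", "b"], ["a", "c"], ["a", "b"]]
synthetic = [["a", "b", "a"], ["a", "b"], ["c"]]
tau = kendall_tau(popularity(real, locations), popularity(synthetic, locations))
```

In this toy case both datasets rank the locations in the same order (a, then b, then c), so the coefficient is 1 even though the absolute visit counts differ.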

ASSESSMENT AND COMPARISON
As described in Section 4.1, mobility data can take various forms and represent different aspects of human mobility.
Most of the coded works formulate, mostly rather implicitly, assumptions regarding trajectory semantics or additional knowledge typically required to implement a certain algorithmic idea or to achieve solid performance for a certain type of use case. These additional constraints, however, are also a major criterion for the choice of method in practice and additionally limit the comparability of existing approaches. We identified the 'granularity level' of mobility as a suitable criterion to compare the coded methods within the respective groups from an application perspective, helping practitioners to identify the most suitable approach for their use case. More specifically, in an iterative process we defined the following three categories and assigned the synthesis methods to one of them based on careful reading and inspection of motivational examples, used datasets, and model assumptions: (1) trips, (2) user movements, and (3) city population. We find these categories to provide a suitable starting point to narrow down the selection of potential methods, given a dataset and a desired application 5. Some approaches are too generic or do not provide enough information to group them according to one of our categories, and have thus been classified as 'unspecified'. Table 3 presents all coded publications grouped by these categories.
A taxi trip comprises the typical understanding of single-trip mobility: a trajectory lasting a couple of minutes where an origin and destination location are connected with a route following the street network. In contrast, a trajectory in terms of user movements is understood as a sequence of semantically meaningful stay locations over one or multiple days, for example, recorded by social network check-in data like the Foursquare dataset. The city population category utilizes trajectories to create representative mobility data for a (group of the) citywide population, as for example queried in a survey. The original trajectory data is thereby not necessarily seen as the ground truth but only as a (biased) piece of information next to other data sources used for a realistic mobility representation. In addition to the categorization, this chapter provides a basic understanding of each method and, if possible, compares them to further facilitate the choice of the algorithm within a category.
Figure 4 presents an overview of the coded works, sorted by the publication year and indicating the number of citations. The works of Chen et al. [19,20] are among the first and most widely cited approaches in this field (308 and 297 citations according to Google Scholar, 06.03.2023) and have been used as benchmarks by multiple authors [32,45,64].
As many further publications build on this rather general method, we start by introducing the main idea.
The authors initially [20] proposed a prefix tree structure to store the counts of each sequence of a given length, with noise added to ensure differential privacy. Synthetic trajectories are then reconstructed from the noisy prefix tree. This was later extended [19] to an approach commonly referred to as Ngram by allowing variable-length $n$-grams; their main purpose is to prevent the quick decrease in utility (due to the implemented DP mechanism) as the noisy prefix tree grows, which is achieved by pruning the prefix tree. The algorithm is potentially applicable to any type of trajectory, though it does not scale well with long trajectories and many unique locations, thus being better suited for shorter staypoint trajectories (like the public transport smart card data used for evaluation in the respective publication).
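The general idea of a noisy prefix tree can be illustrated with a strongly simplified sketch. It is not the algorithm of Chen et al. [19,20]: it assumes a fixed maximum depth, an equal split of the privacy budget across tree levels, Laplace noise on every candidate child (observed or not), and a hypothetical pruning `threshold`; it only provides item-level protection. All names and parameter values are illustrative.

```python
import random
from collections import Counter


def laplace(scale, rng):
    # The difference of two exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_prefix_tree(trajectories, locations, max_depth, epsilon, threshold, rng):
    """Return {prefix: noisy count} for all prefixes that survive pruning."""
    true_counts = Counter(
        tuple(t[:d]) for t in trajectories
        for d in range(1, min(len(t), max_depth) + 1)
    )
    # Assumption: one trajectory changes one count per level, budget split evenly.
    scale = max_depth / epsilon
    tree, frontier = {}, [()]
    for _ in range(max_depth):
        next_frontier = []
        for prefix in frontier:
            for loc in locations:  # every candidate child gets noise
                child = prefix + (loc,)
                noisy = true_counts.get(child, 0) + laplace(scale, rng)
                if noisy >= threshold:  # prune children with small noisy counts
                    tree[child] = noisy
                    next_frontier.append(child)
        frontier = next_frontier
    return tree


def sample_trajectory(tree, locations, rng):
    """Walk the noisy tree from the root, choosing children by noisy count."""
    prefix = ()
    while True:
        children = [prefix + (loc,) for loc in locations if prefix + (loc,) in tree]
        child_mass = sum(tree[c] for c in children)
        # Noisy mass of trajectories ending at this prefix (never negative).
        stop_mass = max(tree[prefix] - child_mass, 0.0) if prefix else 0.0
        if not children or rng.random() * (child_mass + stop_mass) < stop_mass:
            return list(prefix)
        prefix = rng.choices(children, weights=[tree[c] for c in children])[0]


rng = random.Random(42)
data = [["a", "b"], ["a", "b"], ["a", "c"], ["b"]]
tree = noisy_prefix_tree(data, ["a", "b", "c"], max_depth=2,
                         epsilon=1e6, threshold=0.5, rng=rng)
synthetic = sample_trajectory(tree, ["a", "b", "c"], rng)
```

The very large `epsilon` in the usage example makes the noise negligible so that the tree structure is easy to inspect; realistic budgets add enough noise that deep levels are pruned quickly, which is the utility problem the variable-length extension addresses.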

Trips
Models in this category consider mobility data as a collection of trips where each trip consists of a start and an end location that are connected with a route that (more or less) follows the road network.All approaches that require waypoints as input fall into this category.
This (implicit) requirement can be deduced from various algorithmic designs: The models of [42,102,109] are based on the assumption that two consecutive locations stem from neighboring tiles; [17,25,34,108] make use of a fixed sampling rate between all consecutive points; the approaches by [44,58,69] are tailored to connect a previously selected origin and destination location, and [32,102] make use of distributions of origins and trip lengths for trip generation; [34,48,67] rely on length or acceleration, and thus on features that make use of waypoints.
Most approaches focus on reconstructing meaningful spatial sequences and dismiss the temporal dimension. If timestamps are generated, then usually only a start time, since successive locations follow in a short time interval when waypoints are assumed.
Two basic approaches use heuristics and probability distributions without providing any privacy guarantees: (1) FTS (feature-based trajectory synthesis) [66] cuts the trajectories into subtrajectories and resamples them based on heuristics that utilize the computed length, speed, acceleration, u-turn rate, and density. (2) TraG [58] generates synthetic traces by drawing OD pairs from a distribution and then connecting the two points with 'context-aware waypoints'. They define waypoints as stop points from a continuous trace where the moving object stops for longer than a given threshold, which they set to 60 seconds within their evaluation, aiming to detect waiting times due to traffic.
They argue that urban hotspots should correlate with bad traffic and thus longer waiting times. With this approach they reduce the space of potential locations, focusing only on origins and destinations (based on a grid) and intense traffic nodes. If only such hotspots are of interest, this might be a suitable approach. If traffic flows are additionally of interest, this will likely not provide satisfying results.
A set of models strives to create differentially private synthetic trajectories based on Markov models and probability distributions: Roy et al. [93] synthesize bike sharing data but consider only OD trips, meaning each trajectory consists of just two points, which is much simpler than synthesizing entire sequences. He et al. [48] present a system called DPT (differentially private trajectories) based on the work of Chen et al. [19] using a prefix structure. They argue that [19] is only usable for a small domain of locations and introduce a novel hierarchical grid that adapts the resolution based on speed to optimize the number of counts maintained in the prefix tree. DPT is a highly recognized early work in this area with 207 citations (Google Scholar, 06.03.2023) and has served as a popular benchmark [42,44,45,118]. [...] similarity measure), (2) a utility module that contains four categories of error metrics, (3) a front-end web interface.
Liu [69] extended the Ngram approach with OD distributions and trajectory length distributions. Additionally, they focus on optimizing the differential privacy budget on the hierarchical structure of the prefix tree using a sparse vector technique. TGM [42] (trajectory generative mechanism) represents trajectories with a graph in which each node stores the prefix of the last k stays and its stay time (understood in terms of how often the same node was also the consecutive node). Possible next steps are limited to the eight neighboring grid cells plus the current cell, considerably reducing the number of parameters of the Markov model. DP-MODR [32] (differentially private mechanism for synthetic moving objects database release) includes an extension, DP-MODRT, that allows the generation of time-dependent locations. To also include temporal information, DP-MODRT additionally disaggregates start domain cells, their median lengths, and the cost matrix according to start time windows. LDPTrace [34] is the first approach to implement local differential privacy. SPRT [102] (Synthesizing Private and Realistic Trajectories) includes the constraint of a geography-aware grid to only allow transitions that resemble the road network. PrivTrace [106] employs first- and second-order Markov chains. The authors claim to thereby reduce the disadvantages of AdaTrace, which fails to obtain enough transition information with only a first-order Markov chain model, and of DPT, which introduces excessive noise due to DP with its high-order Markov chain model. As many of these models use one another as benchmarks, we can compare their performances to a certain degree: According to Gursoy et al., the authors of AdaTrace and DP-Star, both approaches outperform DPT [48] and Ngram, especially in terms of similarity measures capturing OD flows and the diameter of trajectories. The prefix tree approach of the latter two is not well suited to maintain OD relations; also, they find many loops and u-turns in their trajectories.
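The parameter reduction obtained by restricting transitions to the eight neighboring cells plus the current cell, as described above for TGM, can be illustrated with a first-order sketch. It is not the TGM algorithm: it ignores the stored prefix of the last k stays, the stay times, and any differential privacy noise, and it assumes grid cells are addressed as hypothetical `(row, column)` tuples.

```python
from collections import Counter, defaultdict


def neighborhood(cell):
    """The cell itself plus its eight neighbors on a regular grid."""
    row, col = cell
    return [(row + dr, col + dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]


def neighbor_transition_probs(trajectories):
    """First-order transition probabilities, restricted to the 3x3 neighborhood."""
    counts = defaultdict(Counter)
    for traj in trajectories:
        for current, nxt in zip(traj, traj[1:]):
            if nxt in neighborhood(current):  # drop jumps to distant cells
                counts[current][nxt] += 1
    return {
        current: {nxt: k / sum(c.values()) for nxt, k in c.items()}
        for current, c in counts.items()
    }


trips = [
    [(0, 0), (0, 1), (1, 1)],
    [(0, 0), (0, 1), (0, 1)],
    [(0, 0), (5, 5)],          # a jump that the restriction discards
]
probs = neighbor_transition_probs(trips)
```

With this restriction each cell has at most nine transition parameters, regardless of the total number of grid cells, whereas an unrestricted first-order model needs one parameter per pair of cells.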
Ghane et al. [42] criticize that DPT limits a trajectory to its first few points, which cannot represent the trajectory movement pattern properly. They find that their algorithm TGM outperforms DPT in maintaining travel distance and stay times. In terms of performance, they find DPT to be highly inefficient, as it consumes more than 13 GB of memory (vs. 0.26 GB for TGM) and over 673 h of runtime (vs. 233 s for TGM) for Porto Taxi. To properly assess TGM, evaluations of the spatial distribution, OD flows, and trip lengths would be useful, which are lacking in its publication. Also, it would be interesting to see how it benchmarks against AdaTrace/DP-Star.

DP-Loc [64] and DP-MODR both use Ngram and AdaTrace as benchmarks and both find that Ngram outperforms AdaTrace in terms of frequent patterns, while the authors of AdaTrace, on the other hand, find their model to be superior. In the context of frequent patterns, however, it should be noted that both Ngram and AdaTrace tend to create rather short trajectories [64]. The results for the similarity of the spatial distribution are also inconsistent. According to Deldar and Abadi, their model DP-MODR provides better results than AdaTrace for GeoLife and about similar results for Porto Taxi, though they do not mention that their results for AdaTrace deviate strongly from Gursoy et al.'s evaluations on the same dataset (GeoLife) 6 and similarity measures (location popularity ranking, trip error, and diameter error). For example, for $\varepsilon = 0.5$, Gursoy et al. obtain a trip error of 0.048 for AdaTrace, while Deldar and Abadi report a value of 0.61 for AdaTrace and a value of 0.26 for DP-MODR. As the trip error is based on JSD, a smaller value means higher similarity; thus, according to one evaluation, AdaTrace would be superior, and according to the other, DP-MODR. We found similar inconsistencies in the location popularity ranking and the diameter error. Based on the provided evaluation it cannot be determined which approach outperforms the other. The same is true for Liu et al. [69], who also claim to provide a model superior to AdaTrace but do not mention the inconsistency with Gursoy et al.'s results. Both use the same evaluation measures (query error, length error, and diameter error) and the same dataset, Brinkhoff. Even though they explicitly extend the existing Ngram model to consider OD distributions, they, unfortunately, do not include such an evaluation.
PrivTrace outperforms AdaTrace and DPT on three datasets in terms of diameter, length, and spatial distribution, as well as travel patterns according to a set of evaluation parameters, such as histogram bin size and grid size, which result in a higher granularity than those chosen in the AdaTrace paper. It should be noted that, unlike AdaTrace, PrivTrace does not construct origin-destination pairs of synthetic traces directly according to the raw dataset's distribution. Instead, the generation of a trace ends when a virtual end in the Markov chain is reached. Thus, an evaluation of the OD distribution in comparison to AdaTrace would have been desirable. SPRT also outperforms DPT by far and AdaTrace slightly on various measures evaluating mobility characteristics on OD, diameter, length, spatial distribution, and travel patterns. Additionally, as their approach focuses on road network matching, they evaluate traffic violations, which they compute as the share of trajectories that deviate more than 20 meters from the road network. Unsurprisingly, the benchmarks fail with violations between 30% and 50%, while SPRT only produces 4% to 8% of trajectories that violate traffic rules.
A set of models uses deep learning architectures without any privacy guarantees: Huang et al. [53] combine a VAE and a sequence-to-sequence (seq2seq) model to create SVAE (Sequential Variational Autoencoder). The evaluation is limited to comparing the distance between synthetic trajectories and their real counterparts, which is generally smaller than 800 meters. The authors also investigate the diversity of trajectories, pointing out that similarity and diversity oppose one another. It is noteworthy that the approach allows for different representations of the locations, namely as coordinates, grid IDs, and a combination of the two. Their evaluations indicate that using coordinates only is the worst choice.
Badu-Marfo et al. [7] propose a model which is unique in including demographics. The model is split into two components, a tabular demographics component and a sequence component, which are learned separately. In particular, this implies that demographics and trajectories are modeled separately. According to their evaluation, the distributions of trip lengths and route segment usages are well preserved, though no information is provided on how the route segment usage is determined. According to the paper, it is to be assumed that staypoints of the used survey dataset are synthesized and a routing algorithm creates routes between the staypoints. Thus, the waypoints are not created by the proposed model; only OD pairs are. An analysis of OD distributions is not included.
Wei et al. [109] propose MoveSD (human movement with system dynamics), which models agents' decision processes with a generator that learns a movement policy. The possible actions are limited to those that are defined ahead, i.e., turn right, turn left, u-turn, go straight, or stop. The evaluation is based on the downstream task of next-location prediction. The authors' motivation is not directed at providing privacy; instead, they aim to build a model of human mobility, and synthetic trajectories are rather perceived as a byproduct for evaluation purposes.
TrajGAIL (trajectory generative adversarial imitation learning) is proposed by Choi et al. [25]. Although they motivate their work with privacy issues, they do not include any privacy guarantees or analyses in their work. Like [109], possible actions are restricted to a provided action set. Unlike previously stated models that are based on mapping points to grid cells, TrajGAIL maps coordinates to a road network and can thus provide synthetic data which is in accordance with the road network instead of centroids of grid cells. On the downside, it is only designed for a chessboard-like road network. More actions could be defined, though this would be non-trivial for diverse road network layouts. As Cao et al. [17] point out, TrajGAIL bases the decision for the next location on the current location only.
TSG (Two-Stage GAN) by Wang et al. [108] uses one GAN to capture spatial patterns of trajectories on a grid and a second to fine-position these to ensure realistic coordinates on the road network. Thus, unlike most grid-based approaches, TSG explicitly includes road mapping. The evaluation includes screenshots of selected road segments that suggest a superiority in road network accuracy compared to other approaches. Unfortunately, they did not quantify road network accuracy with a similarity measure. The evaluation also lacks OD and travel pattern analyses.
Cao et al. [17] present TrajGen, where spatial and temporal information are learned in two separate steps: the spatial information learning is formulated as an image generation problem by mapping trajectories to image pixels. Points on generated images are matched onto road segments of a given map (i.e., OpenStreetMap), which are used to create a sequence with a sequence-to-sequence model. Thus, like [108], they include map matching for plausible trajectories in terms of a road network. The start time of each trajectory is determined with an artificial neural network that takes the length and the first location of a trajectory as input and outputs the time slot with the highest probability. Thus, it is one of the few 'trips' models that considers the temporal component. According to their evaluation, the spatial distribution per hour of day is well maintained. It is, however, noteworthy that the evaluation is conducted on a taxi dataset where each trajectory spans the entire daily ride of a taxi, with a median of 307 per trajectory. Thus, as only the start time of the entire trip is generated according to the spatial distribution, it is unlikely that this would correctly reproduce spatio-temporal patterns over the entire day considering this data input. The authors do not visualize or quantify the deviation of the spatial distribution over the course of a day; therefore, it is possible that the spatial patterns do not change a lot and those are learned by the model. It would be interesting to extend the evaluation to other measures and datasets. Also, single taxi rides might be better suited than daily taxi trajectories.
Xiong et al. [110] developed TrajSGAN, which is the first to model travel mode and purpose, using a semantic-guiding GAN with a CNN discriminator to determine how 'real' the generated 2-D image appears. TrajSGAN outperforms LSTM-TrajGAN and DeltaGAN based on the MTL Trajet dataset of GPS trips, though we consider the two benchmark approaches part of the 'user movements' category and thus not necessarily suited for this type of data. Jiang et al. [56] propose TS-TrajGen, which aims to provide spatial continuity so that points do not 'jump' in space as well as match the road network. Specifically, a two-stage GAN is built, based on prior domain knowledge of human mobility integrated with a model-free learning paradigm. According to their evaluation, they outperform SVAE, MoveSim, TSG, and TrajGen by far, covering mobility characteristics of distance, radius of gyration, location frequency, and per-instance similarity measures. However, the very small per-instance errors, which compare raw trajectories with synthetic counterparts with similar origin and destination, give rise to privacy concerns, especially as a privacy evaluation is lacking.
Only two approaches provide differential privacy guarantees in combination with a deep learning stack: Yu [118] proposes DeepSynthesizer, an LSTM architecture that includes a start and stop symbol within the vocabulary to enable the model to learn distributions of origins and destinations. DP-Loc (Differentially Private Synthetic Trace Generator) was recently proposed by Lestyán et al. [64], which also includes the temporal component of trajectories: a VAE is used to generate OD pairs together with a start time. Transition probabilities are learned with a feed-forward network to approximate the distribution of all paths between the source and destination (at a given time). They include extensive evaluations and compare their results with Ngram and AdaTrace, finding that DP-Loc outperforms AdaTrace on all similarity measures while Ngram is superior in terms of spatial density and similar with respect to frequent patterns. It should be noted that the trip length is only evaluated in terms of the number of points per sequence and not according to the actual distance.

General considerations:
The spatial granularity of trip data is a relevant utility consideration from a practitioner's perspective. For example, a resolution of 500 m × 500 m cells might be sufficient to identify highly frequented neighborhoods, but too coarse for tasks that require mapping to the street network, like determining which streets are preferred routes of cyclists. With differentially private mechanisms, high dimensionality typically comes with high levels of noise, which also applies here: according to the analysis of [45], the utility decreases due to increased noise for grids finer than 11 × 11 cells, and [34] even find this effect for grids finer than 6 × 6 for three of their evaluated datasets 7. As the covered area varies greatly depending on the city the dataset originates from, which ranges from small towns like Oldenburg to huge cities like Beijing within the evaluations, it is difficult to translate the grid resolution into kilometers. To obtain an idea of the scale, consider an area of 15 km × 15 km, which roughly covers the denser part of a large city like Berlin but not nearly the area of the GeoLife dataset in Beijing. It would need 30 × 30 cells of 500 m, and grid sizes of 11 × 11 (6 × 6, respectively) would signify a cell edge length of 1.4 km (2.5 km), questioning the usefulness of such trip data. This utility-privacy trade-off is tackled in different ways: Adaptive grids are used by AdaTrace and PrivTrace. DPT and TGM derive the grid resolution based on the dataset's sampling rate. LDPTrace, TGM, TrajGAIL, and SPRT limit the number of possible transitions to neighboring cells; thus, the dimensionality of the transition matrix is immensely reduced. DP-Loc only uses the top visited cells that cover 95% of all visits and drops the remaining cells.
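The scale argument above is simple arithmetic and can be reproduced for other study areas. The sketch below assumes a square area and a regular grid; the function names are illustrative.

```python
def cells_per_side(extent_km, cell_edge_m):
    """Number of cells along one side of a square area."""
    return extent_km * 1000 / cell_edge_m


def cell_edge_km(extent_km, cells):
    """Edge length of one cell when the side is split into `cells` cells."""
    return extent_km / cells


extent_km = 15                                   # side length of the example area
needed = cells_per_side(extent_km, 500)          # cells per side at 500 m resolution
edge_11 = round(cell_edge_km(extent_km, 11), 1)  # edge length for an 11 x 11 grid
edge_6 = round(cell_edge_km(extent_km, 6), 1)    # edge length for a 6 x 6 grid
```

For the 15 km example this yields 30 cells per side at 500 m resolution, and cell edges of about 1.4 km and 2.5 km for the 11 × 11 and 6 × 6 grids, matching the figures given above.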
SPRT additionally addresses the issue of plausibility with regard to matching the street network by removing all implausible cells and transitions from the transition matrix. Without considerations of privacy guarantees, street mapping is further addressed by the following approaches: reinforcement learning implementations like [25,109] use street segments as location representations, but they are limited to a pre-defined action set (i.e., 'left', 'right', etc.); thus, they would require additional adjustments to be able to map non-uniform road networks. Two GAN approaches [108,110] use a CNN-based discriminator to learn plausible images, TrajGen [17] makes use of OpenStreetMap to map locations onto the closest road segments, and TS-TrajGen [56] uses an adjusted A* path search algorithm to ensure spatial continuity and includes topological constraints to enforce road matching.
Table 4 provides an overview of mobility characteristics considered for statistical similarity evaluations. Next to the spatial distribution, relevant mobility characteristics for trips include OD flows, trip length, and the chosen route, i.e., the travel pattern. Accordingly, many of the coded publications within this category consider respective similarity measures. As many models for trips omit the temporal dimension, we rarely see evaluations of the (spatio-)temporal mobility characteristic, although this would entail relevant information such as traffic volume per hour of day or average speed per road segment. User patterns are not considered in any of the corresponding works; they are also likely not relevant for many use cases that consider trip data.
[Table 4, partially recovered rows. OD flows: [17,32,34,44,45,48,56,64,102,118], [10]. Travel patterns: [25,32,34,42,44,45,48,64,102,106,118], [5], [19]. User patterns: [39,112,125], [5,10,84].]

User movements
A different perspective considers mobility data as a sequence of stay locations of a user, potentially over a longer period of time. Commonly considered datasets are LBSN data (e.g., Foursquare) or mobile phone GPS traces. Realistic datasets should entail users that re-visit the same locations multiple times, and travel distances between locations should be reasonably small. The palette of algorithms for user movements is diverse, especially in terms of deep learning approaches. Many are directed at a specific type of mobility dataset or application. The heterogeneity of applications and datasets makes it rather difficult to compare the approaches with one another. Specifically, useful similarity measures include the activity space (e.g., radius of gyration), number of daily visited locations per user, I-Rank, or stay durations per location.
Bindschaedler and Shokri [11] published a model in 2016, commonly referred to as SGLT, which has been widely known and cited since (197 citations on Google Scholar, 06.03.2023). For each user, represented by a single trajectory in their approach, a mobility model is computed that encompasses their spatio-temporal behavior (e.g., speed or duration of visit), and all models are combined into an aggregate mobility model. A seed trace, i.e., a trajectory from the original data, is transformed into a 'semantic seed' by replacing each location with its semantic class, such as 'home'. The semantic seed in combination with the aggregate mobility model provides the base to create a synthetic trajectory that is semantically and geographically plausible. The authors provide the privacy guarantee of plausible deniability. As [45] showed, SGLT is extremely slow, which aligns with the authors' statement that one trajectory takes about two minutes to be generated; it thus does not scale well for any larger dataset (only 30 user trajectories are used within their evaluation).
Dandekar et al. [28] use smart card data, modeled as staypoints, to simulate trajectories in communities of commuters.
The communities are identified by clustering and for each community, a weighted directed graph is built to generate synthetic trajectories.
Zhao et al. [122] choose a rather different approach: they propose W³-tess, which synthesizes privacy-preserving traces and enhances their plausibility by incorporating information from social networks. They argue that friends have a strong influence on one's mobility, for example, when meeting at a restaurant; therefore, social locations where users are geographically in contact with their friends are included in their model. This is an interesting approach as it tries to capture the complexity of people's behavior beyond home and work locations. They compare their approach to SGLT [11] and find that W³-tess better preserves social behavior. In terms of spatial distribution, the two perform highly similarly, with W³-tess showing slightly better results. Even though the temporal dimension is part of the model, there is no information on how it is operationalized (likely similar to time windows in SGLT), and there is no evaluation of the temporal distribution.
Rao et al. [92] propose LSTM-TrajGAN, an LSTM-based GAN, which, unlike most other approaches, does not discretize the geodata input but instead models location coordinates as continuous variables.
In addition to coordinates, LSTM-TrajGAN takes the attributes user ID, weekday, hour of day, and location category as input. The location category refers to points of interest, thus staypoints, such as 'food', 'gym', 'outdoors or recreation' in the utilized dataset. Unlike most other approaches, the weekday is also considered as temporal information next to the hour of day. Spatial similarity is only computed on a per-instance level; statistical similarity is only evaluated for temporal properties. It is worth noting that the generator in LSTM-TrajGAN uses not only a noise vector but also original trajectories as input, thus trying to mimic a specific trajectory when generating a synthetic one. In particular, the resulting synthetic trajectory has the same length as the original one. It is further worth noting that in the generation phase, the model requires the original data to produce synthetic ones. (It is thus worth discussing whether the model classifies as a synthetic mobility data generation model in the sense of our definition.)
Similar to Cao et al. [17], Ouyang et al. [80] propose a GAN based on a CNN that represents trajectories as images.
Each pixel contains information about the location, the start time, and the duration of the visit. A pixel can be visited up to a fixed number of times by an individual (i.e., there are multiple layers of the same 2D space). Unlike other approaches, this model puts a special focus on stay durations and includes evaluations of the spatio-temporal distribution. Note that this model consequently requires input data entailing stay durations. The publication does not provide information on whether it ensures strictly increasing arrival times within the trajectory. Additionally, even though the authors emphasize the importance of the semantic sequence and point out that their model captures the entire trajectory for generation, there is no utility evaluation of travel patterns.
Xu et al. [112] propose DeltaGAN, another GAN-based approach for stay locations which especially focuses on continuous time, meaning strictly increasing arrival times of events happening at irregular time intervals, and on time-conditioned location generation, thus creating location visits depending on the time of day. Among other benchmarks, it compares itself to Ouyang et al. [80] and outperforms it by far in terms of total trajectory distance per day, radius of gyration, and number of unique locations per daily trajectory, all of which are different measures of user patterns. It also performs better on the same measures proposed by Ouyang et al.
Zhou et al. [125] created STULIG (Semi-supervised Trajectory-User Linking model with Interpretable representation and Gaussian mixture prior), which is designed for check-in data. The paper is rather focused on tackling the trajectory-user linkage issue and thereby also generates synthetic trajectories.
Tamura et al. [103] propose Agent2Vec, an approach inspired by Word2Vec. User trajectories are divided into 30-minute time intervals with corresponding locations. Locations are clustered using k-means with k = 4 to represent the clusters 'residential', 'office', 'restaurant' and 'other'. A Word2Vec approach creates for every user a vector representation that captures the user's tendency to move and stay (over the course of the day). Synthetic data is created based on the clusters, stay densities in cells, and travel distance distributions. According to the authors' evaluation, their approach still needs improvements in terms of maintaining travel distances.
Chiesa and Taraglio [23] focus on staypoints and stop durations using a VAE model. They neither benchmark against other models nor include similarity measures; instead, they rely on visual comparison of distributions of real and synthetic data.
As mentioned at the beginning of this section, the proposed models for user movements are rather heterogeneous in their approaches and thus difficult to compare: some focus on a specific type of dataset, like CDR data [75], a specific mobility topic, such as commuting patterns [28] or stay durations [80], or include specific aspects, e.g., social behavior [122]. Only three [11,75,122] provide privacy guarantees, two other approaches include a privacy evaluation [92,125], while the rest of the publications do not further consider privacy. Time is considered by almost all publications to some degree, and many include evaluations concerning the spatio-temporal distribution (see Table 4). Unfortunately, some provide only little information on how time is operationalized. Only three publications consider user patterns, and none evaluate travel patterns in terms of frequent sequences. This is striking, as maintaining such patterns should be a central element for models in this category, which is also argued by many of the authors. Thus, for model validation, a respective evaluation would be helpful. If it is only of interest to maintain the spatio-temporal characteristic in the generated dataset, essentially the information on how many people are at a location at a certain time, the effort of creating plausible sequences could be omitted.

City population
Unlike synthesizing a single (likely not representative) mobility dataset, the goal of this setting is to create representative mobility data for a city population over the course of a day (as stay trajectories), for example as input for traffic models.
Mobility datasets are often just one piece of information used in those approaches. Additional information includes census data and models about mobility behavior like the exploration and preferential return mechanism [101]. Also, many models rely on the deduction of home and work locations [10,16,75,84] and make assumptions about commuting behavior. Thus, it should be noted that the performance of these models likely differs for unusual mobility behavior, for example, during the COVID-19 lockdowns. As elaborated in Section 4.3, many of these models do not motivate their work with privacy concerns but rather stem from the research field of traffic modeling. Thus, only one includes differential privacy guarantees [75], two coded publications include a privacy evaluation [10,100], and one further publication uses privacy to motivate their work [5].
Because less sparse CDR data produces temporally more fine-granular trajectories, the WHERE and DP-WHERE approach [54,75] might no longer be suited to generate CDR data, as the single records are limited to work and home locations of users. When considering this model, its suitability for more fine-granular data should therefore be tested beforehand.
3W (WHO-WHERE-WHEN) follows the same idea of assigning synthetic persons a home and work location and sampling points according to temporal patterns. However, instead of only considering home and workplace as locations, they define an action space around these two locations, and for every hour of a one-month period a location is sampled: either 'home', 'work', or a random sample from the action space. Privacy is evaluated by comparing the similarity between every trajectory pair. As there is no dependency between consecutive points, it is not surprising that the authors find the approach needs further development to also reproduce individual characteristics of human mobility in addition to characteristics on a population level.
Bwambale et al. [16] create a synthetic population based on household and census data. Then, travel survey data and aggregated CDRs are used for trip generation modeling.
The following models consider dependencies in the sequence of locations. Pappalardo and Simini [84] propose DITRAS, which works in two steps: first, a diary is generated (temporal pattern) with abstract locations based on a Markov chain (non-parametric); second, the abstract locations are replaced with geographical information using a weighted spatial tessellation and an exploration and preferential return model (parametric). The authors state that exploration and preferential return comes with the downside of overestimating long-distance trips; they also envision the usage of more complex typical diaries. Digital Twin Travellers [5] is also based on Markov models and aims to recreate realistic daily schedules of sequences of staypoints. They benchmark against DITRAS [84], which, according to their evaluation, is outperformed on all measures: start time distribution, duration distribution, number of trips, frequent patterns ('tour network'), spatial error, distance traveled, activity space, and mobility entropy. Pang et al. [81] (also [82,83]) use a reinforcement learning approach and additionally include information on context features, namely the number of offices, the number of employers, the number of schools, the number of evacuation facilities, the number of amusement facilities, the length of roads, railway stations, and the residential density. They use fixed time intervals which can span only seconds or longer periods; within their evaluation, 30-minute intervals are used, focusing rather on staypoints than waypoints. Privacy is evaluated by measuring the distance between any two trajectories.
For this category, it is especially relevant to maintain the spatio-temporal distribution; this is evaluated by 4 of the 7 publications. Additionally, trip length and user patterns are evaluated by a subset of publications. Generally, the authors of [5,84,100] include extensive evaluations and provide detailed information on (dis)advantages. Bwambale et al. [16] and Pang et al. [81] require specific additional user input, which makes them only suitable if the specific prerequisites are given. For an in-depth assessment of Berke et al. [10], further evaluations are needed. A comparison between their RNN approach and the other approaches of this category would provide interesting insights.

Unspecified category
Next to Chen et al. [19,20] (as already introduced earlier), all other publications within this category are generic or do not provide enough information to classify them according to one of our categories.For example, [61,62,118] use generic RNN architectures that can potentially take any arbitrary sequence as input.
Li et al. [67] propose a model which first applies a trajectory generalization step by grouping all spatial points at each timestamp with k-means clustering. To provide differential privacy, it uses the exponential mechanism to select the partitions when constructing new trajectories, and finally adds Laplace noise to the counts of trajectory sequences.
k-means requires the number of clusters k to be pre-defined, though the authors do not elaborate on how this can be set mindfully. For their evaluation, k is set to 50, which appears to be a rather coarse setting. The short evaluation limits the possible conclusion on the model's utility.
Chen et al. [22] propose RNN-DP, an RNN-based approach which is mainly directed at an entirely different scenario of handling real-time trajectory data instead of the full historical data. Still, they include a section about trajectory release which basically follows the same clustering approach as [67].
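To illustrate the last step of this kind of pipeline, the following minimal sketch adds Laplace noise to counts of trajectory sequences. It is not the implementation of [67] or [22]: the function names, the example counts, and the sensitivity of 1 (each individual changing a single count by at most one) are assumptions made for illustration, and the appropriate sensitivity depends on how trajectories contribute to the counts.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample from Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def noisy_counts(counts, epsilon, sensitivity=1.0, seed=None):
    """Return a copy of `counts` with Laplace(sensitivity / epsilon) noise on each value."""
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    return {seq: c + laplace_noise(scale, rng) for seq, c in counts.items()}


# Hypothetical counts of generalized location sequences (cluster IDs).
sequence_counts = {("c1", "c2"): 120, ("c2", "c3"): 45, ("c3", "c1"): 3}
print(noisy_counts(sequence_counts, epsilon=1.0, seed=42))
```

A smaller epsilon yields a larger noise scale, which affects rare sequences (such as the count of 3 above) far more strongly than frequent ones.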
Sakuma et al.'s CANDAR [94] uses a Seq2Seq autoencoder model that adds differentially private noise in the latent space. Next to [53,92], this model is one of the few that do not discretize locations but instead keep a continuous representation. The GPS smartphone data used for evaluation is preprocessed by grouping it per user and day; then trajectories with fewer than 5 points are removed, and those with more than 10 points are split again, so that the maximum length of a trajectory is 10. Given this preprocessing, it is difficult to determine a suitable real-life dataset that would match these data characteristics.
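The described preprocessing can be sketched as follows. This is a hedged reading of the description above, not the authors' code: the record layout, the function name `preprocess`, splitting into consecutive chunks, and keeping short remainder chunks are all assumptions, since the publication's exact splitting rule is not stated here.

```python
from collections import defaultdict
from datetime import datetime


def preprocess(records, min_len=5, max_len=10):
    """Group (user, timestamp, lat, lon) records per user and day, drop short
    trajectories, and split long ones into consecutive chunks of at most max_len."""
    groups = defaultdict(list)
    for user, ts, lat, lon in records:
        groups[(user, ts.date())].append((ts, lat, lon))

    trajectories = []
    for key in sorted(groups):
        points = sorted(groups[key])  # order points by timestamp
        if len(points) < min_len:
            continue  # trajectories with fewer than min_len points are removed
        for start in range(0, len(points), max_len):
            # Assumption: remainder chunks are kept even if shorter than min_len.
            trajectories.append((key, points[start:start + max_len]))
    return trajectories


# Hypothetical input: 23 points for one user-day, 3 points for another.
demo = [("u1", datetime(2023, 1, 1, 8, m), 52.5, 13.4) for m in range(23)]
demo += [("u2", datetime(2023, 1, 1, 9, m), 52.5, 13.4) for m in range(3)]
print([len(points) for _, points in preprocess(demo)])
```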
Kulkarni and Garbinato [61] propose a basic LSTM model that takes a sequence of discretized locations together with selected time-series features (which are not further specified) and outputs a sequence of locations. According to the authors, the ordering of commonly visited places is not well preserved by their model. Generally, RNNs could be used for any type of trajectory, though this implementation is not well suited for the category trips, as it uses a pre-defined fixed-length sequence as input and output; thus, neither distributions of OD pairs nor trip lengths are likely to be well maintained. Also, there is no information on how the initial location to initiate the generation process is determined.
Using the actual initial locations is a privacy issue, while taking random locations is a utility problem. Blanco-Justicia et al. [13] propose a similar model based on a bidirectional LSTM; thus, the same shortcomings apply. To ensure that the model does not reproduce original trajectories too closely, they collect the top-k predictions for the next point and choose one of them uniformly at random. In 2018, Kulkarni et al. [62] also compared different implementations of RNNs, GANs, and copulas, though providing only limited information on model descriptions and evaluations.
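The sampling strategy of choosing uniformly among the highest-ranked next-location predictions can be sketched as below. The function name, the dictionary of predicted probabilities, and the value of k are hypothetical and only serve to illustrate the idea; the sketch does not reproduce the model of [13].

```python
import random


def sample_from_top_k(probabilities, k, rng):
    """Pick one of the k most probable next locations uniformly at random."""
    ranked = sorted(probabilities, key=probabilities.get, reverse=True)
    return rng.choice(ranked[:k])


# Hypothetical model output: predicted probability per candidate next cell.
predicted = {"cell_A": 0.50, "cell_B": 0.30, "cell_C": 0.15, "cell_D": 0.05}
rng = random.Random(0)
print([sample_from_top_k(predicted, k=2, rng=rng) for _ in range(5)])
```

Note that the uniform choice deliberately ignores the predicted probabilities within the top k, which trades some utility for less faithful reproduction of the training trajectories.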
Zhan [121] recently proposed LSTM-PAE, an LSTM-based location privacy protection mechanism via representation learning and adversarial learning, to learn a privacy-preserving feature extraction encoder.They focus on the trade-off between utility (measured as the accuracy of next-location prediction) and privacy (measured as re-identification risk).
While such a detailed comparison is an appreciated approach, the paper lacks information on data preprocessing such as how discrete locations are obtained for the one-hot encoding or how trajectories are transformed into a pre-defined sequence length.Even though the model takes one-hot-encoded timestamps as input, the time representation of the synthetic data remains unclear.The utility is only evaluated in terms of prediction accuracy.
Alatrista-Salas et al. [4] propose a differentially private GAN and compare this to a regular GAN, providing only little information about the model and its evaluation. PrivTC [117] is a locally differentially private model that does not require specific trajectory semantics. It is evaluated with the Gowalla LBSN dataset as well as a taxi dataset and outperforms Ngram based on query error, frequent pattern similarity, and distance error.

DISCUSSION AND CONCLUSION
Generating synthetic urban mobility data holds much potential, but it is a complex task with no one-size-fits-all solution that could serve this heterogeneous field. This is also reflected in the diversity of the reviewed models. However, the absence of a clear definition for high utility in synthetic datasets makes model comparison challenging. There are no straightforward downstream tasks suited for evaluation; instead, mobility data is used for a variety of non-standardized analyses and models that pose different requirements to the data, reflected by different statistical distributions, potentially entailing complex interactions. Thus, a single valid definition of high utility can hardly be established. In addition, varying dataset formats make standardized evaluations even more difficult. For example, a temporal representation is either omitted completely, modeled as a start timestamp, or based on fixed time windows.
It is likely impossible to create a model that preserves all distributions equally well while also maintaining privacy.
We suggest that researchers developing new models clearly state their intended use cases and provide meaningful evaluations covering all relevant mobility characteristics with at least one measure each. Details on implementations of utility evaluations would be desirable, such as utilized grid resolutions for spatial aggregations and histogram bin sizes for all other aggregations. Such choices are often not stated or can only implicitly be deduced from the implementation.
This facilitates the results' interpretation, as the utility can easily be increased by the choice of a coarser resolution (for example, the spatial distribution might easily be maintained well on a 2 x 2 grid but not nearly as well on a 250 x 250 grid). Without respective details, practitioners do not have sufficient information to decide whether the reported utility satisfies their needs. A granularity standard can hardly be given in general as it highly depends on the intended use case. Also, the provision of the model's source code and generated synthetic datasets would facilitate the evaluation and comparison with existing models. Additionally, this field of research would benefit from real-world test cases of organizations that provide data and use synthesized trajectories for their actual purposes, which would reveal shortcomings in practical settings. This could also offer valuable insights into the necessary levels of similarity across different mobility characteristics in varying contexts.
While there are specific limitations for each model, there are (so far) also global limitations: many models entirely omit the temporal dimension. If included, it is only considered as the time of day (potentially also the day of the week).
This means that the data does not reflect information on mobility behavior differences over the course of the year, e.g., seasonal differences, event-related mobility behavior (e.g., a sports event), or weather-related behavior adaptions.
None of the publications provides evaluations in relation to the dataset size. This is especially relevant for deep learning-based models that commonly need a lot of data to properly learn latent distributions. Generally, not all publications provide sufficient information about the origin, format, size, and availability of the utilized datasets.
Potential biases or dataset-specific characteristics are rarely stated by authors, even though they might impact the generalizability of the results. More careful considerations of trajectory semantics, such as the type (i.e., waypoints or staypoints) or the length of trajectories, and how many trajectories a user contributes, would be desirable, as this information is crucial for determining the suitability of models for specific contexts.
Privacy is one of the main motivations for synthetic data generation. Yet, many models lack privacy guarantees or privacy evaluations. As recent approaches tend to rely more on deep learning models, they also tend to omit privacy guarantees (see Figure 4). This might not be surprising, as differentially private deep learning has only recently been provided as part of open-source libraries (e.g., TensorFlow Privacy in 2019; see footnote 8). Two deep learning models already make use of differentially private stochastic gradient descent methods [64,118], and likely more models will follow.
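For readers unfamiliar with differentially private stochastic gradient descent, the core update can be sketched in plain Python as follows. This is a generic, simplified illustration of per-example gradient clipping followed by Gaussian noise; it is not the implementation used in [64,118] or in any library, all names and parameter values are assumptions, and the privacy accounting that turns the noise multiplier into an (epsilon, delta) guarantee is omitted.

```python
import math
import random


def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One simplified DP-SGD update: clip each example's gradient to clip_norm,
    sum the clipped gradients, add Gaussian noise, average, and take a step."""
    dim = len(params)
    summed = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        factor = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i in range(dim):
            summed[i] += grad[i] * factor

    sigma = noise_multiplier * clip_norm  # noise std scales with the clipping bound
    batch_size = len(per_example_grads)
    noisy_mean = [(s + rng.gauss(0.0, sigma)) / batch_size for s in summed]
    return [p - lr * g for p, g in zip(params, noisy_mean)]


# Hypothetical toy values: two parameters, two per-example gradients.
rng = random.Random(0)
new_params = dp_sgd_step([0.0, 0.0], [[3.0, 4.0], [0.3, 0.4]],
                         lr=0.1, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
print(new_params)
```

The clipping bounds the influence of any single example on the update, which is what allows the added noise to provide a formal guarantee.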
Differential privacy is no silver bullet and it comes with certain limitations. Protecting the privacy of individuals leads to a utility reduction of data from minority groups [90]. For example, a dataset might only include a small share of children who have a different mobility behavior than, say, commuters, who typically constitute a large subpopulation in mobility data. In such a case, a differentially private synthetic dataset might well resemble commuters' mobility behavior but not the children's movements. It should be noted that deep learning models without any additional privacy guarantee mechanisms likely do not properly learn minority group behavior either [14]. Additionally, differential privacy provides guarantees on the level of items (trips in this context) or users, and not on the level of groups. For example, the well-known incident of the released aggregated Strava data which revealed secret military bases [49] would not have been prevented by differential privacy. Thus, the publication of differentially private synthetic data still requires careful thought about the information that is still entailed.
In summary, for the contribution of meaningful new models, we consider the following aspects as valuable parts of the respective publication: Information about applicable use cases; a detailed description of the input and output data format of the proposed model; datasets and configurations of measures used for evaluation; openly available source code as well as the created synthetic dataset(s). We see the potential for future research in the following aspects: A comprehensive comparison of different models based on a heterogeneous set of datasets and similarity measures would foster use case specific recommendations. Standardized benchmarking tools, such as the Synthetic Data Vault [85] that provides a set of standard datasets and evaluations for tabular data or time series synthetic data generation models, could contribute to better comparability, accelerate the development of new models and create confidence in practitioners' decision processes. We have developed a Python package dp_mobility_report (footnote 9) capable of computing a comprehensive set of measures for various mobility characteristics (so far only considering staypoints), allowing custom configurations of histogram bin sizes and tessellations. This could be extended to provide standard test cases comprising datasets and configurations. Moreover, comparing deep learning and traditional methods would provide meaningful insights, with special regard to the achieved utility concerning the input dataset size. Finally, recent developments in the fusion of differential privacy and deep learning open the way for further research.

Fig. 1. Overview of all steps of the literature search and resulting number of included publications.

Fig. 2. Categorization of utility evaluation measures in the coded literature.

4.4.1 Mobility characteristics. We categorize the different mobility characteristics as spatial distribution, trip lengths, temporal distribution, spatio-temporal distribution, OD flows, travel patterns and user patterns (see first column in Table
comparing the number of trips per hour of day or the stay duration. The interlink between spatial and temporal distributions is mostly analyzed by comparing the visits per location and hour of day or the stay duration per location. OD flows are vital information for many urban mobility analyses, though only eight of the coded papers include such a similarity measure. By travel patterns we refer to the order of locations in a dataset. It is typically operationalized through analyses of empirical distributions of the most frequent subsequences of a given size, e.g., the top 20 subsequences of length 3, or by comparing the rankings. User mobility patterns describe mobility behavior of individuals. They are captured by the comparison of the user activity space (e.g., radius of gyration), number of distinct locations per user (and day), I-Rank (i.e., the frequency of visiting the personal top one or more locations), number of trips per day and user, spatio-temporal visitation patterns per user and hour of day, the mobility entropy (which informs about the degree of predictability of an individual's whereabouts; it characterizes the heterogeneity of individuals' visitation patterns), or the semantic similarity of user movements according to temporal patterns.
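The operationalization of travel patterns via the most frequent subsequences can be sketched as follows. The functions and the overlap score are illustrative assumptions rather than a measure prescribed by any of the coded publications, which use several different variants (e.g., comparing counts or rankings of the top subsequences).

```python
from collections import Counter


def top_subsequences(trajectories, length=3, top_n=20):
    """Count all contiguous location subsequences of the given length
    and return the top_n most frequent ones with their counts."""
    counts = Counter()
    for traj in trajectories:
        for i in range(len(traj) - length + 1):
            counts[tuple(traj[i:i + length])] += 1
    return counts.most_common(top_n)


def top_pattern_overlap(real, synthetic, length=3, top_n=20):
    """Share of the real data's top patterns that also appear among the
    synthetic data's top patterns (hypothetical similarity score in [0, 1])."""
    real_top = {p for p, _ in top_subsequences(real, length, top_n)}
    syn_top = {p for p, _ in top_subsequences(synthetic, length, top_n)}
    if not real_top:
        return 0.0
    return len(real_top & syn_top) / len(real_top)


# Hypothetical trajectories given as sequences of discretized location IDs.
real = [["a", "b", "c", "d"], ["a", "b", "c"]]
synthetic = [["a", "b", "c"], ["x", "y", "z"]]
print(top_subsequences(real, length=3, top_n=1))
print(top_pattern_overlap(real, synthetic))
```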
distributions. The number of averaged values can correspond to single records, e.g., comparing the real observed trip length of a trajectory to its synthetic counterpart, or to histogram bins, e.g., grouping trip lengths into 5 bins and comparing the count of each corresponding bin. Evidently, the range of the MSE varies greatly with the implementation choice. The reviewed literature also uses variations of the MSE such as the root MSE (RMSE) and standard root MSE (SRMSE).
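The dependence of the MSE on the implementation choice can be made concrete with a small example. The trip lengths, bin edges, and helper names below are hypothetical; the sketch only contrasts a record-level MSE with an MSE over histogram bin counts for the same pair of datasets.

```python
def mse(a, b):
    """Mean squared error between two equally long sequences of numbers."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def histogram(values, bin_edges):
    """Count values per bin; bins are half-open except the last, which is closed."""
    counts = [0] * (len(bin_edges) - 1)
    last = len(counts) - 1
    for v in values:
        for i in range(len(counts)):
            in_bin = bin_edges[i] <= v < bin_edges[i + 1]
            if in_bin or (i == last and v == bin_edges[-1]):
                counts[i] += 1
                break
    return counts


# Hypothetical trip lengths (km) of real trajectories and their synthetic counterparts.
real = [1.0, 2.5, 4.0, 7.5, 9.0]
synthetic = [1.5, 2.0, 5.0, 7.0, 9.5]
edges = [0, 2, 4, 6, 8, 10]  # 5 bins

print("record-level MSE:", mse(real, synthetic))
print("bin-level MSE:", mse(histogram(real, edges), histogram(synthetic, edges)))
```

In this toy case the record-level error is non-zero while the bin-level error vanishes, because every synthetic value falls into the same bin as its real counterpart; reported MSE values are therefore only interpretable together with the aggregation that was used.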

Fig. 4. Timeline of all coded publications, displaying the model's name (if existent) or the name of the first author. The size of the bubble indicates how often the publication was cited (based on Google Scholar, accessed 06.03.2023). Models providing privacy guarantees and use of deep learning algorithms are indicated; arrows indicate when a previous model was used as a benchmark.
authors of AdaTrace state that it outperforms Ngram in terms of query error and location ranking, the DP-Loc paper shows a clear superiority of Ngram based on the EMD, and Deldar and Abadi (DP-MODR) find that it depends on  and the grid resolution which approach outperforms the other.Gursoy et al. explain that Ngram prunes sparse entries which leads to a spatial distribution that overestimates frequent intermediate regions and the remaining regions become excessively sparse.
The measures proposed by Ouyang et al. [80], on which DeltaGAN [112] also performs better, are the total stay duration of each location and the visit probability of each location (per time unit).
Feng et al. [39] established the GAN framework MoveSim. The generator consists of two parts: a model-free (i.e., not based on prior knowledge) self-attention mechanism and a model-based part, which provides information about the physical distance of locations, function similarity (which refers to POI category distribution), and historical transition matrix between all locations. To further capture important mobility regularities, the discriminator includes a correction term to regulate temporal periodicity and spatial continuity. Temporal periodicity describes that people tend to visit the same locations at the same hour of the day, and spatial continuity encourages the model to limit travel distance. Like DeltaGAN, they benchmark with Ouyang et al. and show better results for user patterns, specifically total trajectory distance per day, radius of gyration, I-Rank, and the number of unique locations per daily trajectory. They also outperform Ouyang et al. on the stay duration per location and the overall location ranking.
Berke et al. [10] use a text-generation-based model in their recent proposition, putting locations into analogy with letters in a text. They discretize time and geographic space with one-hour time windows and census cells, respectively. Home and work locations are deduced from the data and used as input to train an RNN that generates a cell for each time window. Null values allow the representation of missing values. Either the raw data distribution of home and work locations or publicly available census data can serve as input to generate synthetic data. The utility evaluation is rather short, only comparing trip distance, locations per user, and the proportion of aggregate time spent per location.

Table 1. Datasets used for model evaluations, including size and time range. The dataset size is either provided as the number of users, number of trajectories, or number of records, i.e., single spatial points. * Dataset is publicly available; (*) dataset is available on request. Links are provided in the supplemental materials.
consists of 1.2 million records in the Montreal transportation system.

Table 2. Statistical similarity measures used by the coded publications and categorized by mobility characteristics. Sometimes mobility characteristics are operationalized similarly but named differently, thus different name variations are listed.

Table 3. Overview of coded publications, grouped by category, including the model name, publication year, privacy considerations, representation of time ('?' signifies that no clear information is provided), and applied algorithms. * Source code is available and respective links are provided in the supplemental materials.

Table 4. Overview of mobility characteristics that are evaluated (through similarity measures) by respective publications in each category.
[54] is commonly considered as fixed time windows and a location is assigned for each window and user. While approaches for user movements could potentially work with temporally sparse data, or such that do not include all types of locations (e.g., LBS check-in datasets might not entail home or work locations), this would not suit the aim of the city population category. Original datasets usually consist of mobile phone or household survey data.
Mir et al. [75] proposed DP-WHERE in 2013, which has been widely cited since (157 citations on Google Scholar, 06.03.2023). It modifies the algorithm WHERE (Work and Home Extracted REgions) [54] by adding noise to achieve differential privacy. WHERE and DP-WHERE are tailored to call detail record data and construct cumulative distribution functions as a base to generate new trajectories, using distributions of home locations, commuting distances per home region, work locations, distribution of calls in a day, probabilities of a call at each minute of a day, and probabilities of a call at each location per hour. As mobile phone usage has increased over the past 10 years, CDR data is less sparse.