Analysing Fairness of Privacy-Utility Mobility Models

Preserving individuals' privacy when sharing spatial-temporal datasets is critical to prevent re-identification attacks based on unique trajectories. Existing privacy techniques tend to propose ideal privacy-utility tradeoffs; however, they largely ignore the fairness implications of mobility models and whether such techniques perform equally for different groups of users. The relationship between fairness and privacy-aware models remains unclear, and there barely exists any defined set of metrics for measuring fairness in the spatial-temporal context. In this work, we define a set of fairness metrics designed explicitly for human mobility, based on the structural similarity and entropy of trajectories. Under these definitions, we examine the fairness of two state-of-the-art privacy-preserving models that rely on GAN and representation learning to reduce the re-identification rate of users for data sharing. Our results show that while both models guarantee group fairness in terms of demographic parity, they violate individual fairness criteria, indicating that users with highly similar trajectories receive disparate privacy gains. We conclude that the tension between the re-identification task and individual fairness needs to be considered in future spatial-temporal data analysis and modelling to achieve a privacy-preserving fairness-aware setting.


INTRODUCTION
Understanding human mobility based on locations collected from mobile devices has become a fundamental part of urban and environmental planning in cities [28]. These GPS traces enable the scientific community and policymakers to model citizens' daily mobility patterns (e.g., crowd-sensed car sharing, ride sharing, city bicycle sharing, and RFID-card-based public transportation), or to build predictive algorithms to estimate people's flows and community structure [13]. However, location-based traces corresponding to human mobility, even at an aggregate level, have raised numerous privacy concerns [8,38], mainly when the data contain sensitive and revealing insights about people's identity, behaviour, associations, religion, and more [23]. In the past decades, the research community has examined various ways of ensuring the privacy of mobility traces. Previous work, ranging from k-anonymity [1] and differential privacy (i.e., DP) [35,41] to information-theoretic metrics [32,45], explores scientific guarantees that data subjects cannot be re-identified while the data remain practically useful. More recently, privacy-utility trade-off (PUT) models based on machine learning or deep learning techniques that aim to optimize both privacy and utility (i.e., inference accuracy) have been studied and shown to be superior to the previous approaches [11]. These techniques can be summarized as representation learning [22], generative adversarial networks (GAN) [31,33], reinforcement learning [10,11], etc. In these works, researchers have shown that it is possible to design and implement frameworks that enhance the privacy protection of individual trajectories without a significant reduction of the traces' utility.
A dimension that has been vastly overlooked is whether privacy-preserving algorithms work equally for all users or whether they could lead to the unexpected consequence of protecting the privacy of only a group of people. Indeed, as recent evidence from the broader machine learning domain has shown, systematic discrimination in decision making against different groups has shifted from people to autonomous algorithms [19,24]. In many applications, discrimination may be defined over different protected attributes, such as race, gender, ethnicity, and religion, that directly prevent favourable outcomes for a minority group in societal resource allocation, education equality, employment opportunity, etc. [36]. Similarly, in the context of spatial-temporal data, mobility demand prediction algorithms have been shown to raise similar fairness concerns (see Section 2). Our contributions include:
• We examine the privacy-preserving algorithms in terms of both individual fairness and group fairness on two representative mobility datasets, and show that their deficiencies in accounting for fairness can lead to undesired consequences.
• We systematically discuss why individual fairness and group fairness are competing in the privacy-aware setting.

Fairness in Machine Learning
Literature on fairness in machine learning (Fair-ML) tends to focus on the absence of any prejudice or favoritism toward an individual or group based on their inherent or acquired characteristics [30]. The majority of fairness research strives to avoid decisions made by automated systems being skewed toward advantaged groups or individuals. In [15], the authors proposed a framework for understanding different definitions of fairness through two views of the world: i) we are all equal (WAE, mostly ensuring group fairness), and ii) what you see is what you get (WYSIWYG, mostly ensuring individual fairness). The framework shows that the fairness definitions and their implementations correspond to different axiomatic beliefs about the world, described as two fundamentally incompatible worldviews. A single algorithm cannot satisfy a definition of fairness under both worldviews simultaneously [15].
The most widely adopted metrics for fairness in machine learning are based on the WAE assumption and denoted as group fairness, also known as statistical parity or demographic parity [9]. These metrics aim to ensure independence between the predicted outcome of a model and sensitive attributes such as age, gender, and race. Where strict statistical parity cannot be achieved, Fair-ML concentrates on relaxations of this measure by ensuring that groups defined by sensitive attributes meet the same misclassification rate (equal false negative rate, also known as equalized odds [17]) or the same true positive rate (also known as equal opportunity [17]).
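For illustration, the following sketch (ours, not from the cited works; it assumes binary labels and predictions held in NumPy arrays) computes the per-group selection rate underlying demographic parity and the per-group true positive rate underlying equal opportunity:

```python
import numpy as np

def group_rates(y_true, y_pred, sensitive):
    """Per-group selection rate (demographic parity) and TPR (equal opportunity)."""
    rates = {}
    for g in np.unique(sensitive):
        mask = sensitive == g
        selection_rate = y_pred[mask].mean()            # P(Y' = 1 | S = g)
        tpr = y_pred[mask][y_true[mask] == 1].mean()    # P(Y' = 1 | Y = 1, S = g)
        rates[g] = {"selection_rate": selection_rate, "TPR": tpr}
    return rates
```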
In the context of mobility data and its applications, such as equitable transportation, research attention has also mainly been devoted to group fairness. Transportation equity research heavily employs statistical tests for equity analysis, which are appropriate for discovering unfairness [43]. Such metrics are often defined based on census tract information, which offers an aggregate demographic characterization of the residing population. The authors of [42] defined fairness in terms of a region-based fairness gap and assessed the gap between mean per capita ride-sharing demand across groups over time. The two metrics differ from each other: one is based on a binary label associated with the majority of the sub-population (e.g., white), and the other on a continuous distribution of the demographic attributes. Similarly, the authors of [18] proposed a graph-based approach for integrating group-based (census) information into e-scooter demand prediction. Through the integration of an optimization regularizer, they showed that their model can jointly learn flow patterns and socio-economic factors, returning socially-equitable flow predictions. Hosford et al. [20] investigated the equity of access to bike sharing in multiple cities in Canada. Ge et al. [16] studied racial and gender discrimination in the expanding transportation network companies. These handfuls of recent works all focus on group-based fairness metrics and collective methods (e.g., demand or flow prediction).
On the other hand, individual fairness states that similar individuals should be treated similarly with respect to the target task [9]. For example, in making hiring decisions, the algorithm has to possess perfect knowledge for comparing the "qualification" of two individuals. In most cases, the difficulty with individual fairness lies in the notion of measuring similarity. For example, the authors of [43] used the population and employment density of each city area for achieving individual fairness in bike-sharing demand prediction. The difficulty, again, lies in the fact that there is often a lack of perfect knowledge to determine the similarity in demand between two areas. In broader spatial-temporal data and applications, definitions of mobility similarity are almost non-existent, as is individual fairness for spatial-temporal data. Although previous work in the fairness literature [6] has examined the boundary of fairness and privacy, it has been applied to low-dimensional datasets (e.g., COMPAS) that differ greatly from complex mobility data.
In this work, we offer a new perspective on how to measure individual fairness with metrics grounded in the mobility literature, and we examine its application in assessing the fairness of privacy-preserving algorithms applied to mobility traces.

Privacy Methods for Spatial-Temporal Data
Large-scale human mobility data contain crucial insights for understanding human behaviour but are hard to share in non-aggregated form due to their highly sensitive nature. Decades of research on privacy have examined various anonymization mechanisms for human trajectories [1,35,41]. A mobility privacy study conducted by de Montjoye et al. [8] illustrates that four spatial-temporal points are enough to identify 95% of the individuals at a certain granularity, demonstrating the necessity of anonymization mechanisms against re-identification attacks. Previous work, ranging from k-anonymity [1] and differential privacy [35,41] to information-theoretic metrics [32,45], explores scientific guarantees that the subjects of the data cannot be re-identified while the data remain practically useful. More recently, PUT models based on machine learning, which simultaneously aim to optimize data privacy protection and utility, are emerging.
In these lines of work, researchers have focused on training neural network models that reduce the privacy leakage risk of individual trajectories while minimizing the depreciation of the mobility utility. These models have been shown to be superior to differential privacy techniques. In this paper, we selected two machine learning-based PUT models built on two different strategies, GAN and representation learning, both with promising performance in terms of utility and privacy. These two PUT models mainly focus on temporal correlations in time-series data and aim to reduce the user re-identification risk (i.e., privacy) while minimizing the downgrade in the accuracy of the mobility prediction task (i.e., utility). We describe the details of these two privacy-aware spatial-temporal models:
TrajGAN [33]: an end-to-end deep learning model that generates synthetic data preserving the essential spatial, temporal, and thematic characteristics of the real trajectory data. Compared with other standard geo-masking methods, TrajGAN can better prevent users from being re-identified. TrajGAN claims to preserve the essential spatial and temporal characteristics of the original data, verified through statistical analysis of the generated synthetic data distributions, which is in line with the data utility assessment based on the mobility prediction task in our work. Hence, we train a TrajGAN-based PUT model to evaluate the mobility predictability and privacy protection of the synthetic data generated by TrajGAN.
Mo-PAE [44]: a privacy-preserving adversarial feature encoder. In contrast to TrajGAN, which aims to generate synthetic data, Mo-PAE trains an encoder E that forces the extracted representations f to convey maximal information about data utility while minimizing private information about user identities via adversarial learning. It consists of a multi-task adversarial network that learns an LSTM-based encoder E, which generates optimized feature representations f = E(x) by lowering the privacy disclosure risk of user identification information (i.e., privacy) and improving the mobility prediction accuracy (i.e., utility) concurrently.
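For concreteness, the sketch below is a minimal PyTorch illustration of the adversarial utility/privacy objective described above, not the authors' released implementation; the layer sizes, loss weighting, and alternating update scheme are our assumptions. It shows how an encoder can be trained so that its features support next-location prediction while degrading user re-identification:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """LSTM encoder mapping a location-id sequence to a feature vector f = E(x)."""
    def __init__(self, n_locations, emb_dim=64, hid_dim=128):
        super().__init__()
        self.emb = nn.Embedding(n_locations, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)

    def forward(self, x):                  # x: (batch, seq_len) of location ids
        _, (h, _) = self.lstm(self.emb(x))
        return h[-1]                       # (batch, hid_dim)

class Head(nn.Module):
    """Linear classifier used for both the utility and the adversary branches."""
    def __init__(self, hid_dim, n_classes):
        super().__init__()
        self.fc = nn.Linear(hid_dim, n_classes)

    def forward(self, f):
        return self.fc(f)

def train_step(enc, util_head, adv_head, opt_enc, opt_adv, x, next_loc, user_id, lam=0.5):
    """One alternating update: opt_enc optimizes enc + util_head, opt_adv optimizes adv_head."""
    ce = nn.CrossEntropyLoss()

    # 1) adversary update: learn to re-identify users from the (detached) features
    f = enc(x).detach()
    adv_loss = ce(adv_head(f), user_id)
    opt_adv.zero_grad(); adv_loss.backward(); opt_adv.step()

    # 2) encoder/utility update: predict the next location well while
    #    making the adversary's re-identification task hard
    f = enc(x)
    util_loss = ce(util_head(f), next_loc)
    priv_loss = ce(adv_head(f), user_id)
    enc_loss = util_loss - lam * priv_loss   # the utility vs. privacy trade-off
    opt_enc.zero_grad(); enc_loss.backward(); opt_enc.step()
    return util_loss.item(), priv_loss.item()
```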

FAIRNESS DEFINITION AND METRICS
In this section, we first define the mathematical representation of fairness in spatial-temporal applications before we incorporate it into our analysis.

Formulation of the Problem
In this work, we aim to measure and evaluate the fairness of privacy-preserving algorithms applied to mobility traces. We seek to determine whether these models equally preserve the privacy and inference accuracy of similar users, and whether fairness and privacy preservation can be satisfied simultaneously, laying a theoretical foundation for further research on privacy-preserving fairness-aware mechanisms for human mobility. Both individual- and group-based fairness are discussed.
We first introduce some basic notations and abbreviations utilized in this work: individuals are labelled as u, and if individuals u_i and u_j are similar we write u_i ∼ u_j; sensitive or protected attributes are denoted as S; the raw data without sensitive attributes is denoted as X; Y is the ground-truth label for a specific inference task and Y′ is the predicted one, which depends on X and S. The true positive rate (i.e., TPR, recall, or sensitivity) is utilized to judge the performance of the multi-categorical classifiers, and refers to the proportion of instances that should be predicted as positive and do receive a positive result. TPR is also utilized to assess the quality of the examined models' inference tasks and is denoted as task accuracy.

Individual Fairness
Individual fairness [9] states that individuals who are similar with respect to a specific task should be treated similarly (i.e., y_i ∼ y_j when u_i ∼ u_j) [34]. As mentioned in Section 2.1, the difficulty with individual fairness lies in the notion of measuring similarity.
To measure individual fairness in the context of spatial-temporal data, we need two sets of definitions corresponding to i) the similarity between users' trajectories, and ii) the similarity of the outcome of the PUT models, as well as their generalizability across mobility datasets and PUT models. We define each next.
3.2.1 Similarity of Trajectories. Grounded in the mobility literature [13,29,38], we mathematically denote the notion of trajectory similarity based on i) the structural similarity index of mobility heatmap images, and ii) the entropy of trajectories.
Structural Similarity Index Measure (SSIM): SSIM was initially designed to quantify image quality degradation caused by processing, such as data compression or losses in data transmission, by leveraging the differences between a reference image and the processed image [40]. To apply SSIM metrics in this work, we construct heatmap images from the raw geo-located data following the methodology proposed by [13]. Figure 1 shows some sample heatmap images with spatial granularity coarsening from 50 meters to 900 meters from left to right. These heatmap images structurally represent mobility features extracted from mobility traces, using pixel intensity to encode the frequency of visits to a given area; hence, brighter pixels denote the more frequently visited locations of the user. SSIM has been shown to be a well-suited metric for computing image similarity specifically when applied to mobility heatmap images [13,29]. Unlike the Mean Square Error, the SSIM metric has been shown not to be significantly impacted by changes in luminosity and contrast.
In this work, we formulate the SSIM measure as the perceptual difference between the heatmap images H_i and H_j of two similar users (see the Appendix for full definitions). We then leverage the integrated heatmap image, which combines all user trajectories, to calculate an effective SSIM index (SSIM_eff) that indicates the overall trajectory similarity of users.
The SSIM between an individual heatmap and the integrated heatmap is computed from the local values of the SSIM map, and a mask is utilized to lower the impact of unreached areas; that is, only the swept area in the integrated heatmap image is selected for further analysis.
Hence, the average SSIM value over the selected points is what we define as SSIM_eff. Additionally, as this metric relies on heatmap images, it is highly influenced by the spatial granularity, where each pixel in the image corresponds to a spatial cell of the data. Intuitively, in Figure 1, as the granularity coarsens, the trajectories become blurry and, thus, more similar. The impact of the spatial granularity on the SSIM index is discussed in Section 5.1.1.
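As an illustration of this computation, the sketch below (our simplification, using scikit-image; the grid construction, the rough degree-to-metre conversion, and the swept-area mask are assumptions rather than the exact pipeline of [13]) rasterises visits into a heatmap and averages the local SSIM map over the swept area of the integrated heatmap:

```python
import numpy as np
from skimage.metrics import structural_similarity

def heatmap(points, bounds, cell_m=100):
    """Rasterise (lat, lon) visits into a visit-frequency image (one pixel per grid cell)."""
    lat_min, lat_max, lon_min, lon_max = bounds
    # ~111,000 m per degree is a rough, illustrative conversion
    n_rows = int((lat_max - lat_min) * 111_000 / cell_m) + 1
    n_cols = int((lon_max - lon_min) * 111_000 / cell_m) + 1
    img = np.zeros((n_rows, n_cols))
    for lat, lon in points:
        r = int((lat - lat_min) * 111_000 / cell_m)
        c = int((lon - lon_min) * 111_000 / cell_m)
        img[r, c] += 1                                   # pixel intensity = visit frequency
    return img / max(img.max(), 1.0)

def effective_ssim(user_img, integrated_img):
    """Mean of the local SSIM map restricted to the swept (visited) area of the integrated heatmap."""
    _, ssim_map = structural_similarity(user_img, integrated_img, data_range=1.0, full=True)
    mask = integrated_img > 0                            # ignore cells nobody visited
    return ssim_map[mask].mean()
```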

Entropy of Trajectories (EOTs):
Mobility literature defines the highest potential accuracy of predictability of any individual, termed "maximum predictability" (Π_max) [27]. Maximum predictability is determined by the entropy of a person's trajectory information (e.g., frequency, sequence of location visits, etc.). Hence, similar characteristics of users' spatial-temporal patterns can be captured by leveraging the entropy of trajectories. In this paper, we define four types of entropy to measure trajectory similarity for spatial-temporal applications, denoted as Shannon Entropy (SE), LonLat Entropy (LE), Heatmap Entropy (HE), and Actual Entropy (AE); the integrated entropy of these measures is jointly referred to as EOTs. The latter two are described in more detail below.
iii) Heatmap Entropy (HE): the entropy of the user's heatmap image. In contrast to the aforementioned entropy measures, we define a two-dimensional entropy to quantify the irregularity (i.e., unpredictable dynamics) of the user's heatmap image. The entropy of trajectory heatmap images is calculated using the two-dimensional sample entropy method (SampEn2D) [37]. In a trajectory heatmap image, features are extracted by accounting for the spatial distribution of pixels in m-length square windows.
iv) Actual Entropy (AE): the entropy capturing the entire spatial-temporal order present in a user's mobility pattern. To capture AE, the authors of [38] proposed an actual entropy model using the Lempel-Ziv algorithm. Unlike the other types of entropy, AE depends not only on the frequency of visited locations but also on the order in which the locations were visited and the time spent at each location [38]. In this work, the given area is segmented using structured grids, where each grid cell is initialized as 0. The visited locations and whether the person has reached a cell previously are then tracked: if the person visits an unreached cell, the location is marked as 1, generating a binary time series that characterizes the trajectory.
See the Appendix for full definitions and related equations of these four different entropies.
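To make two of these measures concrete, the sketch below (ours; it assumes trajectories have already been discretised into ordered grid-cell identifiers and uses log base 2) computes the Shannon entropy of visit frequencies and a Lempel-Ziv-style estimate of the actual entropy in the spirit of [38]:

```python
import math
from collections import Counter

def shannon_entropy(cells):
    """Shannon entropy (bits) of the visit-frequency distribution over grid cells."""
    counts = Counter(cells)
    n = len(cells)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def actual_entropy(cells):
    """Lempel-Ziv estimate of the actual entropy of an ordered cell sequence:
    S ~ (n / sum_i Lambda_i) * log2(n), where Lambda_i is the length of the shortest
    substring starting at position i that has not appeared earlier in the sequence."""
    n = len(cells)
    lambdas = []
    for i in range(n):
        k = 1
        while i + k <= n and _occurs_in(cells[:i], cells[i:i + k]):
            k += 1
        lambdas.append(k)
    return (n / sum(lambdas)) * math.log2(n) if n > 1 else 0.0

def _occurs_in(prefix, sub):
    s = len(sub)
    return any(prefix[j:j + s] == sub for j in range(len(prefix) - s + 1))
```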

3.2.2 Similarity of Users.
With the aforementioned definition of trajectory similarity, we mathematically define users with similar trajectories as similar users using two techniques: i) 𝜖-thresholding: setting a threshold 𝜖 to filter similar users based on their trajectories' similarity. To be specific, if the trajectory similarity of u_i and u_j is greater than the threshold 𝜖, that is, sim(u_i, u_j) > 𝜖, this pair of users is selected as similar users, i.e., u_i ∼ u_j.
ii) clustering: grouping similar users together via clustering techniques. We use k-means clustering to cluster users based on their SSIM and EOTs features, and apply the Elbow and Silhouette methods [26] to determine the number of clusters (the k value). Each resulting cluster groups highly similar users together, as sketched below.
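A minimal sketch of this clustering step (using scikit-learn; the feature scaling and the silhouette-based choice of k are our assumptions about a reasonable implementation, with the Elbow check performed visually in practice) could look as follows:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def cluster_similar_users(features, k_range=range(2, 10), random_state=0):
    """Cluster users on their SSIM/EOT feature vectors and pick k by silhouette score.
    `features` is an (n_users, n_features) array, e.g. [SSIM_eff, SE, LE, HE, AE] per user."""
    X = StandardScaler().fit_transform(features)
    best_k, best_score, best_labels = None, -1.0, None
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels
```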

3.2.3 Similarity of Outcome.
To understand whether users with similar trajectories receive similar outcomes from the models, we first need to define mathematically what it means to receive a similar outcome. As the objective of the PUT models is to optimize privacy gain and minimize utility loss, we consider privacy gain and utility gain as positive outcomes. After selecting the similar users, we measure their difference in the privacy gain outcome, ΔP = 1 − P(u_i)/P(u_j), and in the utility gain outcome, ΔU = 1 − U(u_i)/U(u_j). Both ΔP and ΔU contribute to the evaluation of outcome similarity. With the clustering approach, the average pairwise differences of ΔP and ΔU over all the members of each cluster are assessed.
Regardless of the technique for grouping similar users, we argue that ΔP or ΔU satisfies fairness if it is within 1 − 𝜖; otherwise, the PUT model is considered to violate individual fairness for the user pair u_i and u_j. The thresholds for different combinations of SSIM and EOTs are utilized to distinguish similar users and map all users into a list of pairs with trajectory similarity and performance discrepancy. To measure the fairness of the system as a whole for each model and outcome, we report the percentage of user pairs for whom fairness was violated (i.e., violation% or V%). As we will show in our experiments, we set 𝜖 = 0.8, corresponding to users with at least 80% trajectory similarity, which requires the model's outcome to be within a 20% difference between the similar users. The choice of 𝜖 = 0.8 is based on the fairness literature [2,12]. We discuss the impact of this threshold on policy making in the Discussion section of this article.
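The pairwise evaluation described above can be summarised by the following sketch (ours; `similarity` and `outcome` are assumed to be precomputed per-user quantities, and the outcome difference is taken in a symmetric min/max form of the Δ definition):

```python
from itertools import combinations

def violation_rate(similarity, outcome, eps=0.8):
    """Fraction of similar user pairs whose outcome difference exceeds 1 - eps.

    similarity : (n, n) matrix of pairwise trajectory similarity (SSIM- or EOT-based)
    outcome    : length-n array of per-user privacy gain (or utility) under a PUT model
    """
    n = len(outcome)
    qualified, violations = 0, 0
    for i, j in combinations(range(n), 2):
        if similarity[i, j] > eps:                      # u_i ~ u_j: trajectories >80% similar
            qualified += 1
            hi, lo = max(outcome[i], outcome[j]), min(outcome[i], outcome[j])
            delta = 1 - lo / hi if hi > 0 else 0.0      # symmetric variant of the Delta definition
            if delta > 1 - eps:                         # outcomes differ by more than 20%
                violations += 1
    pct_pairs = qualified / (n * (n - 1) / 2)           # "% of pairs" column in Table 1
    v_pct = violations / qualified if qualified else 0.0    # V% column in Table 1
    return pct_pairs, v_pct
```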

Group Fairness
Unlike individual fairness, which leans heavily on the similarity definition, group fairness has been vastly discussed and shares a systematic analysis approach in the broader Fair-ML literature. In this work, we bridge the gap between the standard group fairness metrics and the specific privacy-preserving mechanisms for spatial-temporal data.
Group fairness, also referred to as Demographic Parity [15], states that demographic groups should receive similar decisions, inspired by civil rights laws in different countries [3]. To be specific, group fairness argues that a disadvantaged group (in terms of the sensitive attributes) should receive similar treatment to the advantaged group; that is, the outcome distribution should be (approximately) independent of S, with P(Y′ | S = s) the same for all groups s. It is worth noting that PUT spatial-temporal models are by definition group unaware; that is, a sensitive attribute S (e.g., race or gender) is not an explicit input feature of these models. However, specific demographic groups of users may exhibit certain properties in their mobility behaviour (e.g., students) that could still impact the outcome of the PUT models. For instance, age and employment status can strongly influence people's day-to-day trajectories. A user whose trajectory data is limited to their home and office locations could be highly predictable by the PUT model, yet also highly re-identifiable (with low privacy gain). This means the notion of group fairness in the context of this study is highly dependent on the examined dataset. We elaborate more on this discussion in Section 6.
In order to quantify group fairness in a more statistical manner, a group fairness score (GFS) for spatial-temporal data is calculated as the disparate impact for disadvantaged groups, i.e., the ratio between the average outcome of the disadvantaged group and that of the advantaged group:
GFS = E[outcome | S = disadvantaged] / E[outcome | S = advantaged].
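A sketch of this computation (ours, using pandas; the column names and the choice of the largest subgroup as the advantaged group follow the description in the Appendix but are otherwise assumptions):

```python
import pandas as pd

def group_fairness_scores(df, attribute, outcome="privacy_gain", threshold=0.8):
    """Disparate-impact-style GFS: mean outcome of each subgroup divided by the mean
    outcome of the advantaged (most represented) subgroup, checked against the four-fifths rule."""
    means = df.groupby(attribute)[outcome].mean()
    advantaged = df[attribute].value_counts().idxmax()   # e.g. the "22-27" age group in MDC
    gfs = means / means[advantaged]
    return pd.DataFrame({"GFS": gfs, "fair (>= 0.8)": gfs >= threshold})
```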

EXPERIMENT SETUP
In this section, we describe the datasets we used to evaluate the fairness of PUT models and the steps we took to set up the PUT models for examination.

Datasets
In order to evaluate the fairness of the examined models, we use the two datasets that the original papers used to assess the privacy level of their models.
4.1.1 MDC. This dataset was recorded from 2009 to 2011 and contains a large amount of continuous mobility data for 184 volunteers with smartphones running a data collection software in the Lausanne/Geneva area. Each record of the gps-wlan dataset represents a phone call or an observation of a WLAN access point collected during the campaign [25]. In addition to the trajectory data, MDC includes individual user demographic information: categorical age groups, gender, and employment status. To the best of our knowledge, MDC is the only dataset that has published users' demographic information along with their trajectories.
4.1.2 Geolife. This dataset was collected by Microsoft Research Asia from 182 users over a four-and-a-half-year period from April 2007 to October 2011 and contains 17,621 trajectories [46]. As the Geolife dataset does not include demographic attributes of individuals, we are unable to measure group fairness for this dataset, and our analysis is restricted to the individual fairness dimension.
As mentioned in Section 3.2.1 and shown in Figure 1, as the granularity coarsens, the trajectories become blurry and thus more similar to each other. Figure 2 confirms this observation by illustrating the SSIM- and EOTs-based similarity of all the users for varying spatial granularity for both datasets. As the spatial granularity coarsens, we observe an increase in the SSIM values, with users becoming more similar to each other. Furthermore, as different types of entropy consider different features of the spatial-temporal data, Figure 2 presents the expected similarity of users for the various EOTs-based measures. In addition to the distribution of the entropy values presented in Figure 2 for each dataset, we observe that across both datasets, SSIM along with SE and AE correspond to the most relaxed measures of similarity, while LE and HE correspond to stricter measures of similarity. The corresponding percentage of user pairs that meet each similarity criterion is reported in Table 1.
Table 1. Individual fairness among diverse models and datasets with SSIM and EOTs. % of pairs represents the ratio of the pairs that meet the thresholding requirements. The maximum/minimum instances of each column are highlighted in bold font.

Original Properties of the Trajectory
Before describing the privacy and utility trade-off for mobility trajectories of the PUT models, we first give brief definitions of two popular inference tasks (i.e., user re-identification and mobility prediction), which are also applied to assess the privacy gain and utility decline in the PUT models we discussed. These two inference tasks are named original tasks in this paper, where "original" denotes the nature of the data before being processed by any privacy-aware model. These original tasks are leveraged to assess the native data characteristics in terms of user re-identification (UR) and mobility predictability (MP), respectively. See the Appendix for full definitions.

FAIRNESS ANALYSIS
In this section, we present our analysis of whether the PUT models can be considered fair. To do so, we analyze these models in terms of individual fairness and group fairness. The similarity applied for individual fairness is defined by SSIM and EOTs, while group fairness groups users based on demographic attributes such as gender, age, and employment status.

Individual Fairness
The metrics of trajectory similarity are crucial for quantifying individual fairness. As defined in Section 3.2, trajectory similarity can be quantified by SSIM and EOTs. In this section, we discuss individual fairness with two different similarity quantification approaches. First, similar users are identified directly by 𝜖-thresholding of the SSIM and EOTs metrics. Second, k-means clustering on the aforementioned SSIM and EOTs features is leveraged to group similar users.

Similarity Based on 𝜖-Thresholding. Table 1 presents the individual fairness of different models under the 𝜖-thresholding metrics based on SSIM and EOTs. The thresholds for different combinations of SSIM and EOTs are utilized to distinguish similar users (u_i ∼ u_j) and map all users into a list of pairs with trajectory similarity and performance discrepancy. Based on the fairness thresholding criteria defined in Section 3.2.3, similar users (i.e., user pairs) have at least 80% pairwise similarity between their trajectories. "% of pairs" in the table represents the percentage of user pairs that meet the corresponding metric's threshold requirement. For instance, with the MDC dataset, 36.17% of user pairs have more than 80% similarity under one of the metrics; that is, under that metric, 36.17% of user pairs are qualified for further analysis of outcome similarity.
A user pair is defined to achieve individual fairness when the outcome difference (ΔP or ΔU) between u_i and u_j is within 20%. Table 1 shows the percentage of user pairs that commit a fairness violation (i.e., V% = % of (Δ > 0.2)). For instance, with the MDC dataset under that same metric, only 10.50% and 11.11% of the qualified user pairs violate the fairness criteria in the two original tasks, which implies that individual fairness is achieved, as both V% values are within 20%. In contrast to the original tasks, the two PUT models have V% values that are all higher than 20%; hence, they violate individual fairness. A higher V% indicates that the model causes larger disparities in performance. Values in italics in Table 1 mark the cases where the outcome meets individual fairness (i.e., V% ≤ 20%).
Overall, individual fairness is not achieved by the two selected PUT models, and the unfairness of the privacy gain is generally higher than that of the utility decline. Comparing the two privacy models, TrajGAN achieves a lower fairness violation rate than Mo-PAE for both the privacy gain and utility decline outcomes. For instance, in the MDC dataset, where 45.26% and 29.42% of user pairs commit fairness violations in privacy gain and utility decline respectively under TrajGAN, Mo-PAE reports twice as many fairness violations for both outcomes. While both the Geolife and MDC data exhibit individual unfairness, Geolife is worse for both the PUT models and the original tasks. In both original tasks, Geolife's unfairness rate is as high as 60%, and this inequity is exacerbated by the PUT models. In contrast to Geolife, the performance of MDC in the original tasks conforms to the definition of individual fairness; that is, the performance difference in task accuracy in MDC is within 20% for both the user re-identification and mobility prediction tasks.
Impact of Spatial Granularity on Similarity: After the overall comparison of the threshold metrics, we discuss the model discrepancy when trajectory similarity is based on the SSIM index under varying granularity. As a crucial metric for distinguishing trajectory similarity, the SSIM index can be affected by different parameters, which results in subtle performance disparities in the quantification of individual fairness. The spatial granularity of the trajectory is the most important of these parameters. These disparities can be intuitively observed in the heatmaps (Figure 1). In contrast to the SSIM, spatial granularity has less impact on the different types of entropy; hence, they are not discussed here.
Figure 3 shows the impact of varying spatial granularity on the model discrepancy. A model that achieves individual fairness should show a smaller discrepancy at higher SSIM values. The accuracy of the original tasks and the two PUT models is compared at granularities of 100 meters, 300 meters, 500 meters, and 900 meters. In conclusion, different models have diverse sensitivities to varying granularities. Both original tasks (UR and MP) in the two datasets show an increasing difference with a higher SSIM index, which means they violate individual fairness. For Mo-PAE, individual fairness is met on the MDC data but not on Geolife. Mo-PAE is also the model most sensitive to varying granularities. For instance, when the granularity changes from 100 meters (Figure 3a) to 900 meters (Figure 3d), Mo-PAE has the most obvious change in its line trend on UR (i.e., privacy gain), and the decreasing trend at the 100-meter granularity is lost at 900 meters. Overall, the selection of the SSIM granularity has a significant impact on the judgement of the individual fairness of a model. However, these impacts become subtle when the SSIM is applied to the trajectory similarity distinction, as the user-pair construction reduces the granularity impact to some extent. For the remainder of the analysis, the SSIM granularity is set to 100 meters.

Table 2. K-means-clustering-based individual fairness among diverse models and datasets. The numbers present the percentage of users for whom individual fairness was violated based on their difference in the outcome being greater than 0.2. The fair instances are highlighted in italic font. The maximum/minimum instances of each column are highlighted in bold font.

Similarity Based on K-means Clustering.
As an alternative to the results based on similarity thresholding, Table 2 presents the results of individual fairness based on the clustering technique described in Section 3.2.2.
Applying the Elbow and Silhouette methods, we set the number of clusters (k) to 4 and 5 for MDC and Geolife, respectively. For each cluster, the table reports the percentage of users whose individual fairness was violated for a given outcome and under the various models. The results indicate that the original models, each optimizing a single task (prediction or privacy), are able to meet the individual fairness criteria for the MDC dataset.
We observe that in the case of the Mo-PAE model, the privacy gain exhibits high variation across users in the same cluster. Even in the cases where the model satisfies individual fairness by performing similarly in terms of utility, it still violates it in terms of privacy gain.

Group Fairness
Group fairness states that groups across different sensitive attributes should receive similar outcomes. To be specific, group fairness argues that a disadvantaged group should receive similar treatment to the advantaged group. Figure 4 presents the discrepancy in privacy gain between the two PUT models for different demographic groups, and Figure 5 presents the utility decline. We observe that both Mo-PAE and TrajGAN perform equally for the different gender attributes, as shown in Figure 4a, where the orange boxes (labelled as All Groups) for both are very small. That is, while the privacy gain varies across individuals within the same gender, the models achieve group fairness when grouping individuals by gender. The same observations can be made for age and employment status, where the differences across the classes are bigger than for gender, but group fairness is still achieved as Δ < 20%. Similarly, in Figure 5, we can observe that both models equally meet the group fairness criteria on the utility decline.
In order to quantify the group fairness of the disadvantaged groups in a more statistical manner, the resulting group fairness scores (GFS) are shown in Table 3.
Table 3. Group fairness scores (GFS) of three models with different demographic attributes. GFS ≥ 80% indicates fair treatment of the minority subgroup; GFS < 80% indicates unfair treatment.
In conclusion, except for two age subgroups (i.e., "<21" and ">39") that violate the four-fifths rule, all other subgroups satisfy group fairness. Finally, it is worth noting that the results presented here are highly dependent on the studied dataset, as we discuss in the next section.

DISCUSSION
In this section, we describe the limitations and implications of our work and discuss possible future directions.

Limitation
Despite our efforts, the presented work has its limitations. Firstly, collected mobility datasets are often biased as they only represent the subset of the population who took part in the data collection. In many cases, the users are limited to students or those affiliated with the research team that collected the dataset. This limitation means the examined trajectories are not representative of everyone's mobility behaviour. Furthermore, the demographics of the participants are also limited in terms of age and socio-economic diversity.
Secondly, in our paper, we reported that we did not observe any violation of group fairness across gender, age, and employment level for the examined PUT models. However, we acknowledge that the results presented regarding group fairness are highly influenced by the city and societal structures in which the data were collected. In the case of MDC, users' traces correspond to a level of socio-economic and cultural freedom associated with life in Switzerland. Such observations would indeed differ if we examined other societies, such as those in the United States or Asian countries, where there is a broader socio-economic and gender inequality gap. We also believe the availability of datasets with rich demographic information could enable future work to examine the intersection of individual fairness within demographic groups. Finally, it is worth noting that, unlike online datasets, offline mobility datasets come in limited size due to the great burden data collection imposes on participants, and only a handful of them exist. Although this limitation could impact the generalization of our results (e.g., we cannot claim that Mo-PAE is always fairer than TrajGAN), the methods proposed in this study are generalizable and applicable to other PUT models and across mobility datasets. Indeed, we believe future work could focus on creating a toolkit for computing the spatial-temporal fairness of datasets and models. We expand on the implications of our work next.

Implication
Our paper has multiple important implications. First, our work offers a novel methodology for defining fairness in the context of spatial-temporal datasets. We believe works such as ours will help shape the future roadmap of Fair-ML studies by offering possibilities to measure equity within different systems, such as mobility-based ones (e.g., transportation). The choice of which of the proposed similarity metrics to select for evaluating individual fairness is another critical dimension that can be highly context- and application-dependent. For example, for applications where there is a need for strict fairness measurement, corresponding to the WYSIWYG worldview [15], a strict similarity measure such as the combined entropy (EOTs) could be chosen. In contrast, for applications where the groups are not necessarily equal, but for the purposes of the decision-making process we would prefer to treat them as if they were, a less sensitive similarity measure such as coarse-grained SSIM could be used.
Although our focus in this work was on the fairness analysis of the PUT models, we believe our study can be the first step towards implementing fairness interventions embedded in these models. For example, in-processing approaches rely on adjusting the model during training to enforce fairness goals, which are then met and optimized in the same manner as accuracy. This goal is often achieved through adversarial networks or fair representation learning approaches such as [21], model induction, model selection, and regularization [43]. Of course, designing such mitigation strategies requires access to the underlying architecture of the PUT models, which is most of the time not possible and is in contrast to treating these models as black boxes, as we did in this study.
Regarding the relationship between privacy and fairness, location privacy-preserving mechanisms generally prevent information leakage about protected attributes, and these attributes are also essential to fairness analysis, where they are used to ensure little discrimination against protected population subgroups. This also explains why the PUT models achieve group fairness but not individual fairness, as the sensitive attributes considered by group fairness are themselves being protected. The competing trend between individual and group fairness also implies another interesting trade-off in Fair-ML. From the individual perspective, the re-identification risk and individual fairness are in tension. We believe designing privacy-preserving models to become fairness-aware is a research direction that will receive significant attention in the future.

CONCLUSION
Intuitively, fairness has a close relationship to privacy in machine learning, whether for structured or unstructured data, but the quantitative relationship between them is still unclear. In this paper, we proposed different metrics for measuring individual fairness in the context of spatial-temporal mobility data. We compared different location privacy-protection mechanisms (PUT models) using the defined individual- and group-based metrics. Our results on two real trajectory datasets show that the privacy-aware models achieve fairness at the group level but violate individual fairness. Our findings raise questions regarding the equity of privacy-preserving models when individuals with similar trajectories receive very different levels of privacy gain. We leverage the empirical results of our work to make suggestions for the further integration of fairness objectives into PUT models. In particular, from the individual perspective, the tension between the user re-identification task and individual fairness needs to be considered in future spatial-temporal data analysis and modelling to achieve a privacy-preserving fairness-aware setting.

APPENDIX
7.1 Inference Tasks
Here we list some basic definitions of inference tasks in mobility literature.

7.1.1 User Re-identification Task (UR). The accuracy of the user re-identification task is leveraged to assess the uniqueness of the mobility trajectory. With more and more intelligent devices and sensors being utilized to collect information about human activities, trajectories expose increasingly intimate details about users' lives, from their social life to their preferences. A mobility privacy study conducted by de Montjoye et al. [8] illustrates that four spatial-temporal points are enough to identify 95% of the individuals at a certain granularity. As human mobility traces are highly unique, a mechanism capable of reducing the user re-identification risk can offer enhanced privacy protection in mobility data sharing. The enhanced privacy protection is referred to as privacy gain (or PG) in the PUT models.

Fig. 2. Overview of the SSIM and entropy distributions of trajectories of the MDC and Geolife datasets. Different granularities of SSIM are compared in a row, with granularity ranging from 100 meters to 900 meters.


Fig. 3. The model performance discrepancy when trajectory similarity is based on the SSIM at different granularities. Figures (a) to (d) are the results for the MDC dataset; Figures (e) to (h) are for Geolife. The performance discrepancy (i.e., Performance DIFF) of each model at different granularities is compared in each sub-figure.

Fig. 4. The privacy protection outcome of the PUT models across different demographic groups for the MDC dataset.

Fig. 5. The prediction accuracy outcome of the PUT models across different demographic groups for the MDC dataset.

Fig. 6. Pareto frontier trade-off of utility and privacy on the two datasets. The hollow squares and diamonds present the results of the Mo-PAE models; the solid points present the results of TrajGAN. Blue denotes sequence length SL = 5; black denotes SL = 10.

7.1.2 Mobility Prediction Task (MP). The accuracy of the mobility prediction task is leveraged to assess the predictability of the mobility trajectory. Mobility datasets are of great value for understanding human behaviour patterns, smart transportation, urban planning, public health issues, pandemic management, etc. Many of these applications rely on forecasting the next location of individuals, which in the broader context can provide an accurate portrayal of citizens' mobility over time. For the mobility prediction task in this work, the raw geolocated data or other mobility data commonly contain three elements: a user identifier u, a timestamp t, and a location identifier l. Hence, each location record can be denoted as r_i = [u_i, t_i, l_i], while each location sequence S is a set of ordered location records S = {r_1, r_2, r_3, ..., r_n}, namely the mobility trajectory. Therefore, given the past mobility trajectory S_n = {r_1, r_2, r_3, ..., r_n}, the mobility prediction task is to infer the most likely location l_{n+1} at the next timestamp t_{n+1}. The results of the two PUT models indicate that a bit of mobility prediction accuracy is sacrificed in exchange for higher privacy protection; the sacrificed prediction accuracy is referred to as utility decline in the PUT models.
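For illustration, the sketch below (ours) shows how a single trajectory of location identifiers can be turned into (history, next-location) training pairs for this task; the sequence length of 5 mirrors the SL = 5 setting mentioned in Figure 6 but is otherwise an arbitrary choice:

```python
def make_prediction_samples(trajectory, seq_len=5):
    """Turn an ordered list of location ids [l1, l2, ...] into (history, next-location)
    pairs for the mobility prediction task."""
    samples = []
    for t in range(len(trajectory) - seq_len):
        history = trajectory[t:t + seq_len]     # S = {r_t, ..., r_{t+seq_len-1}}
        target = trajectory[t + seq_len]        # the location l at the next timestamp
        samples.append((history, target))
    return samples
```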
For instance, among the different age groups, the subgroup with ages between 22 and 27 (i.e., "22-27") is regarded as the advantaged group, as it has the dominant number of users among all age groups. The other age groups' GFSs are calculated based on the disparate impact between them and the advantaged group. All GFSs are then compared against the fairness threshold of 0.8 defined in Section 3.2.3; that is, GFS ≥ 80% indicates fair treatment of the disadvantaged group and GFS < 80% indicates unfair treatment. For example, the result for the "28-33" group (GFS = 98.65%) indicates that the model satisfies group fairness for that group, as 98.65% > 80%.